OneFileLLM: One-click integration of massive data sources

One-click data integration makes it much faster to prepare context for an LLM.
Core content:
1. Multiple data source integration, unified format output
2. Support for local files, GitHub, academic papers and other data types
3. Simple installation and intuitive use
The main job of OneFileLLM is to pull content from multiple data sources, output it in a unified format, and organize it as context for an LLM.
Sources include, but are not limited to, local files/directories, GitHub repositories, GitHub PRs, GitHub Issues, ArXiv academic papers, YouTube video subtitles, web documents, and Sci-Hub papers identified by DOI or PMID.
No matter where your data comes from, it is compiled into a single text file that can be pasted straight into an LLM.
OneFileLLM is the Swiss Army Knife of data integration:
- Automatic source type detection: detects the data type from the provided path, URL, or identifier
- Multi-source support: local files/directories, GitHub repositories, GitHub PRs, GitHub Issues, ArXiv academic papers, YouTube video subtitles, web documents, and Sci-Hub papers identified by DOI or PMID
- Multi-format processing: handles multiple file formats such as Jupyter Notebook, PDF, etc.
- Web crawling: extracts the content of linked pages down to a specified depth
- Sci-Hub integration: automatically downloads research papers using a DOI or PMID
- Text preprocessing: produces compressed and uncompressed output, with stop word removal and lowercase conversion
- Auto-copy function: automatically copies the uncompressed text to the clipboard for easy pasting into an LLM
- Token count reporting: reports the number of tokens for both the compressed and the uncompressed output
- XML packaging: wraps the output in an XML structure, which is intended to make it easier for an LLM to understand
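To make the first feature concrete, here is a minimal sketch of how source type detection from a path, URL, or identifier could work. It is not OneFileLLM's actual code: the function name `detect_source_type`, the returned labels, and the DOI/PMID patterns are all assumptions made for illustration.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical patterns for illustration; the real tool's rules may differ.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_RE = re.compile(r"^\d{1,8}$")


def detect_source_type(source: str) -> str:
    """Return a coarse label for a path, URL, or paper identifier."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower().removeprefix("www.")
        parts = [p for p in parsed.path.split("/") if p]
        if host == "github.com":
            # URLs look like /owner/repo/pull/<n> or /owner/repo/issues/<n>
            if len(parts) >= 4 and parts[2] == "pull":
                return "github_pr"
            if len(parts) >= 4 and parts[2] == "issues":
                return "github_issue"
            return "github_repo"
        if host == "arxiv.org":
            return "arxiv"
        if host in ("youtube.com", "youtu.be"):
            return "youtube"
        return "web"
    if DOI_RE.match(source):
        return "doi"
    if PMID_RE.match(source):
        return "pmid"
    if Path(source).exists():
        return "local"
    return "unknown"


print(detect_source_type("https://github.com/jimmc414/onefilellm"))  # github_repo
```

Checking URLs first and identifiers second keeps the rules easy to follow; a production tool would probably need more cases than this sketch covers.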
As you can see, OneFileLLM covers most everyday scenarios, especially those where you need to feed a lot of information into an LLM:
- Research paper analysis: quickly fetch and process academic papers directly from an ArXiv ID or DOI.
- Understanding a code base: programmers just enter the GitHub repository URL to quickly get an overview of the code.
- Video content: for the widely used video site YouTube, subtitles can be extracted and processed directly.
- Long online documents: pages are crawled and downloaded, then copied into an LLM for study.
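Crawling to a specified depth can be pictured as a breadth-first walk that stops following links once the depth limit is reached. The sketch below is only an illustration under stated assumptions: `crawl`, `get_links`, and the toy `LINKS` graph are hypothetical names, and the in-memory graph stands in for real HTTP fetching and HTML parsing.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start: str, max_depth: int,
          get_links: Callable[[str], Iterable[str]]) -> list[str]:
    """Breadth-first visit of pages within max_depth link hops of start."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages (an assumption for this example).
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"]}
pages = crawl("a", max_depth=2, get_links=lambda u: LINKS.get(u, []))
print(pages)  # prints ['a', 'b', 'c', 'd']
```

Page "e" is three hops from the start, so a depth of 2 leaves it out.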
Installing OneFileLLM is simple. Here are the installation steps using the uv package manager:
    # Clone the repository
    git clone https://github.com/jimmc414/onefilellm.git
    cd onefilellm

    # Use UV to install dependencies
    uv pip install -U -r requirements.txt

    # Or create a virtual environment
    uv venv

    # Activate the virtual environment (Windows)
    .venv\Scripts\activate

    # Activate the virtual environment (Linux/Mac)
    source .venv/bin/activate

    # Install dependencies
    uv pip install -U -r requirements.txt
The usage is also very intuitive:
    # Basic use
    python onefilellm.py

    # Or directly pass in the URL/path
    python onefilellm.py https://github.com/jimmc414/onefilellm
The workflow of OneFileLLM is very simple and clear:
The user provides an input URL or path; the tool detects the source type, calls the processing module for that source, preprocesses the extracted text (cleaning, compression, and so on), and finally generates an output file.
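The preprocessing step in this workflow can be sketched as follows. This is a minimal illustration, not the tool's own implementation: `compress_text` and the tiny `STOP_WORDS` set are hypothetical, and a plain word count is used here as a rough stand-in for a real token count.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real tool would
# likely rely on a much fuller list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def compress_text(text: str) -> str:
    """Lowercase the text, keep word-like tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


raw = "The tool detects the type of the source and writes an output file."
compressed = compress_text(raw)
print(compressed)  # tool detects type source writes output file
print(len(raw.split()), "->", len(compressed.split()), "words")  # 13 -> 7 words
```

The compressed form trades readability for a smaller prompt, which is why it makes sense to keep the uncompressed text available as well.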
All output is wrapped in XML tags. This structure can improve an LLM's ability to understand and process the input.
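As a rough idea of what such XML packaging might look like, the sketch below wraps each source in a tag that records where it came from. The tag names (`documents`, `source`), the attribute names, and the decision to escape the content are assumptions for illustration and may not match the tool's real output format.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_source(source_type: str, location: str, content: str) -> str:
    """Wrap one source's text in a tag carrying its type and location."""
    return (
        f"<source type={quoteattr(source_type)} url={quoteattr(location)}>\n"
        f"{escape(content)}\n"
        f"</source>"
    )


def build_output(sources: list[tuple[str, str, str]]) -> str:
    """Join all wrapped sources under a single root element."""
    body = "\n".join(wrap_source(*s) for s in sources)
    return f"<documents>\n{body}\n</documents>"


out = build_output([
    ("web", "https://example.com", "a < b & c"),
    ("local", "notes.txt", "plain text"),
])
print(out)
```

Escaping `<` and `&` keeps the result well-formed XML; whether to escape or to leave source text untouched for readability is a design choice.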
OneFileLLM is a useful tool that greatly simplifies the process of feeding multi-source data into an LLM.
Research, development, and learning often require giving an LLM a lot of structured information. This tool is worth a try and may save you a good deal of time and energy.