Chat With Your Data
Mihai-Cristian Farcaș / 8 February 2025
LLM chatbot for your personal data.

PDF Chatbot using LLM and Streamlit
This post is about taking a new look at your personal data through a chatbot that can answer your questions based on the PDF files you give it. Pretty cool, right?
The chatbot runs on CPU and uses Retrieval-Augmented Generation (RAG) to answer questions from information stored in PDF documents. This is achieved with LangChain for document processing and retrieval, FAISS for vector storage, and Hugging Face's transformers library for running the language model.
Features
- Interactive Q&A: Ask questions related to the contents of PDF files, and get answers generated by an LLM.
- PDF Document Retrieval: PDFs stored in the `docs` directory are loaded and processed for easy access during chats.
- Conversational Memory (to be implemented): The chatbot will maintain a chat history to provide contextually relevant responses within the conversation; one possible approach is sketched after this list.
- Streamlit Chat UI: Simple, intuitive interface using Streamlit, supporting conversation-based interaction with your PDF data.
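Memory isn't wired in yet, but one plausible way to add it is with LangChain's `ConversationBufferMemory`, sketched below. Here `llm` and `vectorstore` are placeholders for the pipeline-backed LLM and FAISS index built in the Code Overview section further down; this is an illustration, not the repo's code.

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# `llm` and `vectorstore` stand in for the objects built in the
# Code Overview sketch below.
memory = ConversationBufferMemory(
    memory_key="chat_history",  # key the chain reads past turns from
    return_messages=True,
    output_key="answer",        # the chain's answer is stored back here
)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory,
)

# With memory attached, each call only needs the new question:
result = qa_chain({"question": "Can you summarize what we covered so far?"})
print(result["answer"])
```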
Requirements
- Python 3.8 or higher
- The dependencies listed in `requirements.txt`, installed with `pip install -r requirements.txt`. If you are working with the notebook, running the first block is sufficient.
- 16 GB of RAM
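For reference, a `requirements.txt` for this stack might look something like the following; the actual file in the repo is authoritative, and pinned versions are omitted here:

```
streamlit
langchain
langchain-community
transformers
torch
sentence-transformers
faiss-cpu
pypdf
```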
How to Use the Chatbot for Your Data
1. Set Up PDF Data
Add your PDF files to a `docs` directory. The chatbot will load these PDFs, process the text, and create a vector store for retrieval. In the notebook example, I loaded and processed my results report from the Understand Myself personality test.
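Under the hood, this step roughly corresponds to the sketch below. Import paths differ between LangChain versions (older releases use `langchain.document_loaders`), and the chunk sizes are illustrative, not the repo's exact settings.

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every PDF found in the docs/ directory.
documents = []
for pdf_path in Path("docs").glob("*.pdf"):
    documents.extend(PyPDFLoader(str(pdf_path)).load())

# Split into overlapping chunks so retrieval can return focused passages.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
```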
2. Run the Application
To start the chatbot:
streamlit run chatbot.py
3. Ask Questions
Use the text input field to ask questions about the data within your PDF documents. The chatbot will retrieve relevant information and generate answers based on the content of your PDFs.
Code Overview
- Model Loading: The LaMini-T5-738M model is loaded from Hugging Face as a `text2text-generation` pipeline.
- PDF Loading and Text Splitting: PDFs are processed with `PyPDFLoader` and split into smaller chunks with `RecursiveCharacterTextSplitter`.
- Vector Store Creation: Text chunks are converted into embeddings and stored in a FAISS vector store for efficient retrieval.
- Question-Answering Chain: A Conversational Retrieval Chain combines LLM responses with retrieved content to provide informed answers.
- Streamlit Interface: The chatbot UI displays past user inputs and generated responses.
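To make the flow concrete, here is a minimal end-to-end sketch. The model id (`MBZUAI/LaMini-T5-738M` on Hugging Face), the `all-MiniLM-L6-v2` embedding model, the `k=3` retrieval setting, and the UI details are illustrative assumptions, not the repo's exact code:

```python
from pathlib import Path

import streamlit as st
from transformers import pipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load and split the PDFs in docs/ (same as the step 1 sketch).
documents = []
for pdf_path in Path("docs").glob("*.pdf"):
    documents.extend(PyPDFLoader(str(pdf_path)).load())
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(documents)

# 2. Load LaMini-T5-738M as a text2text-generation pipeline (CPU-friendly).
llm = HuggingFacePipeline(pipeline=pipeline(
    "text2text-generation", model="MBZUAI/LaMini-T5-738M", max_length=512
))

# 3. Embed the chunks and index them in a FAISS vector store.
vectorstore = FAISS.from_documents(
    chunks,
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
)

# 4. Combine retrieval and generation in a Conversational Retrieval Chain.
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)

# 5. Minimal Streamlit UI; chat history lives in session state, since
#    memory is not attached to the chain yet.
st.title("Chat With Your Data")
if "history" not in st.session_state:
    st.session_state.history = []

question = st.text_input("Ask a question about your PDFs:")
# Avoid re-answering the same question on Streamlit reruns.
already_answered = (
    st.session_state.history and st.session_state.history[-1][0] == question
)
if question and not already_answered:
    result = qa_chain({"question": question,
                       "chat_history": st.session_state.history})
    st.session_state.history.append((question, result["answer"]))

for user_msg, bot_msg in st.session_state.history:
    st.write(f"**You:** {user_msg}")
    st.write(f"**Bot:** {bot_msg}")
```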
You can find the source code on my GitHub.
Thanks for reading! Cheers! 🍻
P.S. If you enjoyed this post, consider giving me some feedback and subscribing to my newsletter for more insights and updates. You can do so from my contact page. 🚀