I'm working on a project with a friend that requires building "Custom GPTs" via the API
I haven't used LangChain since all the hype around it a year or so ago. Time to relearn it. I'm going to refer to the LangChain entity in the abstract as "the agent."
Here are the patterns I need to implement:

- Requests come in via a web server (I'm going to implement this last)
- End users can upload documents.
- The agent can read the documents and associate the information with the system prompt.
- The agent can search the web for additional information.
I'm going to start with bullet 3.
Creating an agent that can parse and read documents
First off, I found a three-hour LangChain crash course on YouTube. I snagged the full video transcript using this website and passed it to ChatGPT's o1 model. This way, I can have a conversation with the whole document.
I then asked ChatGPT to write me a bare-bones document loader with the ability to ask OpenAI questions about the loaded documents. After a little bit of massaging, I ended up with this:
```python
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage
from langchain_community.document_loaders import PyPDFLoader, UnstructuredWordDocumentLoader
import nltk
import dotenv

dotenv.load_dotenv()

# Download the NLTK data the Unstructured Word loader needs
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')


def load_pdf_text(pdf_path: str) -> str:
    """Load all text from a PDF file into a single string."""
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()  # returns a list of Documents
    # each Document has .page_content; concatenate them
    text_chunks = [doc.page_content for doc in documents]
    return "\n".join(text_chunks)


def load_word_text(word_path: str) -> str:
    """Load all text from a Word (.docx) file into a single string."""
    loader = UnstructuredWordDocumentLoader(word_path)
    documents = loader.load()
    text_chunks = [doc.page_content for doc in documents]
    return "\n".join(text_chunks)


def combine_texts(*all_texts) -> str:
    """Combine multiple doc strings into one big text with separators."""
    return "\n---\n".join(text.strip() for text in all_texts if text.strip())


def answer_question(system_prompt: str, doc_text: str, user_query: str) -> str:
    """
    Takes a system prompt, doc text, and user question,
    then injects them into an LLM call, returning the final answer.
    """
    # 1. Build an LLM that can handle large contexts
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        temperature=0.0,
    )

    # 2. Build your combined prompt
    system_msg = SystemMessage(content=system_prompt)
    user_msg = HumanMessage(content=f"""
Here is the text from your documents:

=== DOCUMENT CONTENT START ===
{doc_text}
=== DOCUMENT CONTENT END ===

You MUST only use the above text to answer the question below.
If the answer is not in the text, say 'Not found in the document'.

User's Question:
{user_query}
""")

    # 3. Call the model (invoke replaces the deprecated llm(...) call style)
    response = llm.invoke([system_msg, user_msg])
    return response.content


if __name__ == "__main__":
    # Example usage
    # ---------------------------------------------------
    # 1. Load PDF text
    pdf_text = load_pdf_text("pdf.pdf")

    # 2. Load Word text (docx)
    word_text = load_word_text("word-doc.docx")

    # 3. Combine them all in memory
    combined_doc_text = combine_texts(pdf_text, word_text)

    # Example system prompt
    system_instructions = (
        "You are a helpful assistant. "
        "Answer questions based only on the provided document text. "
        "If you cannot find the answer, say so."
    )

    # 4. Ask a question
    user_question = "write me a summary of what the documents tell us"

    # 5. Get an answer from the LLM
    final_answer = answer_question(system_instructions, combined_doc_text, user_question)

    print("=== AI ANSWER ===")
    print(final_answer)
```
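One thing to watch with this stuff-everything-into-the-prompt approach is the context window. A rough pre-flight guard might look like this; the ~4-characters-per-token heuristic and the limits are my own assumptions, not real tokenizer output (something like tiktoken would be accurate):

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text.
    A real tokenizer (e.g. tiktoken) would be more accurate."""
    return max(1, len(text) // 4)

def fits_in_context(doc_text: str, context_limit: int = 128_000, reserve: int = 4_000) -> bool:
    """Check whether the combined document text plausibly fits in the model's
    context window, reserving room for the system prompt and the answer."""
    return estimate_tokens(doc_text) <= context_limit - reserve

print(fits_in_context("hello " * 100))      # small doc: True
print(fits_in_context("hello " * 200_000))  # ~300k estimated tokens: False
```

If the check fails, the next step would be chunking plus retrieval rather than one giant prompt.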
Limitations
It can't read scanned PDFs. It views them as having zero data. I'm going to have to figure that one out.
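A cheap first step might be just detecting the problem: if the loader comes back with (almost) no extractable text per page, the file is probably a scan and needs OCR instead of text extraction. A minimal check, where the thresholds are my own guesses:

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: if nearly every page yields almost no extractable text,
    the PDF is probably a scanned image and needs OCR."""
    if not page_texts:
        return True
    empty = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    return empty / len(page_texts) > 0.9

# A text PDF yields real page content; a scan yields empty strings.
print(looks_scanned(["Chapter 1: a page full of extracted text here."]))  # False
print(looks_scanned(["", "  ", ""]))                                      # True
```

You'd feed this the `.page_content` of each Document from `PyPDFLoader` and route flagged files to an OCR path instead.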
There's nothing "agentic" about this yet. It's a fixed workflow: read file 1, read file 2, summarize the files.