Using LangChain to make an information analyst

I'm working on a project with a friend that requires building "Custom GPTs" via the API

I haven't used LangChain since all the hype around it a year or so ago. Time to relearn it. I'm going to refer to the LangChain entity in the abstract as "the agent."

Here are the patterns I need to implement:

  1. Requests come in via a web server (I'm going to implement this last).
  2. End user can upload documents.
  3. The agent can read the documents and use their contents, together with the system prompt, to answer questions.
  4. The agent can search the web for additional information.

I'm going to start with bullet 3.

Creating an agent that can parse and read documents

First off, I found a three-hour LangChain crash course on YouTube. I snagged the full video transcript using this website and passed it into ChatGPT's o1 model. This way, I can have a conversation with the whole document.

I then asked ChatGPT to write me a bare-bones document loader with the ability to ask OpenAI questions about the loaded documents. After a little bit of massaging, I ended up with this.

```python
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage
from langchain_community.document_loaders import PyPDFLoader, UnstructuredWordDocumentLoader
import nltk
import dotenv

dotenv.load_dotenv()
# Download the NLTK data the Unstructured loader needs
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

def load_pdf_text(pdf_path: str) -> str:
    """Load all text from a PDF file into a single string."""
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()  # returns a list of Documents
    # each Document has .page_content; concatenate them
    text_chunks = [doc.page_content for doc in documents]
    return "\n".join(text_chunks)

def load_word_text(word_path: str) -> str:
    """Load all text from a Word (.docx) file into a single string."""
    loader = UnstructuredWordDocumentLoader(word_path)
    documents = loader.load()
    text_chunks = [doc.page_content for doc in documents]
    return "\n".join(text_chunks)

def combine_texts(*all_texts) -> str:
    """Combine multiple doc strings into one big text with separators."""
    return "\n---\n".join(text.strip() for text in all_texts if text.strip())

def answer_question(system_prompt: str, doc_text: str, user_query: str) -> str:
    """
    Takes a system prompt, doc text, and user question,
    then injects them into an LLM call, returning the final answer.
    """
    # 1. Build an LLM that can handle large contexts
    llm = ChatOpenAI(
        model_name="gpt-4o-mini",
        temperature=0.0
    )

    # 2. Build your combined prompt
    system_msg = SystemMessage(content=system_prompt)
    user_msg = HumanMessage(content=f"""
Here is the text from your documents:

=== DOCUMENT CONTENT START ===
{doc_text}
=== DOCUMENT CONTENT END ===

You MUST only use the above text to answer the question below.
If the answer is not in the text, say 'Not found in the document'.

User's Question:
{user_query}
""")

    # 3. Call the model (invoke() is the current API; calling the model
    # directly like llm([...]) is deprecated)
    response = llm.invoke([system_msg, user_msg])
    return response.content

if __name__ == "__main__":
    # Example usage
    # ---------------------------------------------------
    # 1. Load PDF text
    pdf_text = load_pdf_text("pdf.pdf")

    # 2. Load Word text (docx)
    word_text = load_word_text("word-doc.docx")

    # 3. Combine them all in memory
    combined_doc_text = combine_texts(pdf_text, word_text)

    # Example system prompt
    system_instructions = (
        "You are a helpful assistant. "
        "Answer questions based only on the provided document text. "
        "If you cannot find the answer, say so."
    )

    # 4. Ask a question
    user_question = "write me a summary of what the documents tell us"

    # 5. Get an answer from the LLM
    final_answer = answer_question(system_instructions, combined_doc_text, user_question)

    print("=== AI ANSWER ===")
    print(final_answer)
```
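One thing worth guarding against: combine_texts just concatenates everything, so a couple of large documents can blow past the model's context window. A rough character-budget guard could look like this (my own addition, not from the tutorial; the budget value is an arbitrary example, and the ~4 characters per token ratio is just a common rule of thumb for English text):

```python
def truncate_to_budget(text: str, max_chars: int = 48_000) -> str:
    """Clip document text to a rough character budget before the LLM call.

    The 48k default is an arbitrary example, chosen assuming ~4 chars
    per token as a ballpark, not a tuned limit for any specific model.
    """
    if len(text) <= max_chars:
        return text
    # Keep the head of the text and flag the cut so the model
    # (and the reader) knows content is missing
    return text[:max_chars] + "\n[... document truncated ...]"
```

A proper fix would count real tokens (e.g. with tiktoken), but this keeps the script dependency-free for now.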

Limitations

  1. It can't read scanned PDFs. It sees them as containing no text at all. I'm going to have to figure that one out.
  2. There's nothing "agentic" about this, currently. It's a fixed workflow of read file 1, read file 2, summarize the files.
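For limitation 1, a cheap first step is just detecting the problem: if PyPDFLoader comes back with pages that have almost no extractable text, the PDF is probably a scan and needs OCR instead. A sketch of that check (the threshold is an arbitrary guess on my part, not a tuned value):

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: a PDF whose pages yield almost no text is likely scanned.

    page_texts would be [doc.page_content for doc in loader.load()];
    min_chars_per_page is a guessed threshold, not a tuned one.
    """
    if not page_texts:
        return True
    # Average extractable characters per page, ignoring whitespace
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page
```

When this fires, the document could be routed to an OCR pass (pytesseract is one option) instead of silently feeding the model an empty string.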

More to come.