Building an LLM application
Storing content as vector embeddings in Pinecone
With all the buzz about AI, my curiosity got the better of me, and I've started building myself an "AI application." I've heard a lot about Pinecone, a vector database used for this sort of thing.
For example, say I have a niche biology book saved with each page as its own record. With this technology, I'll be able to query my application about a topic, have my database return the most relevant pages from the book on that topic, and then pass that info along with my query to the LLM. The purpose is to give the LLM information it wasn't trained on, and to reduce hallucination by being explicit about the information it should use to answer.
High Level Setup
Adding vector data to Pinecone -
- Identify and secure content to query
- Chunk that text into smaller pieces. Each piece will be a record in the database.
- Pass each chunk to an embedding model to convert the string value to a vector.
- Save the vector in Pinecone, storing the original text content in the `metadata` section of the payload to make it easy to view the text itself on retrieval.
Querying the data -
- Construct your query
- Perform the same string-to-vector step as step 3 of the previous workflow: send your query string to the embedding model.
- Send the vector value to Pinecone, along with the number of results you want back.
Progress
Setting up Pinecone
So far I've created my initial Pinecone db instance. It was fast to set up, though there was a core concept I needed to understand: I needed to create an index for my Pinecone database - a top-level collection of vector data - and specify its dimension. That dimension is dictated by the embedding model I planned on using, since it's the output of that model that I'm saving. I opted for OpenAI's `text-embedding-3-small` model (more info here), which outputs vectors with 1536 dimensions.
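For reference, creating the index looked roughly like this. This is a minimal sketch assuming the `@pinecone-database/pinecone` Node client; the index name, cloud, and region are placeholder choices of mine, not requirements.

```typescript
import { Pinecone } from "@pinecone-database/pinecone";

// Assumes an API key in the environment. The index name, cloud, and
// region below are placeholder choices.
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

await pc.createIndex({
  name: "romeo-and-juliet",
  dimension: 1536, // must match text-embedding-3-small's output size
  metric: "cosine",
  spec: { serverless: { cloud: "aws", region: "us-east-1" } },
});
```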
Adding text content to the database
In my hunt for text to save to the database, I found https://gutenberg.org/, a massive repository of public domain e-books. To start, I downloaded Romeo and Juliet.
If I loaded the entire play into the database as a single record, it wouldn't be much help, as I'd be passing the entire play to my LLM instead of a specific part. That's both expensive and error-prone, since the model would need to parse the massive content. Therefore, I need to "chunk" it - split the text into pieces and insert each piece as its own record.
Originally I thought I'd chunk down to individual scenes and then maybe chunk further, but that degree of precision would require a lot of work. Since this is purely for educational purposes, I decided to chunk by paragraph instead. I wrote a function in Node to create chunks 7 paragraphs long, with a 2-paragraph overlap with the previous chunk. This follows a best practice I heard in passing: you want some overlap between chunks to make sure context is preserved.
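Here's a rough sketch of that chunking function. The 7/2 values are the ones mentioned above; splitting on blank lines is an assumption about how the Gutenberg text is formatted.

```typescript
// Split the play into overlapping chunks of paragraphs.
// chunkSize and overlap match the 7/2 values described above.
function chunkByParagraph(text: string, chunkSize = 7, overlap = 2): string[] {
  // Split on blank lines to get paragraphs, dropping empty entries.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance 5 paragraphs per chunk
  for (let i = 0; i < paragraphs.length; i += step) {
    chunks.push(paragraphs.slice(i, i + chunkSize).join("\n\n"));
    if (i + chunkSize >= paragraphs.length) break; // last chunk reached the end
  }
  return chunks;
}
```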
Then once I have my chunks, it's time to send them to Pinecone! I loop through each chunk, send it to OpenAI to create the embedding vector, and then save that vector value to the database. In addition, I add the chunk text to my `metadata` object. This is so that when I query the database and get my results, I can look at the raw text without needing to convert the embedding back to plain text.
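The upsert loop looked something like this - again a sketch, assuming the `openai` and `@pinecone-database/pinecone` Node clients and the placeholder index name from above.

```typescript
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index("romeo-and-juliet"); // placeholder index name from above

async function upsertChunks(chunks: string[]) {
  for (let i = 0; i < chunks.length; i++) {
    // Convert the chunk text to a 1536-dimension vector.
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunks[i],
    });

    // Save the vector, keeping the raw text in metadata so query results
    // are readable without converting the embedding back to text.
    await index.upsert([
      {
        id: `chunk-${i}`,
        values: embedding.data[0].embedding,
        metadata: { text: chunks[i] },
      },
    ]);
  }
}
```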
Querying the database
I can now query my database using a similar process to the initial embedding creation step.
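A sketch of the query side, reusing the `openai` client and `index` handle from the previous snippet. `QUERY INPUT` is just a placeholder string.

```typescript
// Embed the query string the same way the chunks were embedded...
const QUERY_INPUT = "QUERY INPUT"; // placeholder query string
const queryEmbedding = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: QUERY_INPUT,
});

// ...then ask Pinecone for the closest matches, returning the stored
// text alongside each match via includeMetadata.
const results = await index.query({
  vector: queryEmbedding.data[0].embedding,
  topK: 3,
  includeMetadata: true,
});

console.log(results.matches); // each match has an id, score, and metadata.text
```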
The above code will give me the top 3 records in the index that are semantically similar to my input of `QUERY INPUT`. It converts that string to its vector value, compares that vector against the vectors in the database, and returns the top 3 most similar. Each result also includes a match score, indicating how strongly similar the passage is to the query.
Demo output
Since this is Romeo and Juliet, I'm going to change my query input to "Our love will survive, against all odds". The top result came back with a similarity score of 0.314391255. The same chunk was also the top result for "Our love is more important than our family.", with a score of 0.278758824. I only queried for the top result, which is why I've only gotten one response.
Parting thoughts
The whole concept of taking a block of text, converting it to a vector, and comparing other vectors against it based on meaning still feels like magic. But now, a little less, I guess.
The main thing I want to dig deeper into is chunking strategy. I believe that if I'd chunked more finely, I would get higher similarity scores, as each chunk would be smaller and more specifically about a single thing. That said, if I chunk too finely, I lose the broader context of the passage.
A suggestion a friend made was to keep my chunk size as is, but to then pass each retrieved chunk to an LLM with an instruction along the lines of "only return the parts of this passage most related to the original prompt".
I'm also wondering if I can implement a sort of cascading chunking strategy, where I go through the original text several times with different chunking intervals. With this strategy, I'd need to keep fine-grained tracking in the `metadata` of where in the text each chunk sits. In theory, this would let me identify the most semantically similar chunks and then evaluate how similar the values are when I view them at larger chunk sizes. As I'm writing this, I'm unsure exactly how I'd pull it off, but it's a fun thought experiment.
Also also - since these chunks have no knowledge of each other, there's no overarching context. The `metadata` object exists, but as far as I can tell it's only used for filtering and for giving additional information in the query response - it isn't part of the semantic search itself. Therefore, it might make sense to inject additional context into each chunk. In the Romeo and Juliet example, maybe a line saying "this is from act 1, scene 3". Or maybe even a rolling summary of "here's what's happened in the play so far".
Though that sounds expensive, since for every chunk I'd need to pass the previous summary and the new text into an LLM to get the new summary. And would that mess up my query results by adding too much overarching meaning to a specific section?
All things to experiment with. It's an exciting and slightly overwhelming feeling when answering one question yields four more.
Thanks for reading.