
Most ergonomic way to create continuous embeddings of a fast-changing doc?

We need to create continuous embeddings of a fast-changing document and store them in a vector database. I'm looking for any best practices for doing so. At any given point, the data in the doc and the vector store have to match, with no duplicates or stale entries, all while keeping costs low by batching as much as possible.

Anyone worked on something like this before?


It all depends on your stack. For example, with my usual stack I'd generate embeddings with OpenAI's text-embedding-3-small model and store them in Elasticsearch 9+.

If you need to continuously re-embed data, OpenAI might become cost-prohibitive, so I'd look at open models and self-hosting. For storage, it again depends on your stack: pgvector, Qdrant, etc.

Thanks for the response!
I find turbopuffer to be a great vector store. Using gemini-embedding-001 at the moment, and it's working great. The only issue I've had to solve was maintaining an embedding cache of the documents, with a last-updated timestamp and a SHA-256 of the content to prevent duplicates. Storing the SHA-256 of the vectors is apparently also useful, as some people have suggested.
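A minimal sketch of that content-hash cache idea, assuming an in-memory dict for illustration (the names `EmbeddingCache` and `embed_fn` are made up, and `embed_fn` stands in for whatever embedding API you call):

```python
import hashlib
import time

def content_hash(text: str) -> str:
    """SHA-256 of the document content, used as the dedup key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class EmbeddingCache:
    """Maps content hash -> (embedding, last_updated timestamp)."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to your embedding model
        self.entries = {}

    def get_or_embed(self, text: str):
        key = content_hash(text)
        if key in self.entries:
            # Unchanged content: reuse the stored vector, skip the API call.
            return self.entries[key][0], False
        vec = self.embed_fn(text)
        self.entries[key] = (vec, time.time())
        return vec, True  # True = a new embedding was produced
```

Because the key is the hash of the content rather than the doc ID, an edit that reverts a document to a previous state also hits the cache, and two docs with identical content share one embedding.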

Simplest solution for low scale is Postgres + pgvector:

  1. Create a content (text) column with a unique key to prevent dupes
  2. Create an embedding column to store the embedding
  3. Write a scheduled background job that runs periodically, finds docs whose content has changed since they were last processed, re-embeds them, and updates a “last processed at” column whenever a doc’s embedding is refreshed.
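The selection logic of that scheduled job can be sketched like this (table and column names are illustrative; in Postgres this would be a query filtering on `updated_at > last_processed_at`):

```python
from dataclasses import dataclass, field

@dataclass
class DocRow:
    """Stand-in for a row in the docs table (hypothetical schema)."""
    id: int
    content: str
    updated_at: float
    last_processed_at: float = 0.0
    embedding: list = field(default_factory=list)

def run_embedding_job(rows, embed_fn, now):
    """Re-embed only rows whose content changed since the last run."""
    stale = [r for r in rows if r.updated_at > r.last_processed_at]
    for r in stale:  # in practice, batch these into one embedding API call
        r.embedding = embed_fn(r.content)
        r.last_processed_at = now
    return len(stale)  # number of docs refreshed this run
```

Docs that haven't changed since the last run are never touched, which is what keeps embedding costs proportional to the change rate rather than the corpus size.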

For high scale, consider an elasticsearch cluster and have the scheduled job produce messages to be written to a queue + horizontally scalable worker to process the queue.
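The queue-plus-worker pattern could look roughly like this single-threaded sketch (in production the queue would be something like Kafka or SQS, with the workers scaled horizontally; the batching is what lets each worker embed many docs per API call):

```python
from collections import deque

def drain_in_batches(queue: deque, batch_size: int):
    """Yield batches of queued doc IDs so a worker can embed several docs per call."""
    while queue:
        batch = []
        while queue and len(batch) < batch_size:
            batch.append(queue.popleft())
        yield batch

# The scheduled job enqueues changed doc IDs; each worker pulls a batch,
# embeds it in one request, and writes the vectors back to the store.
```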

Thanks, but I think this strategy suits low scale and infrequently changing docs. I'm looking for something at a decent scale, with fast-changing documents.

Define “fast changing” and “decent scale”
