
Most ergonomic way to create continuous embeddings of a fast-changing doc?

We need to create continuous embeddings of a fast-changing document and store them in a vector database. I'm looking for any best practices for doing so. At any given point, the data in the doc and the vector store have to match, with no duplicates or stale entries, all while keeping costs low by batching as much as possible.

Anyone worked on something like this before?


It all depends on your stack. For example, with my usual stack I'd generate embeddings with OpenAI's text-embedding-3-small model and store them in Elasticsearch 9+.

If you need to continuously re-embed data, OpenAI might become cost-prohibitive, so I'd look at open models and self-hosting. For storage, it again depends on your stack: pgvector, Qdrant, etc.

Thanks for the response!
I find turbopuffer to be a great vector store. Using gemini-embedding-001 at the moment, and it's working great. The only issue I've had to solve was maintaining an embedding cache of the documents, with a last-updated timestamp and a SHA-256 of the content to prevent duplicates. Storing the SHA-256 of the vectors is apparently also useful, as some people have suggested.
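A minimal sketch of that content-hash cache idea, assuming an in-memory dict for illustration (the names `EmbeddingCache` and `embed_fn` are made up, and `embed_fn` stands in for whatever embedding API you call):

```python
import hashlib
import time

def content_hash(text: str) -> str:
    """SHA-256 of the document content, used as the dedup key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class EmbeddingCache:
    """Maps content hash -> (embedding, last_updated timestamp)."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to your embedding model
        self.entries = {}

    def get_or_embed(self, text: str):
        key = content_hash(text)
        if key in self.entries:
            # Unchanged content: reuse the stored vector, skip the API call.
            return self.entries[key][0], False
        vec = self.embed_fn(text)
        self.entries[key] = (vec, time.time())
        return vec, True  # True = a new embedding was produced
```

Because the key is the hash of the content rather than the doc ID, an edit that reverts a document to a previous state also hits the cache, and two docs with identical content share one embedding.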

Simplest solution for low scale is Postgres + pgvector:

  1. Create a content (text) column with a unique key to prevent dupes
  2. Create an embedding column to store the embedding
  3. Write a scheduled background job that runs periodically, finds docs whose content has changed since they were last processed, re-embeds them, and updates a “last processed at” column whenever a doc’s embedding is refreshed.
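The selection logic of that scheduled job can be sketched like this (table and column names are illustrative; in Postgres this would be a query filtering on `updated_at > last_processed_at`):

```python
from dataclasses import dataclass, field

@dataclass
class DocRow:
    """Stand-in for a row in the docs table (hypothetical schema)."""
    id: int
    content: str
    updated_at: float
    last_processed_at: float = 0.0
    embedding: list = field(default_factory=list)

def run_embedding_job(rows, embed_fn, now):
    """Re-embed only rows whose content changed since the last run."""
    stale = [r for r in rows if r.updated_at > r.last_processed_at]
    for r in stale:  # in practice, batch these into one embedding API call
        r.embedding = embed_fn(r.content)
        r.last_processed_at = now
    return len(stale)  # number of docs refreshed this run
```

Docs that haven't changed since the last run are never touched, which is what keeps embedding costs proportional to the change rate rather than the corpus size.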

For high scale, consider an elasticsearch cluster and have the scheduled job produce messages to be written to a queue + horizontally scalable worker to process the queue.
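The queue-plus-worker pattern could look roughly like this single-threaded sketch (in production the queue would be something like Kafka or SQS, with the workers scaled horizontally; the batching is what lets each worker embed many docs per API call):

```python
from collections import deque

def drain_in_batches(queue: deque, batch_size: int):
    """Yield batches of queued doc IDs so a worker can embed several docs per call."""
    while queue:
        batch = []
        while queue and len(batch) < batch_size:
            batch.append(queue.popleft())
        yield batch

# The scheduled job enqueues changed doc IDs; each worker pulls a batch,
# embeds it in one request, and writes the vectors back to the store.
```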

Thanks, but I think this strategy suits low scale and infrequently changing docs. I'm looking for something at a decent scale, with fast-changing documents.

Define “fast changing” and “decent scale”
