Most ergonomic way to create continuous embeddings of a fast-changing doc?
We need to create continuous embeddings of a fast-changing document and store them in a vector database. I'm looking for any best practices for doing this. At any given point, the data in the doc and the vector store have to map one-to-one, with no duplicates or stale entries, all while keeping costs low by batching as much as possible.
Anyone worked on something like this before?
It all depends on your stack. For example, I'd use my usual stack, generate embeddings with OpenAI's text-embedding-3-small model, and store them in Elasticsearch 9+.
If you're going to continuously re-embed data, OpenAI might become cost-prohibitive, so I'd look at open models and self-hosting. For storage, it again depends on your stack: pgvector, Qdrant, etc.
Thanks for the response!
I find turbopuffer to be a great vector store. Using gemini-embedding-001 at the moment, and it's working great. The only issue I've had to solve was maintaining an embedding cache of the documents, with a last-updated timestamp and a SHA-256 of the content to prevent duplicates. Storing the SHA-256 of the vectors is apparently also useful, as some people have suggested.
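A minimal sketch of that cache idea, assuming a dict-backed store and a caller-supplied `embed` callable standing in for the real embedding API (both names are hypothetical, not from any specific library):

```python
import hashlib
import time


def content_hash(text: str) -> str:
    """SHA-256 of the document content, used to detect unchanged docs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Maps doc_id -> (content hash, last-updated timestamp, vector)."""

    def __init__(self, embed):
        self.embed = embed   # callable: text -> vector
        self._entries = {}   # doc_id -> {"hash", "last_updated", "vector"}

    def upsert(self, doc_id: str, text: str):
        """Re-embed only when the content hash actually changed.

        Returns (vector, changed): changed is False on a cache hit,
        so the caller can skip writing to the vector store entirely.
        """
        h = content_hash(text)
        entry = self._entries.get(doc_id)
        if entry and entry["hash"] == h:
            return entry["vector"], False   # unchanged: no embed, no write
        vector = self.embed(text)
        self._entries[doc_id] = {
            "hash": h,
            "last_updated": time.time(),
            "vector": vector,
        }
        return vector, True                 # new or changed content
```

On a cache hit nothing is re-embedded or rewritten, so repeated saves of an unchanged doc cost nothing, and keying by `doc_id` means updates overwrite rather than duplicate.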
The simplest solution at low scale is Postgres + pgvector.
For high scale, consider an Elasticsearch cluster, and have the scheduled job write pending updates to a queue, with horizontally scalable workers processing the queue.
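A sketch of that queue + worker batching pattern, using an in-process `queue.Queue` as a stand-in for a real message broker; `embed_batch` and the dict `store` are hypothetical placeholders for the batch embedding call and the vector store upsert:

```python
import queue


def drain_batch(q: queue.Queue, max_batch: int = 32):
    """Pull up to max_batch pending doc updates off the queue."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def process_updates(q: queue.Queue, embed_batch, store):
    """One worker iteration: embed a batch, then upsert it keyed by doc_id."""
    batch = drain_batch(q)
    if not batch:
        return 0
    ids, texts = zip(*batch)
    vectors = embed_batch(list(texts))   # one API call per batch, not per doc
    for doc_id, vec in zip(ids, vectors):
        store[doc_id] = vec              # keyed upsert: retries stay idempotent
    return len(batch)
```

In production the loop body runs continuously in N worker processes; because upserts are keyed by `doc_id`, duplicate deliveries from the queue are harmless, and batching amortizes the per-call embedding cost.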
Thanks, but I think this strategy is for low scale and less frequently changing docs. I'm looking for something at a decent scale, with fast-changing documents.
Define “fast changing” and “decent scale”