Simplest solution for low scale is Postgres + pgvector:
Create a content (text) column with a unique key to prevent dupes
Create an embedding column (pgvector's vector type) to store the embedding
Write a scheduled background job that runs every so often, selects docs whose content has changed since their embeddings were last computed, re-embeds them, and stamps a "last processed at" timestamp column on each doc it updates.
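The selection logic of that job can be sketched in Python. This is a minimal sketch, not a full implementation: the column names (updated_at, last_processed_at) and the min_age debounce window are assumptions, and in practice the filter would just be a SQL WHERE clause (last_processed_at IS NULL OR updated_at > last_processed_at).

```python
from datetime import datetime, timedelta

def docs_needing_embeddings(docs, min_age=timedelta(minutes=5), now=None):
    """Return IDs of docs whose content is newer than their embeddings.

    docs: dicts with keys id, updated_at, last_processed_at (None if the
    doc has never been embedded). min_age debounces docs that are still
    being actively edited, so we don't re-embed on every keystroke.
    """
    now = now or datetime.utcnow()
    stale = []
    for doc in docs:
        never_processed = doc["last_processed_at"] is None
        changed_since = (
            not never_processed
            and doc["updated_at"] > doc["last_processed_at"]
        )
        settled = now - doc["updated_at"] >= min_age
        if (never_processed or changed_since) and settled:
            stale.append(doc["id"])
    return stale
```

The unique key on content mentioned above keeps the job from embedding the same text twice under different doc rows.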
For high scale, consider an Elasticsearch cluster: have the scheduled job write messages to a queue, and run a horizontally scalable worker pool to consume the queue and do the embedding + index writes.
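The queue + worker shape can be sketched in-process with Python's stdlib. This is an illustration only: in production the queue would be something like Kafka or SQS and the workers separate processes, and embed_and_index is a hypothetical callback standing in for "embed the doc and write it to the search index".

```python
import queue
import threading

def run_pipeline(doc_ids, embed_and_index, num_workers=4):
    """Producer (the scheduled job) enqueues doc IDs; workers drain the queue."""
    work = queue.Queue()
    for doc_id in doc_ids:
        work.put(doc_id)

    def worker():
        while True:
            try:
                doc_id = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            embed_and_index(doc_id)  # embed + write to the index
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The point of the split is that embedding is the expensive step, so you scale it by adding workers without touching the scheduler.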
thanks, but i think this strategy is for low scale and less frequently changing docs. looking for something at a decent scale, and fast changing documents
Define “fast changing” and “decent scale”