
AI to work with a lot of data (two use cases)

Hey everyone! This one is especially for all the AI enthusiasts!

So I’ve got two questions, which might be related:

1. I want a system where I can put in all my stuff (e.g. a folder with PDF files, documents, etc.), plus connections to things like invoicing software, a help centre, etc. I then want an LLM I can chat with to get useful information back, such as where to find a file, or answers to questions like “how much revenue did we invoice for in 2024?”


2. I have a large database with millions of rows (some short text, some numbers, some full-text fields). I want a chatbox where I can type a query and get back some of the rows, or have it do some calculations or analysis. This would be a MySQL database that’s currently hundreds of GB in size. How would I approach this in a time- and cost-efficient way?


With a lot of data, you can't stick everything in the prompt, so you make a selection of the data first. It's very hip to embed your documents as vectors and pick the ones with the highest cosine similarity to the query. That's great because it works with unstructured data. But if your data is structured or structure-able, you could just use SQL. That's nicer because it's less of a black box, and it's fast and cheap.
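To make the cosine similarity selection concrete, here is a minimal sketch. The `embed` function is only a stand-in that counts words; a real setup would call an embedding model instead. The documents and the `top_k` helper are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Assumption: plain word counts stand in for a real embedding model.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Dot product over shared words, divided by the product of vector lengths.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, documents, k=2):
    # Score every document against the query and keep the k best matches.
    q = embed(query)
    scored = [(cosine_similarity(q, embed(d)), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

# Hypothetical document snippets.
documents = [
    "invoice 2024 revenue total for customer acme",
    "help centre article about password reset",
    "meeting notes from the product team",
]
selected = top_k("revenue invoice 2024", documents, k=1)
```

Only the selected snippets would then go into the prompt, not the whole document set.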

For example, take a text query, have an LLM that knows your entity structure write SQL for it, execute that, then use the results as context for your prompt. If you built a SQL db as an index of the data in example (1), that would work there. It would also work for example (2).
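A rough sketch of that flow, under some loud assumptions: `generate_sql` is a hypothetical placeholder for the LLM call and just returns a fixed query, the `invoices` table and its rows are invented, and SQLite is used in place of MySQL so the example runs on its own.

```python
import sqlite3

# Hypothetical schema that would be shown to the LLM.
SCHEMA = (
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, year INTEGER)"
)

def generate_sql(question, schema):
    # Assumption: stand-in for an LLM call that receives the question and
    # the schema and returns SQL. Here it returns a canned query.
    return "SELECT SUM(amount) FROM invoices WHERE year = 2024"

def run_readonly(conn, sql):
    # Basic guard: refuse anything that is not a SELECT statement.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

def build_answer_prompt(question, sql, rows):
    # The query results become the context for the final prompt.
    return (
        f"Question: {question}\n"
        f"SQL used: {sql}\n"
        f"Result rows: {rows}\n"
        "Answer the question using only these rows."
    )

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO invoices (customer, amount, year) VALUES (?, ?, ?)",
    [("Acme", 1200.0, 2024), ("Globex", 800.0, 2024), ("Acme", 500.0, 2023)],
)

question = "How much revenue did we invoice for in 2024?"
sql = generate_sql(question, SCHEMA)
rows = run_readonly(conn, sql)
prompt = build_answer_prompt(question, sql, rows)
```

The `SELECT`-only check is a very light safeguard; running generated SQL under a read-only database user would be the sturdier option.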

Hth!

Thanks so much! Very interesting. So that's sort of text-to-SQL (or vice versa)?

Any resources where I can go down a bit of a rabbit hole for a few hours and learn more about it?

So my understanding would be that in case 2 I would send something like this in the prompt itself:

table: my_table
fields: id INT, name VARCHAR, email VARCHAR, country VARCHAR

And then if I e.g. ask how many people in my app are from the UK, I'd send the question plus the table structure in the prompt, and it would probably return something along the lines of SELECT COUNT(id) FROM my_table WHERE country = 'UK', right?

So depending on the complexity of the database (and how often it changes), it still might be a bit tricky, especially if I have relational tables that connect to each other via IDs, as those would need additional explanation in the prompt. If I used one of the big LLM providers, such as OpenAI, what sort of model would work best for this?
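One possible way to handle both the changing schema and the ID relations is to generate the schema text from the live database each time, instead of writing it by hand. A minimal sketch, assuming SQLite as a stand-in (with MySQL the same information would come from INFORMATION_SCHEMA) and made-up `users` and `orders` tables; `describe_schema` is a hypothetical helper name.

```python
import sqlite3

def describe_schema(conn):
    # Build the "table / fields / relation" text straight from the database,
    # so the prompt follows schema changes automatically.
    lines = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        fields = ", ".join(f"{c[1]} {c[2]}" for c in cols)
        lines.append(f"table: {table}")
        lines.append(f"fields: {fields}")
        # foreign_key_list rows are (id, seq, table, from, to, ...).
        for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
            lines.append(f"relation: {table}.{fk[3]} -> {fk[2]}.{fk[4]}")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR, email VARCHAR, country VARCHAR);
CREATE TABLE orders (id INT PRIMARY KEY, user_id INT REFERENCES users(id), total REAL);
""")
schema_text = describe_schema(conn)
```

The resulting `schema_text` would be prepended to the question in the prompt, with the `relation:` lines telling the model which columns to join on.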

For the first one, I think klu.so/ might work.

You can connect your Gmail, Google Drive, Calendar, Dropbox, Notion, Slack, etc. and ask it questions.

Thanks, I'll check it out! Not a lot of integrations yet, but still worth a look!