AI to work with a lot of data (two use cases)

Hey everyone! This one is especially for all the AI enthusiasts!

So I’ve got two questions, which might be related:

1. I want a system where I can put in all my stuff (e.g. folder with PDF files, documents, etc.) as well as some connections with things like invoicing software, help centre, etc. and then I want an LLM where I can chat with this and get useful information back. As in, where can I find this file, or I can ask stuff like “how much revenue did we invoice for in 2024?” etc.


2. I have a large database of millions of rows of data (some shorter text, some numbers, some fulltext fields) and I want to have a system where I can have a chatbox and query something and it basically returns some of the rows or maybe does some calculations or analysis. This would basically be a MySQL database that’s currently hundreds of GB in size. How would I approach this in a time- and cost-efficient way?


With a lot of data, you can't stick everything in the prompt, so you have to select the relevant data first. It's very hip to do that with cosine similarity over embedding vectors of your documents, and it's great because it works with unstructured data. But if your data is structured or structure-able, you could just use SQL. That's nicer because it's less of a black box, and it's fast and cheap.
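To make the cosine-similarity idea concrete, here's a minimal sketch. The embed function is a stand-in I made up (plain word counts); a real setup would call an embedding model instead, but the ranking step looks the same.

```python
import math
from collections import Counter

def embed(text):
    # Assumption: a bag-of-words count vector stands in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity = dot product divided by the product of the vector lengths.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, docs, k=2):
    # Rank documents by similarity to the query and keep the best k for the prompt.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Made-up documents, purely for illustration.
docs = [
    "invoice 2024-001 for Acme total 1200 EUR",
    "help centre article on resetting a password",
    "invoice 2024-002 for Globex total 800 EUR",
]
best = top_k("invoice total for Acme", docs, k=1)
```

Only the selected documents go into the prompt, which is what keeps it small.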

For example, take a text query and use an LLM with knowledge of your entity structure to write SQL, execute that, then use the results as context for your prompt. If you built a SQL db as an index for use case (1), that would work there. It would also work for use case (2).
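Roughly, that flow could look like the sketch below. Everything in it is an assumption for illustration: generate_sql is a hypothetical stub where the LLM call would go, the invoices table is made up, and SQLite stands in for MySQL so it runs anywhere.

```python
import sqlite3

# Hypothetical schema; in practice this is your real entity structure.
SCHEMA = (
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, issued_on TEXT)"
)

def generate_sql(question, schema):
    # Stand-in for the LLM call: a real version would send the schema and the
    # question to a model and return the SQL it writes. Hard-coded here.
    return "SELECT SUM(amount) FROM invoices WHERE issued_on LIKE '2024-%'"

def answer_context(conn, question):
    sql = generate_sql(question, SCHEMA)
    # Crude safety check: refuse anything that is not a SELECT.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are executed")
    rows = conn.execute(sql).fetchall()
    # This text becomes the context for the final answer prompt.
    return f"Question: {question}\nSQL: {sql}\nRows: {rows}"

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO invoices (customer, amount, issued_on) VALUES (?, ?, ?)",
    [("Acme", 1200.0, "2024-03-01"),
     ("Globex", 800.0, "2024-11-15"),
     ("Acme", 500.0, "2023-12-31")],
)
context = answer_context(conn, "How much revenue did we invoice for in 2024?")
```

The database does the heavy lifting, so only a handful of result rows ever reach the model. Running the generated SQL under a read-only database user is a safer guard than the string check shown here.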

Hth!

Thanks so much! Very interesting. So that's sort of text-to-sql (or vice versa)?

Any resources where I can go down a bit of a rabbit hole for a few hours and learn more about it?

So my understanding would be that in case 2 I would send in the prompt itself something like

table: my_table
fields: id INT, name VARCHAR, email VARCHAR, country VARCHAR

And then if I e.g. ask how many people in my app are from the UK, I'd send the query + the table structure in the prompt and it would probably return something along the lines of SELECT COUNT(id) FROM my_table WHERE country = 'UK', right?
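As a sketch of that idea: the snippet below builds the prompt from the table structure and checks the hoped-for query against a tiny in-memory SQLite table. build_prompt and the sample rows are made up for illustration, and the SQL is what one would hope the model returns, not actual model output.

```python
import sqlite3

def build_prompt(table, fields, question):
    # Put the table structure and the question into one prompt string.
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in fields)
    return (
        f"table: {table}\n"
        f"fields: {cols}\n\n"
        f"Write one SQL SELECT statement that answers: {question}"
    )

fields = [("id", "INT"), ("name", "VARCHAR"),
          ("email", "VARCHAR"), ("country", "VARCHAR")]
prompt = build_prompt("my_table", fields,
                      "How many people in my app are from the UK?")

# The kind of SQL one would hope to get back. The string literal needs quotes.
expected_sql = "SELECT COUNT(id) FROM my_table WHERE country = 'UK'"

# Try the query on a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INT, name VARCHAR, email VARCHAR, country VARCHAR)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [(1, "Ann", "ann@example.com", "UK"),
     (2, "Bob", "bob@example.com", "US"),
     (3, "Cat", "cat@example.com", "UK")],
)
count = conn.execute(expected_sql).fetchone()[0]
```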

So depending on the complexity of the database (and how often it changes), it still might be a bit tricky, especially if I have relational tables that connect multiple tables via IDs, etc., as this might need additional explanations in the prompt. If I used one of the big LLM providers, such as OpenAI, what sort of model would work best with this?

Yes, it's text-to-sql. I would just dump your entity relationships into the prompt and ask for the SQL for some example questions. Then, I'd take those and correct the SQL so that it works -- either asking the LLM to correct bugs you see or just doing it by hand. Then I'd put those input-output examples into the prompt and try it on new things and rinse & repeat.
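A minimal sketch of what "put those input-output examples into the prompt" might look like. The schema, the example pairs, and the few_shot_prompt helper are all hypothetical; the point is just the shape of the prompt.

```python
def few_shot_prompt(schema, examples, question):
    # Schema first, then the hand-corrected question/SQL pairs, then the new question.
    parts = ["Schema:", schema, ""]
    for q, sql in examples:
        parts += [f"Q: {q}", f"SQL: {sql}", ""]
    parts += [f"Q: {question}", "SQL:"]
    return "\n".join(parts)

# Hypothetical entity relationships, including the ID that links the tables.
schema = (
    "users(id INT, name VARCHAR, country VARCHAR)\n"
    "orders(id INT, user_id INT REFERENCES users(id), total DECIMAL)"
)

# Pairs you have already checked and fixed by hand.
examples = [
    ("How many users are from the UK?",
     "SELECT COUNT(id) FROM users WHERE country = 'UK'"),
    ("What is the total order value per country?",
     "SELECT u.country, SUM(o.total) FROM orders o "
     "JOIN users u ON u.id = o.user_id GROUP BY u.country"),
]

prompt = few_shot_prompt(schema, examples, "Which user has spent the most?")
```

Each time the model gets a new question wrong, the corrected pair can be appended to examples, which is the rinse & repeat loop.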

o1-preview will do best, but once you have a good prompt that works well, I would try it out on GPT-4o and smaller models.

Awesome! This really helps. I started going down a rabbit hole of text-to-sql too :)

Cool! Happy to help. I'd love to hear how you get on.

for the first one, i think klu.so/ might work

you can connect your Gmail, Google Drive, Calendar, Dropbox, Notion, Slack, etc and ask it questions

Thanks, I'll check it out! Not a lot of integrations yet, but still worth a check!

For 1, you might find Google Gemini quite life-changing. It can pull from your Google Drive, your Gmail, etc. The only caveat is the "links" via API to external stuff like invoicing software or a helpdesk, depending on what you use. If you look, Zendesk has a Zapier zap that integrates with Google Gemini.

For 2, is what you're trying to do vector search in MySQL, or is it text-to-SQL? If it's the latter, there are plenty of libraries for you to poke at. If it's vectors, I suggest playing around with MariaDB.

Thanks! I keep seeing the Google Gemini ads in my Gmail, but didn't know it also integrates with Google Drive.

Regarding (2) I'm not yet sure. I am playing around with text-to-sql at the moment. It might be sufficient. But I'll also research vector search!