I find you need to create your own benchmarking tool to focus on what you want. I built one focused on RAG pipelines, with a framework for creating, managing, and running benchmarks with various question types and validation methods. Basically a test+eval tool, and that's similarly applied to a testing workflow. You can use an LLM (VS Code + Copilot, Cursor) to help you do this in Python, with a bit of work.
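To give a feel for what "benchmarks with question types and validation methods" can look like, here's a minimal sketch. All the names (`Question`, `run_benchmark`, the validators) are hypothetical, not from the tool described above; the "pipeline" is just any callable mapping a question string to an answer string.

```python
# Minimal sketch of a RAG benchmark harness (hypothetical names throughout).
from dataclasses import dataclass
from typing import Callable

# Each validation method is a predicate over (answer, expected).
VALIDATORS: dict[str, Callable[[str, str], bool]] = {
    "exact": lambda answer, expected: answer.strip() == expected.strip(),
    "contains": lambda answer, expected: expected.lower() in answer.lower(),
}

@dataclass
class Question:
    text: str
    expected: str
    qtype: str = "factual"        # question type, e.g. factual / multi-hop
    validator: str = "contains"   # which validation method to apply

def run_benchmark(pipeline: Callable[[str], str],
                  questions: list[Question]) -> dict:
    """Run every question through the pipeline and score the answers."""
    results = []
    for q in questions:
        answer = pipeline(q.text)
        passed = VALIDATORS[q.validator](answer, q.expected)
        results.append({"question": q.text, "type": q.qtype, "passed": passed})
    score = sum(r["passed"] for r in results) / len(results)
    return {"score": score, "results": results}

if __name__ == "__main__":
    # Stub pipeline standing in for a real RAG chain.
    def fake_pipeline(question: str) -> str:
        return "Paris is the capital of France."

    report = run_benchmark(fake_pipeline, [
        Question("What is the capital of France?", "Paris"),
        Question("What is the capital of Spain?", "Madrid", validator="exact"),
    ])
    print(report["score"])  # 0.5: one pass, one fail
```

In practice you'd swap the stub for your actual pipeline and add an LLM-as-judge validator for open-ended answers, but the shape stays the same: a question set, a validator per question, and an aggregate score.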
Interesting. So, no framework really worked for you?
@Nomadsteve did you have one in mind, particularly? I ended up writing my own; might open-source it down the line.
trigger.dev looks very promising.
I think a hybrid approach might be the fastest/easiest: use n8n.io/ai, with real code where you need it.
Honestly, I need to build out more agents to understand fully. I have a lot of gaps in my knowledge.
Ah, so this wasn't only about the "testing them" part ;) I focused on tests only! I don't use either of those solutions, fwiw.