Those building AI agents. What's the biggest pain when building and using them?

I'm trying to understand the biggest pain points when building and using agents. For me, it's testing them. How the hell do you test the workflow?


I find you need to create your own benchmarking tool so you can focus on what you care about. I've built one for RAG pipelines: a framework for creating, managing, and running benchmarks with various question types and validation methods. It's basically a test+eval tool, and the same approach applies to testing a workflow. You can use an LLM (VSCode+Copilot, Cursor) to help you build this in Python, with a bit of work.
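For anyone wondering what that looks like in practice, here is a minimal sketch of the idea, not the actual tool. Every name in it (`Case`, `run_benchmark`, `exact_match`, `contains`, `fake_pipeline`) is made up for illustration, and `fake_pipeline` stands in for whatever RAG pipeline or agent workflow you are testing.

```python
from dataclasses import dataclass
from typing import Callable

# Validation methods: each takes (answer, expected) and returns pass/fail.
def exact_match(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()

def contains(answer: str, expected: str) -> bool:
    return expected.strip().lower() in answer.lower()

@dataclass
class Case:
    question: str
    expected: str
    validate: Callable[[str, str], bool] = exact_match

def run_benchmark(pipeline: Callable[[str], str], cases: list[Case]) -> dict:
    """Run every case through the pipeline and collect the failures."""
    failures = []
    for case in cases:
        answer = pipeline(case.question)
        if not case.validate(answer, case.expected):
            failures.append((case.question, answer))
    passed = len(cases) - len(failures)
    return {"passed": passed, "total": len(cases), "failures": failures}

# Hypothetical stand-in for a real RAG pipeline or agent workflow.
def fake_pipeline(question: str) -> str:
    canned = {
        "What is the capital of France?": "Paris",
        "Which language is the tool written in?": "It is written in Python.",
    }
    return canned.get(question, "I don't know")

cases = [
    Case("What is the capital of France?", "paris"),
    Case("Which language is the tool written in?", "python", validate=contains),
    Case("Who wrote the report?", "Alice"),
]
report = run_benchmark(fake_pipeline, cases)
print(f"{report['passed']}/{report['total']} passed")
```

Keeping the validation method per case is what lets one runner cover different question types: strict answers use `exact_match`, free-form answers use `contains`, and you can plug in an LLM-as-judge function with the same signature later.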

Interesting. So, no framework really worked for you?

@Nomadsteve did you have one in mind, particularly? I ended up writing my own, might open source it down the line.

trigger.dev looks very promising.

I think a hybrid approach might be the fastest/easiest. Use n8n.io/ai with real code if you need it.

Honestly, I need to build out more agents to understand fully. I have a lot of gaps in my knowledge.

Ah, so this wasn't only about the "testing them" part ;) I focused on tests only! I don't use either of those solutions, fwiw.

So far, I have tested by hand, but I have a few scripts where I turned the temperature down, and I seem to get the same results 100% of the time. This is for things like parsing salary details from a job posting, etc.
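A rough sketch of how the "same results 100% of the time" observation could become a check you can rerun. The names `consistency` and `fake_extract_salary` are hypothetical, and `fake_extract_salary` is a stub standing in for the real low-temperature LLM call.

```python
import json
from collections import Counter
from typing import Callable

def consistency(extract: Callable[[str], dict], text: str, runs: int = 5) -> float:
    """Run extract repeatedly; return the share of runs matching the most common output."""
    # Serialize with sorted keys so dicts with the same content compare equal.
    outputs = [json.dumps(extract(text), sort_keys=True) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs

# Hypothetical stand-in for an LLM call made with the temperature turned down.
def fake_extract_salary(posting: str) -> dict:
    return {"min": 90000, "max": 120000, "currency": "USD"}

posting = "Senior engineer, $90,000 - $120,000 per year"
score = consistency(fake_extract_salary, posting)
print(score)
```

A score below 1.0 on a structured-extraction task like this is a signal to look at the prompt or the output schema before trusting the script unattended.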

I don't struggle with writing them or testing them, but I have yet to find a good way to share and run them. I have several I can run from my terminal for scheduling meetings, but I don't really want a dozen agents running their own chat servers.

I am looking into PydanticAI because I use Pydantic for almost every one of my apps these days. I have also been using Anthropic's MCP framework, which is amazing, but it only runs locally so far. I suspect one or both of those are going to make it easier for me to run Agents across dozens of apps, but I'm not there yet.

I also haven't seen a great way to chain Agents together as some of the DAG frameworks (like Airflow) do with tasks. I could be overthinking it, but this feels like it should be tackled.
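To make the Airflow comparison concrete, here is a minimal sketch of agents chained as a DAG using only the standard library's `graphlib`. It is an illustration under assumptions, not an existing framework: `run_dag` and the three agents are hypothetical, and each agent is a plain function standing in for a real LLM-backed one.

```python
from graphlib import TopologicalSorter
from typing import Callable

# An agent receives the outputs of its upstream agents, keyed by name.
Agent = Callable[[dict], str]

def run_dag(agents: dict[str, Agent], deps: dict[str, set[str]]) -> dict[str, str]:
    """Run each agent after all the agents it depends on."""
    results: dict[str, str] = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, set())}
        results[name] = agents[name](upstream)
    return results

# Hypothetical agents; real ones would call a model or a tool.
agents = {
    "fetch": lambda up: "job posting text",
    "parse": lambda up: f"parsed({up['fetch']})",
    "schedule": lambda up: f"scheduled({up['parse']})",
}
# Each key maps to the set of agents that must finish first.
deps = {"fetch": set(), "parse": {"fetch"}, "schedule": {"parse"}}

results = run_dag(agents, deps)
print(results["schedule"])
```

This runs everything sequentially in one process. Retries, parallel branches, and persistence are the parts a real orchestrator would have to add.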

Interesting. So would you want some kind of push-to-deploy tool that also runs them?

It's more an issue of orchestrating them and weaving them together than of deployment and hosting.