Those building AI agents. What's the biggest pain when building and using them?

Question

I'm trying to understand the biggest pain points when building and using agents. For me, it's testing them. How the hell do you test the workflow?

Jeff Triplett ✨ · Accepted Answer

So far, I have tested by hand, but I have a few scripts where I turned the temperature down, and I seem to get the same results 100% of the time. This is for things like parsing salary details from a job posting, etc. 
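
A minimal sketch of that kind of repeatability check, assuming the agent can be wrapped as a plain function that takes text and returns a dict. `fake_salary_agent` and `check_repeatability` are hypothetical names for illustration; the stub stands in for a real model call made with a low temperature.

```python
import json
from collections import Counter
from typing import Callable


def check_repeatability(agent: Callable[[str], dict], prompt: str, runs: int = 5) -> Counter:
    """Call the agent `runs` times and count how often each distinct output appears."""
    outputs: Counter = Counter()
    for _ in range(runs):
        result = agent(prompt)
        # Serialize with sorted keys so key order alone never counts as a difference.
        outputs[json.dumps(result, sort_keys=True)] += 1
    return outputs


def fake_salary_agent(posting: str) -> dict:
    # Hypothetical stand-in for a real model call with the temperature turned down.
    return {"min": 120000, "max": 150000, "currency": "USD"}


counts = check_repeatability(fake_salary_agent, "Senior Engineer, $120k-$150k")
is_stable = len(counts) == 1  # True only if every run produced the same output
```

With a real model behind the function, more than one key in `counts` means the output drifted between runs, and the counter shows how often each variant appeared.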

I don't struggle with writing them or testing them, but I have yet to find a good way to share and run them. I have several I can run from my terminal for scheduling meetings, but I don't really want a dozen agents running their own chat servers. 

I am looking into PydanticAI because I use Pydantic for almost every one of my apps these days. I have also been using Anthropic's MCP (Model Context Protocol), which is amazing, but it only runs locally so far. I suspect one or both of those are going to make it easier for me to run agents across dozens of apps, but I'm not there yet. 

I also haven't seen a great way to chain Agents together as some of the DAG frameworks (like Airflow) do with tasks. I could be overthinking it, but this feels like it should be tackled.
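
As a rough illustration of what DAG-style chaining could look like, here is a sketch that uses the standard library's `graphlib.TopologicalSorter` to run agents in dependency order. It is not how Airflow or any agent framework does it; the agent names and the lambda stubs are assumptions standing in for real model calls.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# An agent here is any callable that receives its upstream results and returns text.
Agent = Callable[[Dict[str, str]], str]


def run_pipeline(agents: Dict[str, Agent], deps: Dict[str, Set[str]]) -> Dict[str, str]:
    """Run each agent after the agents it depends on, passing their outputs along."""
    results: Dict[str, str] = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {dep: results[dep] for dep in deps.get(name, set())}
        results[name] = agents[name](upstream)
    return results


# Hypothetical stub agents; each would wrap a real model call in practice.
agents: Dict[str, Agent] = {
    "fetch": lambda up: "Senior Engineer, $120k-$150k",
    "extract": lambda up: up["fetch"].split(", ")[1],
    "summarize": lambda up: f"Salary range: {up['extract']}",
}

# Maps each agent to the set of agents whose output it needs.
deps = {"extract": {"fetch"}, "summarize": {"extract"}}

results = run_pipeline(agents, deps)
```

This runs everything sequentially in one process. Retries, parallel branches, and scheduling are the parts a real orchestrator would have to add.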

colin charles · Answer

Ah, so this wasn't only about the "testing them" part ;) I focused on tests only! I don't use either of those solutions, fwiw.

Steven Irby · Answer

trigger.dev looks very promising. 

I think a hybrid approach might be the fastest/easiest. Use https://n8n.io/ai with real code if you need it.

Honestly, I need to build out more agents to understand fully. I have a lot of gaps in my knowledge.

colin charles · Answer

@Nomadsteve did you have one in mind, particularly? i ended up writing my own, might open source down the line

Steven Irby · Answer

Interesting. So, no framework really worked for you?

Jeff Triplett ✨ · Answer

It's more an issue of orchestrating them and weaving them together than of deployment and hosting.

colin charles · Answer

I find you need to create your own benchmarking tool to focus on what you want. I've built one for RAG pipelines: a framework for creating, managing, and running benchmarks with various question types and validation methods. It's basically a test+eval tool, and the same approach applies to testing a workflow. You can use an LLM-assisted editor (VS Code + Copilot, Cursor) to help you build this in Python, with a bit of work.
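
To make the idea concrete, here is a small sketch of a test+eval harness with two validation methods. It is not the tool described above, only an illustration; `Case`, `exact`, `contains`, `run_benchmark`, and `fake_pipeline` are all hypothetical names, and the stub pipeline stands in for a real RAG or agent call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    question: str
    validate: Callable[[str], bool]  # returns True if the answer is acceptable


def exact(expected: str) -> Callable[[str], bool]:
    """Validator: answer must match the expected text, ignoring case and outer whitespace."""
    return lambda answer: answer.strip().lower() == expected.lower()


def contains(expected: str) -> Callable[[str], bool]:
    """Validator: answer must mention the expected text somewhere."""
    return lambda answer: expected.lower() in answer.lower()


def run_benchmark(pipeline: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases whose answer passes its validator."""
    passed = sum(1 for case in cases if case.validate(pipeline(case.question)))
    return passed / len(cases)


def fake_pipeline(question: str) -> str:
    # Hypothetical stand-in for a real pipeline; returns canned answers.
    canned = {
        "What currency is the salary in?": "usd",
        "What is the salary range?": "Between $120k and $150k",
    }
    return canned.get(question, "Not stated")


cases = [
    Case("What currency is the salary in?", exact("USD")),
    Case("What is the salary range?", contains("120k")),
    Case("Is the role remote?", exact("yes")),
]

score = run_benchmark(fake_pipeline, cases)  # 2 of the 3 cases pass with the stub
```

Keeping validators as plain functions makes it easy to add other question types later, for example a numeric-tolerance check or an LLM-graded comparison.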

Steven Irby · Answer

Interesting. So would you want some kind of push-to-deploy-and-run tool?
