We're building AI agents that take real actions — refunds, database writes, API calls.
Prompt instructions like "never do X" don't hold up. LLMs ignore them when context is long or users push hard.
Curious how others are handling this:
- Hard-coded checks before every action?
- Some middleware layer?
- Just hoping for the best?
We built a control layer for this — different methods for structured data, unstructured outputs, and guardrails (https://limits.dev). Genuinely want to learn how others approach it.
In my setup, agents propose actions and write structured reports. A deterministic quality advisory then runs — no LLM involved — producing a verdict (approve, hold, redispatch) based on pre-registered rules and open items. The agent can hallucinate all it wants inside its context window, but the only way its work reaches production is through a receipt that links output to a specific git commit, with a quality gate in between.
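To make the shape of that advisory concrete, here's a minimal sketch (all names and fields are hypothetical, not the actual system): a structured report carries a commit reference, open items, and named rule results, and a plain function — no LLM call anywhere — maps it to one of the three verdicts.

```python
# Hypothetical sketch of a deterministic quality advisory.
# Field names and rules are illustrative, not the real schema.
from dataclasses import dataclass

@dataclass
class Report:
    commit_sha: str          # git commit the receipt links the work to
    open_items: list[str]    # unresolved issues the agent reported
    checks: dict[str, bool]  # pre-registered rule name -> pass/fail

def advise(report: Report) -> str:
    """Return a verdict: approve, hold, or redispatch."""
    if not report.commit_sha:
        return "redispatch"      # no receipt link: send the work back
    if report.open_items:
        return "hold"            # open items need a human decision
    if all(report.checks.values()):
        return "approve"
    return "redispatch"          # a pre-registered rule failed

print(advise(Report("a1b2c3d", [], {"tests_pass": True})))  # approve
print(advise(Report("a1b2c3d", ["flaky test"], {"tests_pass": True})))  # hold
```

The point of the sketch is that the verdict is a pure function of the report: same report, same verdict, every time.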
For anything with real consequences (database writes, API calls, refunds), the pattern is: LLM proposes → deterministic validator checks → human approves. The LLM never has direct write access to anything that matters.
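A toy version of that pipeline, with stand-ins for the model call and the approval queue (everything here is illustrative — the limit, the action names, the helpers are all made up):

```python
# Sketch of: LLM proposes -> deterministic validator checks -> human approves.
# All names and the MAX_REFUND limit are hypothetical.

MAX_REFUND = 50  # enforced in code, not in the prompt

def llm_propose(task: str) -> dict:
    """Stand-in for the model call; returns a structured proposal."""
    return {"action": "refund", "amount": 40, "order_id": "o-123"}

def validate(proposal: dict) -> tuple[bool, str]:
    """Deterministic checks the LLM cannot talk its way around."""
    if proposal["action"] == "refund" and proposal["amount"] > MAX_REFUND:
        return False, "refund exceeds limit"
    return True, "ok"

def human_approves(proposal: dict) -> bool:
    """Stand-in for an approval queue / dashboard click."""
    return True

def execute(proposal: dict) -> str:
    # The only code path with write access to anything real.
    return f"executed {proposal['action']} for {proposal['order_id']}"

proposal = llm_propose("customer complaint")
ok, reason = validate(proposal)
if ok and human_approves(proposal):
    print(execute(proposal))
else:
    print("rejected:", reason)
```

The design choice worth noting: `execute` is only reachable through `validate`, so a jailbroken or confused model can at worst produce a rejected proposal.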
"Just hoping for the best" works until it doesn't. We tracked every agent decision in an append-only ledger — after a few hundred entries, you start seeing exactly where and how agents fail. That pattern data is more useful than any prompt guard.
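For anyone wanting to try the ledger idea, here's a toy append-only version, assuming a hash-chain approach (the commenter doesn't specify their implementation): each entry hashes the previous one, so any after-the-fact edit breaks verification.

```python
# Toy append-only decision ledger (hash-chained; assumption, not the
# commenter's actual implementation). Each entry commits to the one before it.
import hashlib
import json

ledger: list[dict] = []

def entry_hash(prev: str, decision: dict) -> str:
    body = json.dumps(decision, sort_keys=True)
    return hashlib.sha256((prev + body).encode()).hexdigest()

def append(decision: dict) -> None:
    prev = ledger[-1]["hash"] if ledger else "genesis"
    ledger.append({"decision": decision, "prev": prev,
                   "hash": entry_hash(prev, decision)})

def verify() -> bool:
    """Recompute the chain; any tampered entry fails."""
    prev = "genesis"
    for e in ledger:
        if e["prev"] != prev or e["hash"] != entry_hash(prev, e["decision"]):
            return False
        prev = e["hash"]
    return True

append({"agent": "a1", "action": "refund", "verdict": "hold"})
append({"agent": "a1", "action": "refund", "verdict": "approve"})
print(verify())  # True
```

The failure-pattern mining the commenter describes is then just queries over `ledger` — group by agent and verdict and the hotspots fall out.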