An agent that sounds right and an agent that is right are indistinguishable at the moment they answer. The confident tone is the same, the formatting is the same, the ticker is real either way. The only thing that separates them is whether you can trace the answer back to something that exists. For market claims, SEC filings are that something: public, timestamped, and free to read. This post is a practical loop for using them to check what your agent says.

Why finance answers need an audit trail

In most domains, a wrong AI answer costs you a correction. In finance it can cost you a decision made on a number that was never real. That asymmetry is why we argue for source-linked data in “Why AI agents keep inventing market numbers,” and it is why evaluation matters even after you fix the data layer. Source-linked data raises the floor. Evaluation is how you find out where the floor actually is. None of this is a recommendation to buy or sell; it is a recommendation to check.

Rule one: no citation, no credit

The simplest evaluation policy is also the strictest: if the agent makes a factual market claim without pointing at a source, score it wrong, even if the claim happens to be true. A true answer the agent cannot support is luck, and luck does not generalize. This rule also changes agent behavior upstream. Once your prompts and tools make citing easy, “cite the source you used” stops being a burden and starts being the default path. We closed our MCP how-to with exactly that advice; this post is the expanded version of the loop.
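As a minimal sketch of that policy, the scoring function below gives credit only when a claim is both cited and verified. The Claim type, its field names, and the placeholder URL are assumptions made for illustration; they are not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str                   # the factual market claim as stated in the answer
    source_url: Optional[str]   # citation the agent attached, if any (hypothetical field)


def score_claim(claim: Claim, verified_true: bool) -> int:
    """Return 1 only when the claim is both cited and verified."""
    if not claim.source_url:
        return 0  # no citation, no credit, even if the claim happens to be true
    return 1 if verified_true else 0


lucky = Claim("An officer of company X bought shares this week.", None)
cited = Claim("An officer of company X bought shares this week.",
              "https://example.com/filing")  # placeholder URL, not a real filing

print(score_claim(lucky, verified_true=True))  # 0
print(score_claim(cited, verified_true=True))  # 1
```

Putting the citation check ahead of the truth check is deliberate: the score then measures whether an answer is auditable, which you can check, rather than whether the agent got lucky, which you cannot.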

Resolving a claim to EDGAR by hand

The manual loop is unglamorous and works:

  • Take one concrete claim from the answer, for example “an officer of company X bought shares this week.”
  • Open EDGAR full-text search and find the company’s recent filings, or scan the latest filings feed if the claim is about something recent.
  • Find the specific filing the claim should rest on and read the relevant lines.
  • Compare: the entity, the date, the direction, the rough size.

If the agent’s answer included a source link, this takes a minute. If you had to hunt for the filing yourself, that is a finding too: the answer was not auditable as delivered.
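The four-way comparison in the last step can be written down as a small checklist. This is a sketch under stated assumptions: the Fact type, the seven-day window, and the 25% size tolerance are illustrative choices, and the example values are made up rather than taken from any filing.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Fact:
    entity: str
    on: date
    direction: str   # "buy" or "sell"
    shares: float


def compare(claim: Fact, filing: Fact, max_days: int = 7, size_tol: float = 0.25) -> dict:
    """Return one pass/fail flag per dimension of the manual comparison."""
    return {
        "entity": claim.entity.strip().lower() == filing.entity.strip().lower(),
        "date": abs((claim.on - filing.on).days) <= max_days,
        "direction": claim.direction == filing.direction,
        # "rough size": within a tolerance of the filed amount, not an exact match
        "size": abs(claim.shares - filing.shares) <= size_tol * filing.shares,
    }


# Made-up values: what the answer said versus what you read in the filing.
claimed = Fact("Company X", date(2024, 5, 8), "buy", 10_000)
filed = Fact("company x", date(2024, 5, 6), "buy", 9_500)

result = compare(claimed, filed)
print(all(result.values()))  # True
```

Returning a flag per dimension, rather than a single boolean, keeps the reason for a mismatch visible when you review the result later.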

Automating the check: compare tool output to the answer

Once the manual loop works, automate the comparison, but be careful what you compare. The useful check is not “does this answer seem right,” judged by another model with no data. It is: the agent called sec_insider_trades or get_signal, the tool returned structured rows, and the final answer’s claims should match those rows. Entity matches, direction matches, date matches. That check is mechanical, cheap, and does not require trusting a second model’s vibes. Anything the answer asserts that appears in no tool output gets flagged.
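Here is one way that mechanical check might look. The row shape (dicts with entity, direction, and date keys) is an assumption for illustration; the actual output of sec_insider_trades or get_signal may be structured differently, and claims would first need to be extracted from the answer into the same shape.

```python
KEYS = ("entity", "direction", "date")


def _key(record: dict) -> tuple:
    """Normalize the compared fields so case differences do not cause false flags."""
    return tuple(str(record[k]).strip().lower() for k in KEYS)


def unsupported_claims(claims: list, tool_rows: list) -> list:
    """Return every claim that matches no tool row on entity, direction, and date."""
    seen = {_key(row) for row in tool_rows}
    return [claim for claim in claims if _key(claim) not in seen]


# Hypothetical tool output and hypothetical claims extracted from the final answer.
rows = [{"entity": "Company X", "direction": "sell", "date": "2024-05-06"}]
claims = [
    {"entity": "Company X", "direction": "buy", "date": "2024-05-06"},   # misstated
    {"entity": "company x", "direction": "sell", "date": "2024-05-06"},  # supported
]

flagged = unsupported_claims(claims, rows)
print(len(flagged))  # 1
```

The check uses exact matching on normalized fields on purpose. A fuzzy match would pass more claims, but it would also bring back the judgment call this step is meant to remove.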

What to log

You cannot evaluate what you did not record. For each agent run, keep four things: the question asked, the tools called with their parameters, the sources each tool returned, and the final claims in the answer. With those four, every failure becomes diagnosable: you can see whether the agent asked the wrong question, called the wrong tool, received bad data, or received good data and misstated it. Without them, every failure looks the same, which is to say, invisible.
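A log record covering those four things could be as small as the sketch below. The class names, field names, and the ticker parameter are assumptions for illustration; only the tool name sec_insider_trades comes from this post, and its real parameters may differ.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ToolCall:
    name: str                                     # tool that was called
    params: dict                                  # parameters it was called with
    sources: list = field(default_factory=list)   # sources the tool returned


@dataclass
class RunRecord:
    question: str                                    # the question asked
    tool_calls: list = field(default_factory=list)   # ToolCall entries, in call order
    claims: list = field(default_factory=list)       # final claims in the answer

    def to_json_line(self) -> str:
        """Serialize the run as one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = RunRecord(
    question="Did any officer of company X buy shares this week?",
    tool_calls=[ToolCall("sec_insider_trades", {"ticker": "XYZ"},  # hypothetical params
                         ["https://example.com/filing"])],         # placeholder source
    claims=["An officer of company X bought shares this week."],
)
line = record.to_json_line()
print(json.loads(line)["tool_calls"][0]["name"])  # sec_insider_trades
```

Keeping the sources attached to the tool call that returned them, rather than in one flat list, is what later lets you tell bad data apart from good data that was misstated.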

What checking actually catches

In practice this loop surfaces a few recurring failure modes. Misread data: the tool returned a sale and the answer said buy. Wrong entity: a similarly named company or a person with a common name. Stale grounding: the answer rests on an older filing when a newer one exists. And unsupported precision: a specific figure that appears in no returned source. None of these are exotic, and all of them survive a good data layer, because a model can still misread good data. QuantConomy’s side of the loop is making the trail exist: signals carry their reasoning and source trail, and insider rows link back to the EDGAR filings they came from, so the comparison step has something real to compare against.
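Those failure modes can be told apart mechanically once claims and tool rows share a shape. The sketch below is one possible triage order, not a description of how QuantConomy classifies anything; the field names, the labels, and the extra no_matching_row label are assumptions, and dates are assumed to be ISO strings so that string comparison orders them correctly.

```python
def diagnose(claim: dict, rows: list) -> str:
    """Label a claim with the first failure mode it exhibits, or "ok"."""
    same_entity = [r for r in rows if r["entity"] == claim["entity"]]
    if not same_entity:
        return "wrong_entity"            # no returned row is about this entity
    matched = [r for r in same_entity if r["date"] == claim["date"]]
    if not matched:
        return "no_matching_row"         # right entity, but no row for that date
    row = matched[0]
    if row["direction"] != claim["direction"]:
        return "misread_data"            # e.g. the tool said sale, the answer said buy
    if row["shares"] != claim["shares"]:
        return "unsupported_precision"   # the figure appears in no returned row
    if claim["date"] < max(r["date"] for r in same_entity):
        return "stale_grounding"         # a newer row exists for the same entity
    return "ok"


# Made-up rows standing in for tool output.
rows = [
    {"entity": "Company X", "date": "2024-05-06", "direction": "sell", "shares": 9500},
    {"entity": "Company X", "date": "2024-03-01", "direction": "buy", "shares": 1000},
]

print(diagnose({"entity": "Company X", "date": "2024-05-06",
                "direction": "buy", "shares": 9500}, rows))  # misread_data
```

The function reports only the first problem it finds, so a claim that is both stale and misread will show up as misread. If you need every applicable label, collecting them into a list is a small change.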

Where this leaves you

Checking does not scale to every answer, and it does not have to. Check a sample, log everything, and tighten wherever the trail breaks. An answer you can trace is worth more than an answer that merely sounds certain, and the only way to know which one your agent gives you is to follow the trail yourself.