12,400 news sources is not the product. Nobody wants to read 12,400 sources, and nobody should. The product is what survives the ranking: the short list of scored, sourced events that reach you after everything else has been absorbed, grouped, and set aside. This post is about what that volume forces you to build. We have written a conceptual walkthrough of the pipeline before; this is the scale side of the same story.
The scale problem, stated plainly
The numbers are simple to state and awkward to live with: 12,400 news sources across 2,600 publishers, 29.4 million articles processed so far, 363,600 assets monitored, and each entry processed in about six seconds. At that volume, “read everything” stops being a virtue and becomes a liability. Every article you ingest is an article someone could be shown. The engineering question is never how to read more. It is how to make sure almost none of it reaches a person.
Publishers are not interchangeable
One early consequence of scale: you cannot treat sources as an undifferentiated pile. A wire report, a regional outlet, and a press-release aggregator can carry the same sentence and mean very different things. That is why every entry carries its publisher, along with its language, sentiment, and the entities and assets it touches. Publisher identity is not metadata for its own sake — it is what lets you follow the outlets you trust, and it travels with the entry all the way to the output, so you always know who said what.
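As a rough illustration of what "every entry carries its publisher" could look like in code, here is a minimal sketch. The Entry class, its field names, the sentiment scale, and the from_trusted helper are all assumptions made for this example, not the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One enriched article. Field names are illustrative, not the real schema."""
    title: str
    url: str
    publisher: str
    language: str
    sentiment: float  # assumed scale: -1.0 (negative) to 1.0 (positive)
    entities: tuple[str, ...] = ()
    assets: tuple[str, ...] = ()


def from_trusted(entries, trusted_publishers):
    """Keep only entries whose publisher is in the caller's trusted set."""
    trusted = set(trusted_publishers)
    return [e for e in entries if e.publisher in trusted]


# The same sentence from two very different kinds of outlet (made-up data).
entries = [
    Entry("Acme cuts guidance", "https://example.com/a", "Example Wire", "en", -0.6,
          entities=("Acme Corp",), assets=("ACME",)),
    Entry("Acme cuts guidance", "https://example.com/b", "PR Aggregator", "en", -0.1,
          entities=("Acme Corp",), assets=("ACME",)),
]

kept = from_trusted(entries, {"Example Wire"})
print([(e.publisher, e.url) for e in kept])
```

Because the publisher is a field on the entry rather than something looked up later, the filter and the final output can both say who published what without a second query.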
The same story, forty times
The second consequence is duplication. A material event does not arrive once; it arrives as a wire item, then as rewrites, syndicated copies, and follow-ups across dozens of outlets. Left alone, one event becomes forty rows, and the feed rewards whoever repeats things loudest. So related coverage is grouped into clusters and topics: the copies attach to the story they belong to instead of piling up as separate items. Forty articles collapse toward one event, and the volume of coverage becomes information about the event rather than noise on top of it.
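To make the grouping idea concrete, here is a toy sketch that clusters headlines by word overlap (Jaccard similarity) with a greedy pass. This is a stand-in technique chosen for brevity, not a description of how the production clustering works; the 0.5 threshold and the function names are assumptions.

```python
import re


def tokens(title):
    """Lowercase word set of a headline, used as a crude story fingerprint."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Share of words two headlines have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(titles, threshold=0.5):
    """Greedy grouping: attach each headline to the first cluster whose
    seed headline is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (seed_tokens, [member titles])
    for title in titles:
        t = tokens(title)
        for seed, members in clusters:
            if jaccard(seed, t) >= threshold:
                members.append(title)
                break
        else:
            clusters.append((t, [title]))
    return [members for _, members in clusters]


headlines = [
    "Acme Corp cuts full-year guidance",
    "Acme Corp cuts guidance for full year",
    "Central bank holds rates steady",
]
groups = cluster(headlines)
for members in groups:
    # The size of a cluster is the "volume of coverage" for that event.
    print(len(members), members[0])
```

The two Acme headlines land in one cluster and the rate decision stays on its own, so three rows become two events, and the count of members in each cluster is kept as a signal of how widely the event was covered.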
Where volume becomes judgment
Enrichment and grouping still leave you with far too much. The compression step is scoring. Each event gets a type, a direction from very bullish to very bearish, a strength from 0 to 100, and a short plain-English reason. This is the structure described in “what a market signal is”. Scoring is where the pipeline stops describing the world and starts making a judgment about what is unusual enough to surface. Most things score low. That is the point. A ranked feed only works if the ranking is allowed to bury things.
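The four fields of a scored event, and the idea that ranking is allowed to bury things, could be sketched as follows. The five-step direction scale, the type names, the source_url field, and the cutoff of 70 are assumptions for illustration; the post only fixes the endpoints of the direction scale and the 0 to 100 strength range.

```python
from dataclasses import dataclass

# Assumed five-step scale; the post only names the two endpoints.
DIRECTIONS = ("very bearish", "bearish", "neutral", "bullish", "very bullish")


@dataclass(frozen=True)
class Signal:
    type: str        # hypothetical type name, e.g. "guidance_cut"
    direction: str   # one of DIRECTIONS
    strength: int    # 0 to 100
    reason: str      # short plain-English explanation
    source_url: str  # link back to the article or filing

    def __post_init__(self):
        if self.direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {self.direction!r}")
        if not 0 <= self.strength <= 100:
            raise ValueError("strength must be between 0 and 100")


def surface(signals, min_strength=70, limit=10):
    """Rank by strength, strongest first, and bury everything below the cutoff."""
    strong = [s for s in signals if s.strength >= min_strength]
    return sorted(strong, key=lambda s: s.strength, reverse=True)[:limit]


signals = [
    Signal("product_update", "neutral", 12, "Routine release note.",
           "https://example.com/1"),
    Signal("guidance_cut", "very bearish", 88, "Full-year guidance lowered.",
           "https://example.com/2"),
    Signal("analyst_note", "bullish", 35, "Price target nudged up.",
           "https://example.com/3"),
    Signal("contract_win", "bullish", 74, "Large multi-year contract announced.",
           "https://example.com/4"),
]

top = surface(signals)
print([(s.type, s.strength) for s in top])
```

Two of the four made-up events never reach the reader, which is the behaviour the ranking is meant to have.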
Why the source link survives
Every signal keeps a link back to the article or filing it came from, and the publisher stays attached. This is deliberate. A compressed feed asks you to trust a lot of machinery — enrichment, clustering, scoring — and the only honest way to earn that trust is to make every output checkable in one click. If a signal looks wrong, you open the source and decide for yourself. A feed that cannot be audited is just an opinion with a timestamp.
What this looks like from the API side
The same layers are exposed programmatically. The financial news API serves the enriched entries — sentiment, language, publisher, entities, clusters — and the signal endpoints serve what survived the scoring, each row carrying its type, direction, strength, reason, and source. search_entries reads the raw-but-enriched layer; list_signals reads the compressed one. Which layer you want depends on whether you are building your own judgment or borrowing ours. More on how the whole system fits together is on the about page.
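The split between the two layers can be illustrated with in-memory stand-ins. Only the names search_entries and list_signals come from this post; the parameters, field names, and sample rows below are assumptions made for the sketch and do not describe the real API's signatures.

```python
# In-memory stand-ins for the two layers. Every parameter and field here is
# a hypothetical illustration, not the real request or response shape.
ENTRIES = [
    {"title": "Acme cuts guidance", "publisher": "Example Wire", "language": "en",
     "sentiment": -0.6, "cluster": "acme-guidance", "url": "https://example.com/a"},
    {"title": "Acme lowers outlook", "publisher": "Regional Daily", "language": "en",
     "sentiment": -0.4, "cluster": "acme-guidance", "url": "https://example.com/b"},
    {"title": "Central bank holds rates", "publisher": "Example Wire", "language": "en",
     "sentiment": 0.0, "cluster": "rates-hold", "url": "https://example.com/c"},
]

SIGNALS = [
    {"type": "guidance_cut", "direction": "bearish", "strength": 82,
     "reason": "Full-year guidance lowered.", "source": "https://example.com/a"},
    {"type": "rate_decision", "direction": "neutral", "strength": 20,
     "reason": "Rates unchanged, as expected.", "source": "https://example.com/c"},
]


def search_entries(query, language=None):
    """Raw-but-enriched layer: every matching article, duplicates included."""
    q = query.lower()
    return [e for e in ENTRIES
            if q in e["title"].lower()
            and (language is None or e["language"] == language)]


def list_signals(min_strength=0):
    """Compressed layer: only what survived scoring, strongest first."""
    kept = [s for s in SIGNALS if s["strength"] >= min_strength]
    return sorted(kept, key=lambda s: s["strength"], reverse=True)


acme_entries = search_entries("acme")
strong_signals = list_signals(min_strength=50)
print(len(acme_entries), "entries,", len(strong_signals), "signal")
```

The entry layer returns both Acme articles and leaves the judgment to the caller; the signal layer returns one row for the same event, with its source link still attached.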
None of this is a recommendation to buy or sell anything; a signal is a structured fact with a source, and what you do with it is your call. But the design principle behind the whole pipeline is worth stating once, plainly: the number that matters is not how much we read. It is how little we make you read.