Cloud-native news chatbot
Real-time news summarization for one of Italy's major digital outlets. Multi-agent architecture with persistent memory, streaming responses, and semantic retrieval over a live corpus.
why this problem was worth solving
News consumption is fragmented. Readers want context fast — what happened, why it matters, what to read next — but the editorial product is still organized as a stream of articles. The client needed a conversational layer on top of their content that could summarize, compare, and recommend across their entire live corpus, without producing hallucinations or stale answers.
There were three hard constraints: sub-200ms latency at scale (chat is unusable past that threshold), strict factual grounding in the client's editorial archive (no LLM guesses), and predictable serverless cost under unknown traffic shapes, with peaks driven by breaking news rather than by user growth curves.
"A news chatbot that hallucinates is a liability. The architecture had to make hallucination structurally hard, not just statistically rare."
technical design and system components
The system runs as a multi-agent orchestration on AWS Lambda. A coordinator agent receives the user query, dispatches to specialized retrieval and generation agents, and streams the synthesized response back. Retrieval happens against a continuously indexed corpus of editorial content — updated in near-real-time as the newsroom publishes.
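The coordinator flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: `retrieval_agent`, `generation_agent`, `coordinator`, and the two-item in-memory corpus are hypothetical stand-ins, and keyword overlap replaces the real semantic search.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Snippet:
    article_id: str
    text: str


async def retrieval_agent(query: str) -> list[Snippet]:
    # Hypothetical stand-in for semantic search over the live corpus:
    # a snippet matches when it shares at least one word with the query.
    corpus = [
        Snippet("a1", "The city council approved the new transit budget."),
        Snippet("a2", "The national team won the qualifier 2-1."),
    ]
    words = set(query.lower().split())
    return [
        s for s in corpus
        if words & set(s.text.lower().rstrip(".").split())
    ]


async def generation_agent(query: str, snippets: list[Snippet]) -> AsyncIterator[str]:
    # Hypothetical stand-in for the LLM call: yields the answer piece by piece,
    # each piece carrying the id of the article it came from.
    for s in snippets:
        yield f"{s.text} [{s.article_id}] "


async def coordinator(query: str) -> AsyncIterator[str]:
    # Retrieve first, then stream the generated pieces straight through.
    snippets = await retrieval_agent(query)
    async for piece in generation_agent(query, snippets):
        yield piece


async def collect(query: str) -> str:
    # Helper for the example: gathers the stream into one string.
    return "".join([piece async for piece in coordinator(query)])


if __name__ == "__main__":
    print(asyncio.run(collect("transit budget")))
```

The point of the shape is that the coordinator never buffers the full answer: it forwards each piece as the generation agent produces it.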
Every answer is grounded in retrieved snippets with explicit citations. The generation agent runs with a strict retrieval-or-refuse policy: if no relevant context is found, the bot says so rather than fabricating.
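A minimal sketch of a retrieval-or-refuse gate, assuming the retriever returns scored hits. The names `Hit`, `MIN_SCORE`, `REFUSAL`, and the prompt wording are hypothetical, and the threshold value is illustrative rather than a figure from the project.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "I could not find coverage of this in the archive."
MIN_SCORE = 0.75  # hypothetical relevance threshold, not a measured value


@dataclass
class Hit:
    article_id: str
    text: str
    score: float  # similarity score from the retriever, higher is better


def build_grounded_prompt(question: str, hits: list[Hit]) -> Optional[str]:
    """Return a prompt built only from cited snippets, or None to signal refusal."""
    relevant = [h for h in hits if h.score >= MIN_SCORE]
    if not relevant:
        return None
    context = "\n".join(
        f"[{i}] ({h.article_id}) {h.text}" for i, h in enumerate(relevant, 1)
    )
    return (
        "Answer using only the numbered snippets and cite them as [n].\n"
        f"{context}\n"
        f"Question: {question}"
    )


def answer(question: str, hits: list[Hit], llm: Callable[[str], str]) -> str:
    prompt = build_grounded_prompt(question, hits)
    if prompt is None:
        # The model is never called without retrieved context.
        return REFUSAL
    return llm(prompt)
```

Placing the refusal before the model call, rather than asking the model to decline, is what makes the failure mode structural: with no context there is no generation step at all.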
the trade-offs that mattered
Traffic shape was unpredictable: flat during the night, vertical spikes on breaking news. AWS Lambda with provisioned concurrency gave us predictable cost and kept cold starts off the request path, without paying for an always-on fleet sized for peak traffic.
Articles were split by editorial unit (lede, body sections, pull quotes), not by fixed token windows. Retrieval quality improved measurably because the LLM received self-contained context units instead of fragments.
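One way to picture chunking by editorial unit, as a sketch only. The article schema (`id`, `headline`, `lede`, `sections`, `pull_quotes`) is an assumption, and so is prefixing each chunk with the headline to keep it self-contained.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    article_id: str
    kind: str  # "lede", "section", or "pull_quote"
    text: str


def chunk_article(article: dict) -> list[Chunk]:
    """Split one article into chunks that follow its editorial structure."""
    article_id = article["id"]
    headline = article["headline"]

    def make(kind: str, body: str) -> Chunk:
        # Assumption: carrying the headline gives each chunk enough context
        # to be read on its own.
        return Chunk(article_id, kind, f"{headline}\n{body}")

    chunks = []
    if article.get("lede"):
        chunks.append(make("lede", article["lede"]))
    for section in article.get("sections", []):
        chunks.append(make("section", section))
    for quote in article.get("pull_quotes", []):
        chunks.append(make("pull_quote", quote))
    return chunks


example = {
    "id": "a1",
    "headline": "Council approves transit budget",
    "lede": "The vote passed late on Tuesday.",
    "sections": ["Funding details.", "Reactions from commuters."],
    "pull_quotes": ["A long-awaited decision."],
}
chunks = chunk_article(example)
```

Unlike a fixed token window, no chunk here can start or end in the middle of a sentence, because the boundaries come from the article itself.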
Perceived latency drops dramatically when the first token lands fast. The orchestrator streams the LLM output through FastAPI's SSE channel — users see the answer forming, not a spinner.
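The streaming path can be illustrated with a small Server-Sent Events framer. This is a sketch using only the standard library: `sse_event`, `sse_stream`, the `token` and `done` event names, and the fake token source are all hypothetical.

```python
import asyncio
from typing import AsyncIterator


def sse_event(data: str, event: str = "message") -> str:
    """Format one Server-Sent Events frame; multi-line data gets one data: line each."""
    payload = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"event: {event}\n{payload}\n"


async def fake_llm_tokens() -> AsyncIterator[str]:
    # Hypothetical stand-in for the LLM provider's token stream.
    for token in ["Rome", " approves", " budget"]:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield token


async def sse_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    # Forward every token as soon as it arrives, then signal completion.
    async for token in tokens:
        yield sse_event(token, event="token")
    yield sse_event("", event="done")


async def collect_frames() -> list[str]:
    return [frame async for frame in sse_stream(fake_llm_tokens())]
```

In a FastAPI endpoint, a generator like `sse_stream(...)` could be returned as `StreamingResponse(sse_stream(...), media_type="text/event-stream")`, so the browser receives each frame as soon as it is produced.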
When no relevant article is found, the bot refuses to answer rather than falling back to general world knowledge. This single rule eliminated the entire class of "confidently wrong about Italian news" failure modes.
measured in production
honest retrospective
The semantic chunking pass was implemented only after the first quality complaints from the editorial team. In hindsight, investing in retrieval quality early would have saved roughly six weeks of prompt-engineering band-aids that tried to compensate for noisy context. The mantra "retrieval beats prompting" feels obvious now, but only after the fact.
The second lesson is about cost observability: serverless cost is hard to forecast when LLM provider tokens dominate the bill. I'd build the cost dashboard before launch, not after.
Let's talk architecture before code.
If you're scoping a production AI system — retrieval, agents, or cloud-native infra — a 30-min technical chat usually saves weeks downstream.