CASE STUDY · AI · agents · cloud · live

Cloud-native news chatbot

Real-time news summarization for one of Italy's major digital outlets. Multi-agent architecture with persistent memory, streaming responses, and semantic retrieval over a live corpus.

2M+ monthly active users
<180ms first-token latency (p95)
99.9% uptime SLA
live real-time news corpus
client: major Italian news outlet
role: lead engineer
year: 2025 to present
status: in production
01 The challenge

why this problem was worth solving

News consumption is fragmented. Readers want context fast — what happened, why it matters, what to read next — but the editorial product is still organized as a stream of articles. The client needed a conversational layer on top of their content that could summarize, compare, and recommend across their entire live corpus, without producing hallucinations or stale answers.

There were three hard constraints: sub-200ms latency at scale (chat is unusable past that threshold), strict factual grounding in the client's editorial archive (no LLM guesses), and predictable serverless costs under unknown traffic shapes, with peaks driven by breaking news rather than by user growth curves.

// design constraint

"A news chatbot that hallucinates is a liability. The architecture had to make hallucination structurally hard, not just statistically rare."

02 Approach & architecture

technical design and system components

The system runs as a multi-agent orchestration on AWS Lambda. A coordinator agent receives the user query, dispatches to specialized retrieval and generation agents, and streams the synthesized response back. Retrieval happens against a continuously indexed corpus of editorial content — updated in near-real-time as the newsroom publishes.

Every answer is grounded in retrieved snippets with explicit citations. The generation agent runs with a strict retrieval-or-refuse policy: if no relevant context is found, the bot says so rather than fabricating.
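
A minimal sketch of what citation grounding can look like, not the production code. `Snippet`, `build_grounded_prompt`, `cited_ids`, and `is_grounded` are hypothetical names, and the `[n]` citation format is an assumption made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """A retrieved passage; field names are illustrative."""
    article_url: str
    text: str


def build_grounded_prompt(question: str, snippets: list[Snippet]) -> str:
    """Number each snippet so the model can cite it as [1], [2], ..."""
    sources = "\n".join(
        f"[{i}] ({s.article_url}) {s.text}" for i, s in enumerate(snippets, start=1)
    )
    return (
        "Answer using ONLY the sources below and cite them as [n].\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def cited_ids(answer: str) -> set[int]:
    """Extract the citation numbers that appear in a generated answer."""
    return {int(m) for m in re.findall(r"\[(\d+)\]", answer)}


def is_grounded(answer: str, snippets: list[Snippet]) -> bool:
    """Accept an answer only if it cites at least one source and every
    citation points at a snippet that was actually retrieved."""
    ids = cited_ids(answer)
    return bool(ids) and all(1 <= i <= len(snippets) for i in ids)
```

A check like `is_grounded` only verifies that citations point at retrieved sources; it does not verify that the cited text supports the claim, which still needs editorial QA.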

High-level architecture (simplified)
USER QUERY (web / mobile) -> COORDINATOR (multi-agent routing) -> STREAMED REPLY (with citations)
RETRIEVAL: semantic search · vector store
GENERATION: LLM grounded · refuse if empty
MEMORY: session context · per-user
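
The flow in the diagram can be sketched as a small async coordinator. This is an illustration under assumptions: the `Coordinator` class, the agent signatures, and the in-memory dict standing in for the per-user session store are all hypothetical, and the real agents are not shown in this write-up.

```python
import asyncio
from collections import defaultdict
from typing import AsyncIterator, Awaitable, Callable

# Hypothetical agent signatures.
Retriever = Callable[[str], Awaitable[list[str]]]
Generator = Callable[[str, list[str], list[str]], AsyncIterator[str]]


class Coordinator:
    def __init__(self, retrieve: Retriever, generate: Generator) -> None:
        self._retrieve = retrieve
        self._generate = generate
        # In-memory stand-in for a per-user session store.
        self._memory: dict[str, list[str]] = defaultdict(list)

    async def handle(self, user_id: str, query: str) -> AsyncIterator[str]:
        """Retrieve context, stream generated tokens, then record the turn."""
        history = self._memory[user_id]
        snippets = await self._retrieve(query)
        chunks: list[str] = []
        async for token in self._generate(query, snippets, history):
            chunks.append(token)
            yield token  # forward each token as soon as it arrives
        history.append(query)
        history.append("".join(chunks))


async def _demo() -> None:
    async def retrieve(query: str) -> list[str]:
        return ["snippet about " + query]

    async def generate(query, snippets, history):
        for token in ("Grounded ", "answer"):
            yield token

    coordinator = Coordinator(retrieve, generate)
    async for token in coordinator.handle("user-1", "elections"):
        print(token, end="")
    print()


if __name__ == "__main__":
    asyncio.run(_demo())
```

Note that the turn is written to memory only after the stream completes, so a client that disconnects mid-answer leaves no partial history behind.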
03 Implementation decisions

the trade-offs that mattered

// 01
Serverless over container-based hosting

Traffic shape was unpredictable: flat during the night, vertical spikes on breaking news. AWS Lambda with provisioned concurrency kept a warm baseline that avoids cold starts, while on-demand scaling absorbed the spikes, so costs stayed predictable without paying for peak capacity during idle hours.

// 02
Semantic chunking, not fixed-window chunking

Articles were split by editorial unit (lede, body sections, pull quotes), not by fixed token windows. Retrieval quality improved measurably because the LLM received self-contained context units instead of fragments.
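
A rough sketch of the difference between the two strategies. The article schema (`id`, `lede`, `sections`, `pull_quotes`) and the function names are assumptions, and the fixed-window baseline counts words as a stand-in for tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    article_id: str
    unit: str  # "lede", "section", or "pull_quote"
    text: str


def chunk_by_editorial_unit(article: dict) -> list[Chunk]:
    """Emit one chunk per editorial unit, skipping empty ones."""
    chunks: list[Chunk] = []
    article_id = article["id"]
    if article.get("lede", "").strip():
        chunks.append(Chunk(article_id, "lede", article["lede"].strip()))
    for section in article.get("sections", []):
        if section.strip():
            chunks.append(Chunk(article_id, "section", section.strip()))
    for quote in article.get("pull_quotes", []):
        if quote.strip():
            chunks.append(Chunk(article_id, "pull_quote", quote.strip()))
    return chunks


def chunk_by_fixed_window(text: str, size: int) -> list[str]:
    """Baseline for comparison: split every `size` words regardless of meaning."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

The editorial-unit chunks keep their `unit` label, which also makes it possible to weight a lede differently from a pull quote at retrieval time.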

// 03
Streaming responses by default

Perceived latency drops dramatically when the first token lands fast. The orchestrator streams the LLM output to the client over Server-Sent Events (SSE) from FastAPI, so users see the answer forming instead of a spinner.
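
A minimal sketch of SSE framing for a token stream, using only the standard library. The event names (`token`, `done`) and the JSON payload shape are assumptions; in a FastAPI app a generator like this could be returned via `StreamingResponse(..., media_type="text/event-stream")`.

```python
import asyncio
import json
from typing import AsyncIterator


def sse_frame(data: dict, event: str = "token") -> str:
    """Format one Server-Sent Events message; a blank line ends the message."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_as_sse(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap a token stream in SSE frames and signal completion at the end."""
    async for token in tokens:
        yield sse_frame({"text": token})
    yield sse_frame({}, event="done")


async def _demo() -> None:
    async def fake_llm() -> AsyncIterator[str]:
        for token in ("Breaking", " news"):
            yield token

    async for frame in stream_as_sse(fake_llm()):
        print(frame, end="")


if __name__ == "__main__":
    asyncio.run(_demo())
```

The explicit `done` event lets the client close the connection cleanly instead of waiting for a timeout.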

// 04
Hard refusal policy on empty retrieval

When no relevant article is found, the bot refuses to answer rather than falling back to general world knowledge. This single rule eliminated the entire class of "confidently wrong about Italian news" failure modes.
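
One way such a gate could look, as a sketch under assumptions: the `Hit` type, the similarity threshold of 0.75, and the refusal wording are all illustrative, not values from the production system.

```python
from dataclasses import dataclass

REFUSAL = "I couldn't find coverage of this in the archive, so I can't answer."


@dataclass(frozen=True)
class Hit:
    text: str
    score: float  # similarity in [0, 1]; higher means more relevant


def relevant_hits(hits: list[Hit], min_score: float = 0.75) -> list[Hit]:
    """Keep only hits that clear the relevance threshold (value is illustrative)."""
    return [h for h in hits if h.score >= min_score]


def answer_or_refuse(question: str, hits: list[Hit], generate) -> str:
    """Call the generator only when grounded context exists; otherwise refuse.

    The model is never invoked on the refusal path, so it cannot fall back
    to general world knowledge.
    """
    context = relevant_hits(hits)
    if not context:
        return REFUSAL
    return generate(question, context)
```

Placing the check before the model call, rather than asking the model to refuse via the prompt, is what makes the rule structural instead of statistical.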

AI / Orchestration: LlamaIndex · LangChain · OpenAI / Gemini
Backend: Python · FastAPI · SSE streaming
Infra: AWS Lambda · Redis · PostgreSQL · GitHub Actions
04 Results

measured in production

2M+ monthly active users: flagship system within the client's product suite
<180ms first-token latency: measured p95, end-to-end including retrieval
99.9% uptime SLA maintained: no rollback events since go-live
hallucination QA: manual QA pass on editorial QA sample over 6 months
05 What I'd do differently

honest retrospective

The semantic chunking pass was implemented only after the first quality complaints from the editorial team. In hindsight, investing in retrieval quality early would have saved roughly six weeks of prompt-engineering band-aids that tried to compensate for noisy context. The mantra "retrieval beats prompting" feels obvious now, but it only became obvious after the fact.

The second lesson is on cost observability: serverless cost is non-trivial to forecast when LLM provider tokens dominate the bill. I'd build the cost dashboard before the launch, not after.
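
As a starting point for that dashboard, a back-of-the-envelope token cost model might look like the sketch below. The `Pricing` type and function names are hypothetical, and any prices passed in are placeholders to be replaced with the provider's current rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1,000 tokens; placeholder values, not real provider prices."""
    input_per_1k: float
    output_per_1k: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Token cost of one LLM call."""
    return (input_tokens / 1000) * pricing.input_per_1k + (
        output_tokens / 1000
    ) * pricing.output_per_1k


def monthly_cost(requests: int, avg_in: int, avg_out: int, pricing: Pricing) -> float:
    """Naive forecast: request volume times average per-request token cost."""
    return requests * request_cost(avg_in, avg_out, pricing)
```

Because retrieved context is sent as input tokens on every request, a model like this makes it visible that retrieval depth, not just traffic, drives the bill.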
