
CASE STUDY · AI · agents · cloud · live

Cloud-native news chatbot

Real-time news summarization and conversational discovery for one of Italy's major digital outlets. Multi-agent architecture with persistent memory, streaming responses, and semantic retrieval over a live editorial corpus.

8M+ · monthly readers reached

<3s · end-to-end response time

99.9% · uptime SLA

live · real-time news corpus

client: major Italian news outlet

role: lead engineer

year: 2025 to present

status: in production

01— The challenge

why this problem was worth solving

News consumption is fragmented. Readers want context fast — what happened, why it matters, what to read next — but the editorial product is still organized as a stream of articles. The client needed a conversational layer on top of their content that could summarize, compare, and recommend across their entire live corpus, plus surface trending topics from outside their archive — without producing hallucinations or stale answers.

There were three hard constraints: low latency at scale (a conversation is unusable past a few seconds), strict factual grounding in the client's editorial archive (no LLM guesses), and predictable infrastructure cost under unknown traffic shapes, with peaks driven by breaking news rather than by user growth curves.

// design constraint

"A news chatbot that hallucinates is a liability. The architecture had to make hallucination structurally hard, not just statistically rare."

02— Approach & architecture

technical design and system components

The system runs as a multi-agent orchestration on Google Cloud. A coordinator agent receives the user query, dispatches to specialized retrieval, generation, and trending-detection agents, and streams the synthesized response back. Retrieval happens against a continuously indexed corpus of editorial content — updated in near-real-time as the newsroom publishes. A separate path detects trending topics from public sources and cross-references them with the archive, so the bot can suggest relevant reads beyond the question asked.
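The dispatch flow described above can be sketched in a few lines of `asyncio`. This is a minimal illustration, not the production code: the agent functions, the in-memory corpus, and the keyword-overlap matching are all hypothetical stand-ins for the real retrieval, trending, and generation agents.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Snippet:
    article_id: str
    text: str


async def retrieval_agent(query: str) -> list[Snippet]:
    # Hypothetical stand-in for semantic search over the indexed corpus:
    # here, naive keyword overlap against two hard-coded snippets.
    corpus = [
        Snippet("a1", "The budget law passed the Senate on Tuesday."),
        Snippet("a2", "Serie A resumes this weekend."),
    ]
    words = set(query.lower().split())
    return [s for s in corpus if words & set(s.text.lower().rstrip(".").split())]


async def trending_agent(query: str) -> list[str]:
    # Hypothetical stand-in for trend detection cross-referenced with the archive.
    return ["a2"]


async def generation_agent(query: str, snippets: list[Snippet]) -> AsyncIterator[str]:
    # Hypothetical stand-in for the LLM: emits each snippet with its citation.
    for s in snippets:
        yield f"{s.text} [{s.article_id}] "


async def coordinator(query: str) -> AsyncIterator[str]:
    # Retrieval and trending detection are independent, so run them concurrently.
    snippets, trending = await asyncio.gather(
        retrieval_agent(query), trending_agent(query)
    )
    # Stream the grounded answer chunk by chunk.
    async for chunk in generation_agent(query, snippets):
        yield chunk
    # Suggest trending articles that were not already cited in the answer.
    cited = {s.article_id for s in snippets}
    related = [a for a in trending if a not in cited]
    if related:
        yield f"Related reads: {', '.join(related)}"


async def collect(query: str) -> list[str]:
    return [chunk async for chunk in coordinator(query)]


if __name__ == "__main__":
    for chunk in asyncio.run(collect("budget law")):
        print(chunk)
```

Running the two lookups with `asyncio.gather` keeps the trending path off the critical path for latency; only generation has to wait for retrieval.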

Every answer is grounded in retrieved snippets with explicit citations. The generation agent runs with a strict retrieval-or-refuse policy: if no relevant context is found, the bot says so rather than fabricating.

High-level architecture // simplified

USER QUERY (web / mobile) → COORDINATOR (multi-agent routing) → STREAMED REPLY (with citations)

RETRIEVAL: semantic search · Firestore vectors

GENERATION: Gemini grounded · refuse if empty

TRENDING: external signals · cross-referenced

03— Implementation decisions

the trade-offs that mattered

// 01

GCP-native over multi-cloud

Cloud Run + Firestore + Gemini share the same provider, the same auth, the same observability surface. The simplicity savings — one IAM model, one billing surface, one tracing stack — outweighed any portability concern for a system this opinionated about its LLM.

// 02

Semantic chunking, not fixed-window chunking

Articles were split by editorial unit (lede, body sections, pull quotes), not by fixed token windows. Retrieval quality improved measurably because the LLM received self-contained context units instead of fragments.
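As a rough illustration of splitting by editorial unit, the sketch below turns one article into self-contained chunks. The article schema (`lede`, `sections`, `pull_quotes`) and the `Chunk` type are assumptions made for this example; the real CMS structure is not described here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    article_id: str
    kind: str  # "lede", "section", or "pull_quote"
    text: str


def chunk_by_editorial_unit(article: dict) -> list[Chunk]:
    """Split an article along editorial boundaries instead of token windows."""
    article_id = article["id"]
    chunks: list[Chunk] = []

    # The lede is a single self-contained unit.
    if article.get("lede"):
        chunks.append(Chunk(article_id, "lede", article["lede"].strip()))

    # Each body section stays whole, with its heading kept for context.
    for section in article.get("sections", []):
        heading = section.get("heading", "").strip()
        body = section["body"].strip()
        text = f"{heading}\n{body}" if heading else body
        chunks.append(Chunk(article_id, "section", text))

    # Pull quotes are indexed separately so they can be retrieved on their own.
    for quote in article.get("pull_quotes", []):
        chunks.append(Chunk(article_id, "pull_quote", quote.strip()))

    return chunks


if __name__ == "__main__":
    sample = {
        "id": "a1",
        "lede": "The Senate approved the budget law.",
        "sections": [
            {"heading": "What changes", "body": "Tax brackets are revised."},
            {"body": "The vote followed a long debate."},
        ],
        "pull_quotes": ["A turning point, said the minister."],
    }
    for chunk in chunk_by_editorial_unit(sample):
        print(chunk.kind, "->", chunk.text)
```

Because every chunk carries its `article_id`, a retrieved chunk can be cited back to its source article without extra lookups.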

// 03

Streaming responses by default

Perceived latency drops dramatically when the first token lands fast. The orchestrator streams the LLM output over Server-Sent Events (SSE) from the FastAPI backend, so users see the answer forming instead of a spinner.
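To make the streaming path concrete, here is a small sketch of how tokens could be wrapped into SSE frames. It uses only the standard library; `fake_llm_tokens`, the `token` payload key, and the `done` event name are hypothetical. In a FastAPI app, an async generator like `sse_stream` would typically be passed to `StreamingResponse` with `media_type="text/event-stream"`.

```python
import asyncio
import json
from typing import AsyncIterator, Optional


def sse_frame(data: dict, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame: optional event name, JSON data, blank line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


async def fake_llm_tokens() -> AsyncIterator[str]:
    # Hypothetical stand-in for the model's token stream.
    for token in ["The ", "Senate ", "approved ", "it."]:
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield token


async def sse_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    # Forward each token as soon as it arrives so the first frame lands early.
    async for token in tokens:
        yield sse_frame({"token": token})
    # Explicit end-of-stream marker so the client knows when to stop listening.
    yield sse_frame({}, event="done")


async def collect_frames() -> list[str]:
    return [frame async for frame in sse_stream(fake_llm_tokens())]


if __name__ == "__main__":
    for frame in asyncio.run(collect_frames()):
        print(frame, end="")
```

JSON-encoding each payload keeps newlines inside tokens from breaking the frame format, since a raw newline would otherwise end the `data:` line.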

// 04

Hard refusal policy on empty retrieval

When no relevant article is found, the bot refuses to answer rather than fall back to general world knowledge. This single rule eliminated the entire class of "confidently wrong about Italian news" failure modes.
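The refusal rule can be expressed as a guard that runs before the model is ever called. The sketch below is an assumption-laden illustration: the `Hit` type, the `MIN_SCORE` threshold, the refusal text, and the prompt wording are all hypothetical, and `llm` is any callable that takes a prompt string.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "I couldn't find coverage of that in the archive."
MIN_SCORE = 0.75  # hypothetical relevance threshold


@dataclass
class Hit:
    article_id: str
    text: str
    score: float  # similarity score from retrieval, higher is more relevant


def build_grounded_prompt(
    question: str, hits: list[Hit], min_score: float = MIN_SCORE
) -> Optional[str]:
    """Return a prompt containing only relevant snippets, or None if there are none."""
    relevant = [h for h in hits if h.score >= min_score]
    if not relevant:
        return None
    context = "\n".join(f"[{h.article_id}] {h.text}" for h in relevant)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


def answer(question: str, hits: list[Hit], llm: Callable[[str], str]) -> str:
    prompt = build_grounded_prompt(question, hits)
    if prompt is None:
        # Refuse without calling the model, so it cannot fall back
        # to general world knowledge.
        return REFUSAL
    return llm(prompt)


if __name__ == "__main__":
    echo_llm = lambda prompt: f"(model saw {len(prompt)} chars)"
    print(answer("Who won?", [], echo_llm))
    print(answer("What passed?", [Hit("a1", "The budget law passed.", 0.9)], echo_llm))
```

Placing the check outside the prompt makes the rule structural: an empty retrieval never reaches the model, so the behavior does not depend on the model obeying an instruction.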

AI / Orchestration: LlamaIndex · Gemini · Vertex AI

Backend: Python · FastAPI · SSE streaming

Frontend: Next.js · TypeScript · Tailwind

Infra: GCP Cloud Run · Firestore · GitHub Actions

04— Results

measured in production

8M+ · monthly readers reached · deployed across the publisher's flagship product

<3s · end-to-end response time · including retrieval, generation, streaming

99.9% · uptime SLA maintained · no rollback events since go-live

live corpus · real-time indexing · newsroom publishes → bot answers in minutes

05— What I'd do differently

honest retrospective

The semantic chunking pass was implemented only after the first quality complaints from the editorial team. In hindsight, investing in retrieval quality early would have saved roughly six weeks of prompt-engineering band-aids that tried to compensate for noisy context. The mantra "retrieval beats prompting" feels obvious now, but it only became so after the fact.

The second lesson is on cost observability: LLM token spend is non-trivial to forecast when traffic is bursty and questions vary wildly in retrieval depth. I'd build the cost-per-conversation dashboard before launch, not after.
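The core of such a dashboard is a small aggregation over per-turn token counts. The sketch below shows the arithmetic only; the per-token prices are placeholders rather than real Gemini rates, and the `Turn` record is a hypothetical logging schema.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1M tokens; substitute the provider's current rates.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 4.00


@dataclass
class Turn:
    conversation_id: str
    input_tokens: int   # prompt plus retrieved context
    output_tokens: int  # generated answer


def cost_per_conversation(turns: list[Turn]) -> dict[str, float]:
    """Sum the token cost of every turn, grouped by conversation."""
    totals: dict[str, float] = {}
    for turn in turns:
        cost = (
            turn.input_tokens * PRICE_PER_M_INPUT
            + turn.output_tokens * PRICE_PER_M_OUTPUT
        ) / 1_000_000
        totals[turn.conversation_id] = totals.get(turn.conversation_id, 0.0) + cost
    return totals


if __name__ == "__main__":
    log = [
        Turn("c1", input_tokens=6_000, output_tokens=400),
        Turn("c1", input_tokens=9_000, output_tokens=500),
        Turn("c2", input_tokens=2_000, output_tokens=150),
    ]
    for conversation_id, usd in cost_per_conversation(log).items():
        print(f"{conversation_id}: ${usd:.4f}")
```

Tracking input and output tokens separately matters here because retrieval depth mostly inflates the input side, which is where bursty, context-heavy questions show up.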

Building something similar?

Let's talk architecture before code.

If you're scoping a production AI system — retrieval, agents, or cloud-native infra — a 30-min technical chat usually saves weeks downstream.