I Built a RAG System That Listens to Live BBC News and Answers Questions About "What Happened 10 Minutes Ago"
The Problem Nobody Talks About in RAG
Every RAG tutorial shows you how to query static documents. Upload PDFs, chunk them, embed them, done. But what if your knowledge base is constantly changing? What if information flows in real-time and you need to ask "what happened in the last 30 minutes?"
Traditional RAG breaks down completely. You cannot ask temporal questions like "what was the breaking news at 9 AM?" or "summarize channel 0 from the past hour" because documents have no concept of time.
I spent a weekend fixing this.
What I Built
A live audio streaming RAG system that continuously captures BBC World Service radio (http://stream.live.vc.bbcmedia.co.uk/bbc_world_service), transcribes it in real-time, and lets you query across temporal windows with natural language.
Not just semantic search. Time-aware semantic search.
Ask it "what were the main topics in the last 10 minutes" and it filters documents by timestamp, retrieves relevant chunks, reranks them for accuracy, and generates an answer citing specific broadcast times. Every answer includes sources with precise UTC timestamps like "In the 14:23 UTC segment..."
The system runs 24/7 in the background, capturing 60-second audio chunks, transcribing them with NVIDIA Riva ASR, embedding them with NeMo Retriever, and indexing them with temporal metadata. Within about a minute of broadcast, the content becomes queryable.
How It Actually Works
Think of it as three parallel processes that never stop:
The Listening Loop: Every minute, FFmpeg captures live audio from BBC World Service. The chunk gets saved with a UTC timestamp in its filename. No audio is lost. No gaps.
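Here is a minimal sketch of that capture loop in Python, assuming ffmpeg is installed and on the PATH; the output directory, filename pattern, and audio format are illustrative choices, not necessarily the project's exact ones:

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

STREAM_URL = "http://stream.live.vc.bbcmedia.co.uk/bbc_world_service"
CHUNK_SECONDS = 60
AUDIO_DIR = Path("audio")
AUDIO_DIR.mkdir(exist_ok=True)

def capture_chunk(channel_id: int = 0) -> Path:
    """Record one 60-second chunk; the UTC start time goes into the filename."""
    start = datetime.now(timezone.utc)
    out_path = AUDIO_DIR / f"ch{channel_id}_{start.strftime('%Y%m%dT%H%M%SZ')}.wav"
    subprocess.run(
        [
            "ffmpeg", "-y", "-loglevel", "error",
            "-i", STREAM_URL,            # live BBC World Service stream
            "-t", str(CHUNK_SECONDS),    # stop after 60 seconds of audio
            "-ac", "1", "-ar", "16000",  # mono, 16 kHz: a typical ASR input format
            str(out_path),
        ],
        check=True,
    )
    return out_path

if __name__ == "__main__":
    while True:  # reading a live stream takes ~60 s per chunk, so this loops once a minute
        print("captured", capture_chunk())
```

Because ffmpeg reads the stream in real time, each call naturally takes about a minute, which is what keeps the loop on its once-per-minute cadence.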
The Intelligence Layer: NVIDIA Riva transcribes each audio chunk into text. The transcript gets embedded using NeMo Retriever's 300M parameter model and stored in ChromaDB. But here is the key: every document carries metadata with Unix timestamps for when the audio started and ended.
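A sketch of that indexing step. The transcribe() and embed() stubs below stand in for the Riva ASR and NeMo Retriever calls (not shown), and the collection name is illustrative; the point is the temporal metadata attached to every document:

```python
import chromadb

# Placeholders: in the real pipeline, transcribe() wraps Riva ASR and
# embed() wraps the NeMo Retriever embedding NIM.
def transcribe(wav_path: str) -> str:
    raise NotImplementedError("call Riva ASR here")

def embed(text: str) -> list[float]:
    raise NotImplementedError("call the NeMo Retriever embedding NIM here")

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("bbc_world_service")

def index_chunk(wav_path: str, channel_id: int, start_ts: float, end_ts: float) -> None:
    """Store one transcript with the temporal metadata that makes it time-queryable."""
    text = transcribe(wav_path)
    collection.add(
        ids=[f"ch{channel_id}_{int(start_ts)}"],
        documents=[text],
        embeddings=[embed(text)],
        metadatas=[{
            "channel": channel_id,
            "start_ts": start_ts,           # Unix seconds (UTC) when the audio began
            "end_ts": end_ts,               # Unix seconds (UTC) when it ended
            "duration": end_ts - start_ts,
            "word_count": len(text.split()),
        }],
    )
```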
The Query Engine: When you ask a question, the system first applies time filters. "Last 30 minutes" translates to a database filter on Unix timestamps. Then vector search happens only within that time window. NVIDIA's Llama 3.2 reranker scores the top candidates. Ministral 14B generates the final answer using only those time-filtered, reranked sources.
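Continuing the sketch above (it reuses collection and embed), this is roughly what the time-filtered retrieval looks like; the reranking and generation steps are left as comments here and sketched further down:

```python
import time

def temporal_query(question: str, minutes: int = 30, channel_id: int = 0, n_candidates: int = 20):
    """'Last N minutes' becomes a metadata filter; vector search runs only inside that window."""
    now = time.time()
    window = {
        "$and": [
            {"channel": {"$eq": channel_id}},
            {"start_ts": {"$gte": now - minutes * 60}},  # only chunks that began in the window
        ]
    }
    hits = collection.query(
        query_embeddings=[embed(question)],
        n_results=n_candidates,
        where=window,
    )
    docs, metas = hits["documents"][0], hits["metadatas"][0]
    # Next (not shown here): rerank the candidates, keep the top 5, and prompt the
    # LLM with those chunks plus their UTC timestamps so the answer can cite them.
    return list(zip(docs, metas))
```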
The result is a conversational interface where time is a first-class citizen. Not an afterthought.
Beyond News – Real Applications
While this demo indexes BBC World Service, the architecture was inspired by NVIDIA's Software-Defined Radio blueprint. The same pipeline works with any audio stream.
Defense and intelligence applications are obvious. Monitor multiple radio frequencies simultaneously, query across channels for specific keywords or topics within defined time windows, detect pattern anomalies in real-time communications. The temporal query capability means analysts can ask "what suspicious activity occurred between 2 AM and 4 AM on channel 7" and get immediate, sourced answers.
Emergency response scenarios benefit similarly. Index emergency radio channels, query for developing situations, track how information evolves minute by minute. Corporate compliance teams could monitor trading floor communications with temporal audit trails.
The system handles multi-channel ingestion, so you can monitor dozens of streams concurrently, each maintaining independent temporal indexes while sharing the same query interface.
The Technical Breakthrough
The hard part was not the transcription or the embeddings. Those are solved problems. The breakthrough was designing metadata that supports both semantic similarity and temporal filtering simultaneously.
Each transcript chunk stores: channel ID, Unix start timestamp, Unix end timestamp, duration, word count, and the full text. ChromaDB's metadata filtering combines with vector search, so you can ask semantically complex questions within precisely defined time windows.
The reranking layer is critical. Initial vector search retrieves 20 candidates. NVIDIA's cross-encoder reranker rescores them and selects the top 5 most relevant. This two-stage retrieval dramatically improves accuracy compared to embeddings alone.
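A sketch of that second stage, calling NVIDIA's hosted reranking NIM over HTTP. The endpoint path, model id, and response shape below are assumptions based on the API catalog's documented pattern, so verify them against the current docs; an NVIDIA_API_KEY environment variable is assumed:

```python
import os
import requests

# Assumed endpoint and payload shape for the hosted reranking NIM; double-check
# against the current NVIDIA API catalog docs before relying on it.
RERANK_URL = "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"

def rerank(question: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Cross-encoder rescoring: 20 vector-search candidates in, top 5 out."""
    resp = requests.post(
        RERANK_URL,
        headers={"Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}"},
        json={
            "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
            "query": {"text": question},
            "passages": [{"text": text} for text in candidates],
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Assumed response shape: rankings ordered by relevance, each pointing back
    # into the passages list by index.
    rankings = resp.json()["rankings"]
    return [candidates[r["index"]] for r in rankings[:top_k]]
```

Wired together, the flow is: temporal_query() returns 20 time-filtered candidates, rerank() keeps the best 5, and only those 5 reach the LLM prompt.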
Background processing was the other challenge. The system must capture and index new audio while simultaneously serving queries on existing data. Threading separates these concerns. One thread runs the continuous capture loop. The main thread handles web requests and query processing. They share access to the same vector database.
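A minimal threading sketch, reusing capture_chunk() and index_chunk() from the earlier snippets; the main-thread placeholder stands in for whatever web framework the app actually uses:

```python
import threading
import time

def capture_and_index_forever(channel_id: int = 0) -> None:
    """Background loop: capture one chunk, transcribe and index it, repeat."""
    while True:
        start_ts = time.time()
        wav_path = capture_chunk(channel_id)                      # ~60 s of live audio
        index_chunk(str(wav_path), channel_id, start_ts, time.time())

# Daemon thread: it dies with the process and never blocks shutdown.
worker = threading.Thread(target=capture_and_index_forever, daemon=True)
worker.start()

# The main thread stays free for web requests and query processing.
while True:
    time.sleep(3600)  # placeholder for the real request loop (Flask, FastAPI, ...)
```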
What You Can Build With This
This is not a research project. It is a blueprint for production systems.
Replace the BBC stream URL with any audio source: podcasts for automated show notes with timestamp citations, customer service calls for queryable support archives, corporate meeting recordings for searchable discussion histories, radio monitoring for media analysis and trend detection.
The entire stack runs on NVIDIA NIM microservices through their API. No custom infrastructure required. No GPU hosting costs. Deploy it in a Google Colab notebook and serve it via Cloudflare tunnel. The demo video shows it running end-to-end in minutes.
Try It Yourself
The demo video shows the system in action. Watch background capture continuously index new transcripts while queries run in parallel. See temporal filters in action. Notice how every answer cites precise broadcast times.
This is what RAG looks like when it understands time.
I built this over a weekend to prove that temporal RAG is not just possible – it is practical.