
GeoThreadBot is an end-to-end AI news automation system that monitors geopolitical events across the internet and publishes verified stories to Twitter/X. The pipeline runs seven stages in sequence. I refactored it from a single 953-line prototype into a modular Python package: 25 modules across 8 packages, covered by 19 unit tests.
Stage 1 is concurrent data collection. Three fetchers run in separate threads: the Twitter fetcher monitors accounts and hashtags via the Twitter API v2, the YouTube fetcher extracts video transcripts using the YouTube Data API v3, and the RSS fetcher monitors news feeds via feedparser. All three run simultaneously and feed into a shared processing queue.
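The fan-in pattern of Stage 1 can be sketched as follows. This is a minimal illustration, assuming each fetcher exposes a callable that yields items; the real fetchers wrap the Twitter API v2, the YouTube Data API v3, and feedparser, and the names here are illustrative.

```python
import queue
import threading

def run_fetcher(name, fetch, out_queue):
    """Run one fetcher and push every item it yields onto the shared queue."""
    for item in fetch():
        out_queue.put((name, item))

def collect(fetchers):
    """Start all fetchers in parallel threads and drain the shared queue.

    `fetchers` maps a source name to a zero-argument callable returning
    an iterable of raw items (a stub for the real API clients).
    """
    shared = queue.Queue()
    threads = [
        threading.Thread(target=run_fetcher, args=(name, fetch, shared))
        for name, fetch in fetchers.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [shared.get() for _ in range(shared.qsize())]
```

A `queue.Queue` is already thread-safe, so the three fetcher threads can push into it without extra locking.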
Stage 2 handles language detection and translation. The system supports 12+ languages and automatically detects the source language using langdetect. Non-English content gets translated via deep-translator before processing.
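Stage 2 reduces to a small decision: detect, then translate only when needed. A sketch with the detector and translator injected as callables (in the real pipeline these would be `langdetect.detect` and a `deep_translator` translator instance; stubs keep the sketch self-contained):

```python
def normalise_to_english(text, detect, translate):
    """Detect the source language and translate non-English content.

    `detect` returns an ISO 639-1 code (e.g. 'de', 'en');
    `translate` maps the original text to English.
    Returns (english_text, detected_language).
    """
    lang = detect(text)
    if lang == "en":
        return text, lang        # already English: skip the translation cost
    return translate(text), lang
```

Injecting the two callables keeps the stage unit-testable without network access, which matters for the API-mocked test suite.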
Stage 3 is NLP filtering. spaCy extracts named entities and keywords from each article, and VADER scores the emotional tone. A relevance-scoring algorithm combines keyword frequency, entity matching, and domain relevance to filter out low-quality content before the expensive transformer inference runs.
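An illustrative scoring function combining the three signals named above. The weights and the length normalisation are assumptions for the sketch, not the production values:

```python
def relevance_score(text, keywords, entities, domain_terms):
    """Combine keyword frequency, entity matches, and domain relevance."""
    lowered = text.lower()
    words = lowered.split()
    kw_hits = sum(words.count(k.lower()) for k in keywords)
    ent_hits = sum(1 for e in entities if e.lower() in lowered)
    dom_hits = sum(1 for d in domain_terms if d.lower() in lowered)
    # Normalise keyword frequency by document length so long articles
    # do not dominate purely by volume.
    kw_freq = kw_hits / max(len(words), 1)
    return 0.5 * kw_freq + 0.3 * ent_hits + 0.2 * dom_hits    # assumed weights
```

Articles scoring below a tuned threshold would be dropped here, so only a fraction of the fetched content ever reaches the BART model.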
Stage 4 is AI summarisation. The BART model (facebook/bart-large-cnn) generates concise summaries. The model is lazy-loaded on first use to avoid blocking startup. NLTK handles sentence tokenisation for key point extraction.
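The lazy-loading pattern can be shown generically. The factory is injected so the pattern is testable without downloading weights; in the real system it would construct the `facebook/bart-large-cnn` pipeline, but nothing here is the production class:

```python
class LazySummariser:
    """Defer construction of an expensive model until first use."""

    def __init__(self, factory):
        self._factory = factory   # e.g. a closure building the BART pipeline
        self._model = None

    def __call__(self, text):
        if self._model is None:           # first call pays the load cost
            self._model = self._factory()
        return self._model(text)
```

Because the model attribute starts as `None`, importing the module (and therefore starting the whole pipeline) never touches the heavyweight transformer weights.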
Stage 5 is fact verification. Each story gets cross-referenced against credible sources (BBC, Reuters, AP News, Al Jazeera, NY Times, The Hindu). Confidence scoring starts at 0 and increments by 0.4 for each credible source that corroborates the story. Stories need a confidence score of 0.6 or higher to pass verification.
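The arithmetic above means one corroborating source (0.4) is not enough, while two (0.8) clear the 0.6 threshold. A direct sketch of that rule, with the credible-source list taken from the text:

```python
CREDIBLE_SOURCES = {"bbc", "reuters", "ap news", "al jazeera",
                    "ny times", "the hindu"}

def verification_confidence(corroborating_sources, threshold=0.6):
    """Score starts at 0 and gains 0.4 per credible corroborating source.

    Returns (score, passed) where passed means score >= threshold.
    """
    score = 0.0
    for source in corroborating_sources:
        if source.lower() in CREDIBLE_SOURCES:
            score += 0.4
    return score, score >= threshold
```

In effect the threshold demands independent corroboration from at least two of the listed outlets before a story can move on to moderation.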
Stage 6 is human moderation. Verified stories enter a review queue accessible through a Flask API. I built this checkpoint because fully autonomous publishing of geopolitical news is risky. Only approved stories proceed.
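The queue mechanics behind that checkpoint can be sketched as a plain in-memory structure. In the real system this state sits behind Flask routes; the class and method names here are illustrative, not the production API:

```python
class ModerationQueue:
    """Hold verified stories until a human approves or rejects them."""

    def __init__(self):
        self._pending = {}     # story_id -> story awaiting review
        self._approved = []    # stories cleared for publishing

    def submit(self, story_id, story):
        self._pending[story_id] = story

    def pending(self):
        return dict(self._pending)

    def approve(self, story_id):
        self._approved.append(self._pending.pop(story_id))

    def reject(self, story_id):
        self._pending.pop(story_id)    # rejected stories are simply dropped

    def approved(self):
        return list(self._approved)
```

Each Flask endpoint would map onto one of these methods, so the HTTP layer stays a thin wrapper over the queue state.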
Stage 7 is distribution. Approved stories get formatted into Twitter threads within the 280-character limit, with domain hashtags and source citations. The tweepy client handles posting via the Twitter API.
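A sketch of the thread formatting: split a summary into numbered tweets under the 280-character limit, appending hashtags and the source citation to the final tweet. The exact layout (numbering suffix, where the hashtags go) is an assumption for illustration:

```python
LIMIT = 280

def format_thread(summary, hashtags, source_url):
    """Split `summary` into numbered tweets that fit the character limit."""
    tweets, current = [], ""
    for word in summary.split():
        candidate = (current + " " + word).strip()
        # Reserve room for a " (nn/nn)" numbering suffix on every tweet.
        if len(candidate) + len(" (99/99)") > LIMIT:
            tweets.append(current)
            current = word
        else:
            current = candidate
    if current:
        tweets.append(current)
    # Attach hashtags and the citation to the closing tweet of the thread.
    tweets[-1] += " " + " ".join(hashtags) + " " + source_url
    total = len(tweets)
    return [f"{t} ({i}/{total})" for i, t in enumerate(tweets, 1)]
```

A production version would also reserve space on the last tweet for the hashtags and URL; tweepy's client then posts each tweet in reply to the previous one to form the thread.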
The SQLite database tracks all claims, processed content, and pipeline metrics. The whole system runs on a configurable cron schedule.
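A minimal sketch of the claims tracking using the standard-library `sqlite3` module. The schema and column names here are assumptions, not the production schema:

```python
import sqlite3

def init_db(path=":memory:"):
    """Open the database and ensure the claims table exists."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS claims (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            story      TEXT NOT NULL,
            confidence REAL NOT NULL,
            status     TEXT NOT NULL DEFAULT 'pending'
        )
    """)
    return conn

def record_claim(conn, story, confidence):
    """Insert a verified claim and return its row id."""
    cur = conn.execute(
        "INSERT INTO claims (story, confidence) VALUES (?, ?)",
        (story, confidence),
    )
    conn.commit()
    return cur.lastrowid
```

Parameterised queries (the `?` placeholders) keep fetched article text from being interpreted as SQL, which matters when the pipeline ingests arbitrary web content.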
Key Features
- 7-stage automated pipeline with concurrent data fetching
- 3 data sources running in parallel: Twitter, YouTube, RSS
- Detection of 12+ languages with automatic translation
- spaCy + VADER NLP filtering with relevance scoring
- BART transformer summarisation with lazy loading
- Cross-source fact verification with confidence scoring
- Human moderation queue via Flask API
- Refactored from 953-line monolith to 25 modules across 8 packages
- 19 unit tests with API mocking via pytest
Tech Stack
- Python, pytest
- tweepy (Twitter API v2), YouTube Data API v3, feedparser
- langdetect, deep-translator
- spaCy, VADER, NLTK
- BART (facebook/bart-large-cnn)
- Flask, SQLite