
GeoThreadBot is an end-to-end AI news automation system that monitors geopolitical events across the internet and publishes verified stories to Twitter/X. The pipeline runs seven stages in sequence. I refactored it from a single 953-line prototype into a modular Python package: 25 modules across 8 packages, covered by 19 unit tests.
Stage 1 is concurrent data collection. Three fetchers run in separate threads: the Twitter fetcher monitors accounts and hashtags via the Twitter API v2, the YouTube fetcher extracts video transcripts using the YouTube Data API v3, and the RSS fetcher monitors news feeds via feedparser. All three run simultaneously and feed into a shared processing queue.
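The fan-in pattern of Stage 1 can be sketched as follows. This is a minimal illustration, assuming each fetcher exposes a callable that yields items; the real fetchers wrap the Twitter API v2, the YouTube Data API v3, and feedparser, and the names here are illustrative.

```python
import queue
import threading

def run_fetcher(name, fetch, out_queue):
    """Run one fetcher and push every item it yields onto the shared queue."""
    for item in fetch():
        out_queue.put((name, item))

def collect(fetchers):
    """Start all fetchers in parallel threads and drain the shared queue.

    `fetchers` maps a source name to a zero-argument callable returning
    an iterable of raw items (a stub for the real API clients).
    """
    shared = queue.Queue()
    threads = [
        threading.Thread(target=run_fetcher, args=(name, fetch, shared))
        for name, fetch in fetchers.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [shared.get() for _ in range(shared.qsize())]
```

A `queue.Queue` is already thread-safe, so the three fetcher threads can push into it without extra locking.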
Stage 2 handles language detection and translation. The system supports 12+ languages and automatically detects the source language using langdetect. Non-English content gets translated via deep-translator before processing.
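Stage 2 reduces to a small decision: detect, then translate only when needed. A sketch with the detector and translator injected as callables (in the real pipeline these would be `langdetect.detect` and a `deep_translator` translator instance; stubs keep the sketch self-contained):

```python
def normalise_to_english(text, detect, translate):
    """Detect the source language and translate non-English content.

    `detect` returns an ISO 639-1 code (e.g. 'de', 'en');
    `translate` maps the original text to English.
    Returns (english_text, detected_language).
    """
    lang = detect(text)
    if lang == "en":
        return text, lang        # already English: skip the translation cost
    return translate(text), lang
```

Injecting the two callables keeps the stage unit-testable without network access, which matters for the API-mocked test suite.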
Stage 3 is NLP filtering. spaCy extracts named entities and keywords from each article, and VADER scores the emotional tone. A relevance-scoring algorithm combines keyword frequency, entity matching, and domain relevance to filter out low-quality content before the expensive transformer inference runs.
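An illustrative scoring function combining the three signals named above. The weights and the length normalisation are assumptions for the sketch, not the production values:

```python
def relevance_score(text, keywords, entities, domain_terms):
    """Combine keyword frequency, entity matches, and domain relevance."""
    lowered = text.lower()
    words = lowered.split()
    kw_hits = sum(words.count(k.lower()) for k in keywords)
    ent_hits = sum(1 for e in entities if e.lower() in lowered)
    dom_hits = sum(1 for d in domain_terms if d.lower() in lowered)
    # Normalise keyword frequency by document length so long articles
    # do not dominate purely by volume.
    kw_freq = kw_hits / max(len(words), 1)
    return 0.5 * kw_freq + 0.3 * ent_hits + 0.2 * dom_hits    # assumed weights
```

Articles scoring below a tuned threshold would be dropped here, so only a fraction of the fetched content ever reaches the BART model.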
Stage 4 is AI summarisation. The BART model (facebook/bart-large-cnn) generates concise summaries. The model is lazy-loaded on first use to avoid blocking startup. NLTK handles sentence tokenisation for key point extraction.
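The lazy-loading pattern can be shown generically. The factory is injected so the pattern is testable without downloading weights; in the real system it would construct the `facebook/bart-large-cnn` pipeline, but nothing here is the production class:

```python
class LazySummariser:
    """Defer construction of an expensive model until first use."""

    def __init__(self, factory):
        self._factory = factory   # e.g. a closure building the BART pipeline
        self._model = None

    def __call__(self, text):
        if self._model is None:           # first call pays the load cost
            self._model = self._factory()
        return self._model(text)
```

Because the model attribute starts as `None`, importing the module (and therefore starting the whole pipeline) never touches the heavyweight transformer weights.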
Stage 5 is fact verification. Each story gets cross-referenced against credible sources (BBC, Reuters, AP News, Al Jazeera, NY Times, The Hindu). Confidence scoring starts at 0 and increments by 0.4 for each credible source that corroborates the story. Stories need a confidence score of 0.6 or higher to pass verification.
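The arithmetic above means one corroborating source (0.4) is not enough, while two (0.8) clear the 0.6 threshold. A direct sketch of that rule, with the credible-source list taken from the text:

```python
CREDIBLE_SOURCES = {"bbc", "reuters", "ap news", "al jazeera",
                    "ny times", "the hindu"}

def verification_confidence(corroborating_sources, threshold=0.6):
    """Score starts at 0 and gains 0.4 per credible corroborating source.

    Returns (score, passed) where passed means score >= threshold.
    """
    score = 0.0
    for source in corroborating_sources:
        if source.lower() in CREDIBLE_SOURCES:
            score += 0.4
    return score, score >= threshold
```

In effect the threshold demands independent corroboration from at least two of the listed outlets before a story can move on to moderation.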
Stage 6 is human moderation. Verified stories enter a review queue accessible through a Flask API. I built this checkpoint because fully autonomous publishing of geopolitical news is risky. Only approved stories proceed.
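The queue mechanics behind that checkpoint can be sketched as a plain in-memory structure. In the real system this state sits behind Flask routes; the class and method names here are illustrative, not the production API:

```python
class ModerationQueue:
    """Hold verified stories until a human approves or rejects them."""

    def __init__(self):
        self._pending = {}     # story_id -> story awaiting review
        self._approved = []    # stories cleared for publishing

    def submit(self, story_id, story):
        self._pending[story_id] = story

    def pending(self):
        return dict(self._pending)

    def approve(self, story_id):
        self._approved.append(self._pending.pop(story_id))

    def reject(self, story_id):
        self._pending.pop(story_id)    # rejected stories are simply dropped

    def approved(self):
        return list(self._approved)
```

Each Flask endpoint would map onto one of these methods, so the HTTP layer stays a thin wrapper over the queue state.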
Stage 7 is distribution. Approved stories get formatted into Twitter threads within the 280-character limit, with domain hashtags and source citations. The tweepy client handles posting via the Twitter API.
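A sketch of the thread formatting: split a summary into numbered tweets under the 280-character limit, appending hashtags and the source citation to the final tweet. The exact layout (numbering suffix, where the hashtags go) is an assumption for illustration:

```python
LIMIT = 280

def format_thread(summary, hashtags, source_url):
    """Split `summary` into numbered tweets that fit the character limit."""
    tweets, current = [], ""
    for word in summary.split():
        candidate = (current + " " + word).strip()
        # Reserve room for a " (nn/nn)" numbering suffix on every tweet.
        if len(candidate) + len(" (99/99)") > LIMIT:
            tweets.append(current)
            current = word
        else:
            current = candidate
    if current:
        tweets.append(current)
    # Attach hashtags and the citation to the closing tweet of the thread.
    tweets[-1] += " " + " ".join(hashtags) + " " + source_url
    total = len(tweets)
    return [f"{t} ({i}/{total})" for i, t in enumerate(tweets, 1)]
```

A production version would also reserve space on the last tweet for the hashtags and URL; tweepy's client then posts each tweet in reply to the previous one to form the thread.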
The SQLite database tracks all claims, processed content, and pipeline metrics. The whole system runs on a configurable cron schedule.
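A minimal sketch of the claims tracking using the standard-library `sqlite3` module. The schema and column names here are assumptions, not the production schema:

```python
import sqlite3

def init_db(path=":memory:"):
    """Open the database and ensure the claims table exists."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS claims (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            story      TEXT NOT NULL,
            confidence REAL NOT NULL,
            status     TEXT NOT NULL DEFAULT 'pending'
        )
    """)
    return conn

def record_claim(conn, story, confidence):
    """Insert a verified claim and return its row id."""
    cur = conn.execute(
        "INSERT INTO claims (story, confidence) VALUES (?, ?)",
        (story, confidence),
    )
    conn.commit()
    return cur.lastrowid
```

Parameterised queries (the `?` placeholders) keep fetched article text from being interpreted as SQL, which matters when the pipeline ingests arbitrary web content.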
Key Features
- 7-stage automated pipeline with concurrent data fetching
- 3 data sources running in parallel: Twitter, YouTube, RSS
- Detection of 12+ languages with automatic translation
- spaCy + VADER NLP filtering with relevance scoring
- BART transformer summarisation with lazy loading
- Cross-source fact verification with confidence scoring
- Human moderation queue via Flask API
- Refactored from 953-line monolith to 25 modules across 8 packages
- 19 unit tests with API mocking via pytest
Tech Stack
- Python, pytest
- tweepy (Twitter API v2), YouTube Data API v3, feedparser
- langdetect, deep-translator
- spaCy, VADER, NLTK
- BART (facebook/bart-large-cnn)
- Flask, SQLite