Methodology
Data Sources
Warspy ingests from two free, public APIs:
- GDELT 2.1 DOC API: a global media monitoring system that indexes news articles from thousands of outlets worldwide. We query for conflict-related keywords and collect article titles and URLs. No full article text is fetched.
- ReliefWeb: UN OCHA's humanitarian information platform. We collect reports tagged with the Security, Conflict and Violence, or Peacekeeping themes. The snippet is limited to the first 500 characters of body text.
We store only titles, URLs, metadata, and brief snippets. No full article content is reproduced. All outbound links open the original publisher.
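As an illustration of the storage rule above, here is a minimal sketch of a stored record. The names `StoredReport` and `make_report` and the source labels "gdelt" and "reliefweb" are hypothetical and not taken from the Warspy codebase; only the 500-character snippet limit comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

SNIPPET_MAX_CHARS = 500  # ReliefWeb snippet limit stated in the text


@dataclass(frozen=True)
class StoredReport:
    """Hypothetical record: title, URL, metadata, and a brief snippet only."""
    title: str
    url: str
    source: str            # "gdelt" or "reliefweb" (illustrative labels)
    reported_at: datetime
    snippet: str = ""      # stays empty for GDELT articles


def make_report(title, url, source, reported_at, body_text=None):
    """Build a record, keeping at most the first 500 characters of ReliefWeb body text."""
    snippet = ""
    if source == "reliefweb" and body_text:
        snippet = body_text[:SNIPPET_MAX_CHARS]
    # Body text from any other source is discarded, so no full article is stored.
    return StoredReport(title.strip(), url, source, reported_at, snippet)
```

In this sketch the full body text never reaches the stored record, which is one simple way to make the "no full article content" rule hold by construction.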
Deduplication & Clustering
Multiple sources often report the same event. We cluster overlapping reports deterministically using a weighted similarity score:
sim = 0.35 × textSim(title, clusterHeadline)
    + 0.25 × jaccardSim(keywords, clusterKeywords)
    + 0.25 × timeProximity(reportedAt, clusterUpdated)
    + 0.15 × geoProximity(distanceKm)
If the similarity to an existing cluster is ≥ 0.70, the report is attached to that cluster; otherwise a new cluster is created. No machine learning is used.
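The weighted score can be sketched as follows. Only the four weights and the 0.70 threshold come from the formula above. The component functions are assumptions made for illustration: a `difflib` ratio stands in for textSim, and the time and distance terms fall off linearly over a hypothetical 48-hour window and a hypothetical 500 km radius.

```python
from datetime import datetime, timedelta
from difflib import SequenceMatcher

THRESHOLD = 0.70            # from the text
TIME_WINDOW_HOURS = 48.0    # assumption: proximity reaches 0 after 48 hours
GEO_WINDOW_KM = 500.0       # assumption: proximity reaches 0 beyond 500 km


def text_sim(a: str, b: str) -> float:
    """Assumed text similarity: case-insensitive difflib ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard_sim(a: set, b: set) -> float:
    """Jaccard index of two keyword sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def time_proximity(reported_at: datetime, cluster_updated: datetime) -> float:
    """Assumed linear decay from 1.0 (same instant) to 0.0 at the window edge."""
    hours = abs((reported_at - cluster_updated).total_seconds()) / 3600.0
    return max(0.0, 1.0 - hours / TIME_WINDOW_HOURS)


def geo_proximity(distance_km: float) -> float:
    """Assumed linear decay from 1.0 (same place) to 0.0 at the radius."""
    return max(0.0, 1.0 - distance_km / GEO_WINDOW_KM)


def similarity(title, keywords, reported_at, distance_km,
               cluster_headline, cluster_keywords, cluster_updated) -> float:
    """Weighted sum using the weights given in the text (they total 1.0)."""
    return (0.35 * text_sim(title, cluster_headline)
            + 0.25 * jaccard_sim(keywords, cluster_keywords)
            + 0.25 * time_proximity(reported_at, cluster_updated)
            + 0.15 * geo_proximity(distance_km))


def should_attach(sim: float) -> bool:
    """Attach to the existing cluster when the score meets the threshold."""
    return sim >= THRESHOLD
```

Because every component is a pure function of its inputs, the same reports always produce the same clusters, which matches the deterministic behavior described above.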
Scoring Formula
Each cluster is assigned a score (0–100):
score = 0.45 × severity + 0.35 × credibility + 0.20 × recency
- Severity (0–100): Based on keywords in titles and snippets. High-severity terms (mass casualty, nuclear, etc.) add 20 points; medium-severity terms (killed, attack, etc.) add 10; casualty-count patterns add 10 each, up to 20.
- Credibility (0–100): Distinct source domains add 10 points each (maximum 40), known high-quality sources (Reuters, BBC, AP, etc.) add a bonus, and ReliefWeb presence adds 10.
- Recency (0–100): Exponential decay: 100 × e^(−ageHours/10). An event 10 hours old scores ~37 on recency.
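The three components and the combined score can be sketched as below. The weights, point values, caps, and decay constant come from the text. Everything else is an assumption for illustration: the term lists contain only the examples named above, the casualty regex and the trusted-domain set are hypothetical, the size of the high-quality-source bonus (15 here) is not stated in the text, and points are assumed to accrue once per distinct matching term.

```python
import math
import re

HIGH_TERMS = ("mass casualty", "nuclear")   # examples from the text; full list not given
MEDIUM_TERMS = ("killed", "attack")         # examples from the text; full list not given
# Assumption: a "casualty count pattern" is a number followed by a casualty word.
CASUALTY_PATTERN = re.compile(r"\b\d+\s+(?:killed|dead|wounded|injured)\b")
TRUSTED_DOMAINS = {"reuters.com", "bbc.com", "apnews.com"}  # hypothetical domain list
TRUSTED_BONUS = 15                          # assumption: bonus size is not stated


def severity(text: str) -> int:
    """Keyword-based severity, capped at 100."""
    t = text.lower()
    points = 20 * sum(term in t for term in HIGH_TERMS)
    points += 10 * sum(term in t for term in MEDIUM_TERMS)
    points += min(20, 10 * len(CASUALTY_PATTERN.findall(t)))
    return min(100, points)


def credibility(domains, has_reliefweb: bool) -> int:
    """Source-based credibility, capped at 100."""
    distinct = set(domains)
    points = min(40, 10 * len(distinct))
    points += TRUSTED_BONUS * len(distinct & TRUSTED_DOMAINS)
    if has_reliefweb:
        points += 10
    return min(100, points)


def recency(age_hours: float) -> float:
    """Exponential decay with the 10-hour constant from the text."""
    return 100.0 * math.exp(-age_hours / 10.0)


def cluster_score(sev: float, cred: float, rec: float) -> float:
    """Weighted combination; the weights total 1.0, so the result stays in 0-100."""
    return 0.45 * sev + 0.35 * cred + 0.20 * rec
```

With this decay constant, `recency(10)` is about 36.8, which agrees with the ~37 quoted above, and a brand-new event (`recency(0)`) scores exactly 100.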
Confidence Labels
High: ≥3 distinct source domains, or ReliefWeb (UN-verified) is among the sources. This does not mean the event is verified; it means multiple independent outlets, or ReliefWeb, have reported it.
Med: 2 distinct source domains. Corroborated but limited.
Low: Single source. Treat with caution; may be preliminary or unverified.
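The label rules above reduce to a short function. This is a sketch under one assumption: ReliefWeb presence is passed as a separate flag rather than being counted as one of the source domains. The function name is hypothetical.

```python
def confidence_label(source_domains, has_reliefweb: bool = False) -> str:
    """Map a cluster's sources to High / Med / Low using the rules in the text."""
    distinct = len(set(source_domains))  # repeated domains count once
    if has_reliefweb or distinct >= 3:
        return "High"
    if distinct == 2:
        return "Med"
    return "Low"
```

Note that repeated articles from the same domain do not raise the label, since only distinct domains are counted.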
Summaries
All summaries are extractive only — sentences are drawn directly from article titles and the first 500 characters of ReliefWeb body text. No language model or paraphrasing is applied. "According to [source]" prefixes identify which outlet provided each sentence.
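A minimal sketch of extractive summarization in this style is shown below. It assumes each report is a (source name, text) pair, takes the first sentence of each text unchanged, and uses a colon after the "According to [source]" prefix so the sentence itself is never reworded. The function names, the sentence-splitting regex, and the three-sentence limit are illustrative choices, not details from the text.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_sentence(text: str) -> str:
    """Return the first sentence of the text verbatim (no paraphrasing)."""
    return _SENTENCE_END.split(text.strip(), maxsplit=1)[0]


def extractive_summary(reports, max_sentences: int = 3):
    """Build attributed summary lines from (source, text) pairs.

    Sentences are copied as-is; exact duplicates are skipped so that
    syndicated headlines do not repeat.
    """
    lines = []
    seen = set()
    for source, text in reports:
        sentence = first_sentence(text)
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        lines.append(f"According to {source}: {sentence}")
        if len(lines) == max_sentences:
            break
    return lines
```

Since each output line is a prefix plus an unmodified source sentence, a reader can trace every statement back to the outlet that published it.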
Limitations
- Coverage is limited to outlets indexed by GDELT and to ReliefWeb reports. Local-language media may be underrepresented.
- Clustering is heuristic. Distinct events in the same country and timeframe may be incorrectly merged, or the same event may appear as multiple clusters.
- Severity scoring uses keyword matching and cannot assess context (e.g., historical vs. active operations).
- There is no editorial review; all data is collected and processed automatically. Do not use for life-safety decisions.
Legal Note
Warspy is an aggregator. We store only titles, URLs, metadata, and brief snippets. No full article content is cached or reproduced. All article links open the original publisher. GDELT data is published under an open license. ReliefWeb content is provided under Creative Commons.