Methodology
Last updated: N/A • Data source: World Bank (WDI & related)
Reproducible • Grounded • LLM-safe
1) Data ingestion & scope
- API: World Bank v2 JSON endpoints (/sources, /indicators, /country, /country/all/indicator/{code}).
- Coverage: All indicators, countries & aggregates; years 1960–2025 where available.
- Storage: MongoDB (worldbank_raw), 1 row per indicator × country × year; key {indicator|country|year}.
- Versioning: Every dump stamped with dump_as_of (UTC); derived tables reference the same snapshot.
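As a rough sketch of the ingestion step, the snippet below builds the request URL for one indicator and flattens one API observation into a stored row keyed by indicator|country|year. The base URL and the shape of the sample row follow the public World Bank v2 API; the helper names (indicator_url, to_record), the stored field names, and the choice to key on the ISO3 code are assumptions made for illustration and may differ from the actual pipeline.

```python
from datetime import datetime, timezone

BASE_URL = "https://api.worldbank.org/v2"  # public World Bank API root


def indicator_url(code: str, page: int = 1, per_page: int = 1000) -> str:
    """URL for one indicator across all countries and aggregates, as JSON."""
    return (f"{BASE_URL}/country/all/indicator/{code}"
            f"?format=json&page={page}&per_page={per_page}")


def to_record(row: dict, dump_as_of: str) -> dict:
    """Flatten one API observation into one stored row (field names assumed)."""
    indicator = row["indicator"]["id"]
    # Assumption: prefer the ISO3 code, fall back to the API's country id.
    country = row["countryiso3code"] or row["country"]["id"]
    year = int(row["date"])
    return {
        "_id": f"{indicator}|{country}|{year}",  # key {indicator|country|year}
        "indicator": indicator,
        "country": country,
        "year": year,
        "value": row["value"],  # None when no observation is reported
        "dump_as_of": dump_as_of,
    }


# Illustrative row shaped like an API observation; country and date are placeholders.
sample = {
    "indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
    "country": {"id": "ZZ", "value": "Exampleland"},
    "countryiso3code": "ZZZ",
    "date": "2020",
    "value": None,
}
stamp = datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat()  # placeholder snapshot time
record = to_record(sample, stamp)
print(record["_id"])
```

Using the composite key as the document id makes re-ingesting the same snapshot idempotent, which is one plausible reason for the key format described above.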
2) Data quality & governance
- Metadata: We keep id, name, source_id, unit, notes/definitions per indicator.
- Aggregates: World/regions/income groups are tagged and excluded from country-only rankings by default.
- Missing values: Shown as N/A. Short gaps (≤2 years) may be linearly interpolated and flagged as filled.
- Outliers: Winsorized per indicator-year across countries (default P5–P95).
- Units: We never mix unit families (e.g., current USD vs constant USD vs PPP).
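The gap-filling and winsorization rules above can be sketched as follows. This is a minimal illustration under stated assumptions: only interior gaps (observed values on both sides) are filled, percentiles use the inclusive linear-interpolation method from Python's statistics module, and the function names fill_short_gaps and winsorize are hypothetical.

```python
import statistics


def fill_short_gaps(values, max_gap=2):
    """Linearly interpolate interior runs of None no longer than max_gap.

    Returns (filled_values, flags); flags[i] is True where a value was filled.
    Leading or trailing gaps and longer runs are left as None.
    """
    out = list(values)
    flags = [False] * len(out)
    i = 0
    while i < len(out):
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < len(out) and out[i] is None:
            i += 1
        gap = i - start
        # Fill only when observed values exist on both sides of the run.
        if start > 0 and i < len(out) and gap <= max_gap:
            left, right = out[start - 1], out[i]
            for k in range(1, gap + 1):
                out[start + k - 1] = left + (right - left) * k / (gap + 1)
                flags[start + k - 1] = True
    return out, flags


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip one indicator-year cross-section to its [P5, P95] range."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return list(values)  # too few points to estimate percentiles
    cuts = statistics.quantiles(present, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [None if v is None else min(max(v, lo), hi) for v in values]


series = [1.0, None, None, 4.0, None, None, None, 8.0]
filled, flags = fill_short_gaps(series)
print(filled)  # the 2-year gap is filled, the 3-year gap stays None
print(flags)
```

Keeping the flags alongside the values lets downstream views show which points are interpolated rather than observed.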
3) Transformations & features
- Per-capita: x_pc = x / POP; % of GDP: x_%GDP = 100 * x / GDP.
- Log transform (levels): y = ln(x + ε). Logit (bounded %): z = ln(p/(1−p)), with p = x/100.
- Growth: YoY % 100*(x_t/x_{t−1} − 1); log-growth ln(x_t + ε) − ln(x_{t−1} + ε); CAGR (x_1/x_0)^(1/n) − 1.
- Rolling: mean/std (e.g., 3y) and OLS slope on last 5y.
- Polarity: indicators marked higher- or lower-is-better (e.g., unemployment, CO₂ are lower-better).
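The growth and transform formulas in this section translate fairly directly into code. The sketch below is illustrative only: the value of ε is not stated in the source, so EPS is an assumed placeholder, and the function names are hypothetical.

```python
import math

EPS = 1e-9  # assumed value for ε; the methodology does not state one


def yoy_pct(x_t, x_prev):
    """Year-over-year change in percent: 100 * (x_t / x_{t-1} - 1)."""
    return 100.0 * (x_t / x_prev - 1.0)


def log_growth(x_t, x_prev, eps=EPS):
    """Log-growth: ln(x_t + eps) - ln(x_{t-1} + eps)."""
    return math.log(x_t + eps) - math.log(x_prev + eps)


def cagr(x_start, x_end, n_years):
    """Compound annual growth rate over n_years, as a fraction."""
    return (x_end / x_start) ** (1.0 / n_years) - 1.0


def logit_pct(x):
    """Logit of a bounded percentage x in (0, 100)."""
    p = x / 100.0
    return math.log(p / (1.0 - p))


def ols_slope(ys):
    """OLS slope of ys regressed on time steps 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(ys))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


print(ols_slope([1, 3, 5, 7, 9]))  # slope over the last 5 observations
```

For the rolling slope described above, ols_slope would be applied to the most recent five annual values of a series.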
4) Normalization & ranks
- Percentile (world): empirical percentile per year across countries (after winsorization), scaled 0–100.
- Robust z-score: (x − median) / (1.4826 * MAD).
- Polarity handling: for lower-better we invert percentile (100 − s). Raw values are never inverted.
- Optional scopes: Region and Income-group percentiles for benchmarking.
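A small sketch of the normalization steps above. The exact percentile convention is not spelled out in this section, so the version here (share of values less than or equal to x) is an assumption; a midrank convention would be equally plausible. MAD is taken as the median absolute deviation from the median, and the function names are hypothetical.

```python
import statistics


def percentile_scores(values):
    """Empirical percentile on 0-100: share of values <= x (assumed convention)."""
    n = len(values)
    return [100.0 * sum(1 for v in values if v <= x) / n for x in values]


def apply_polarity(score, lower_is_better):
    """Invert the percentile score, never the raw value, for lower-better series."""
    return 100.0 - score if lower_is_better else score


def robust_z(values):
    """(x - median) / (1.4826 * MAD); returns None entries when MAD is zero."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [None] * len(values)
    return [(v - med) / (1.4826 * mad) for v in values]


cross_section = [1, 2, 3, 4, 100]  # toy values for one indicator-year
scores = percentile_scores(cross_section)
print(scores)
print([apply_polarity(s, lower_is_better=True) for s in scores])
print(robust_z(cross_section))
```

The factor 1.4826 rescales MAD so that the robust z-score is comparable to an ordinary z-score when the data are roughly normal.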
5) Headline KPIs (final)
Code | Label | Unit | Polarity | YoY | Notes |
---|---|---|---|---|---|
NY.GDP.PCAP.KD | GDP per capita (constant USD) | USD/person | Higher ↑ | % | Real, chained USD |
FP.CPI.TOTL.ZG | Inflation, consumer prices | % (annual) | Lower ↓ | Δ pp | Rate; clip extremes in charts |
NE.EXP.GNFS.ZS | Exports of goods & services | % of GDP | Higher ↑ | Δ pp | Openness proxy |
SL.UEM.TOTL.ZS | Unemployment, total | % of labor force | Lower ↓ | Δ pp | Youth unemployment tracked in personas |
SL.TLF.CACT.ZS | Labor force participation | % of 15+ | Higher ↑ | Δ pp | |
SP.POP.TOTL | Population | persons | — | % | Level; show YoY % |
SP.DYN.LE00.IN | Life expectancy at birth | years | Higher ↑ | Δ years | |
SE.TER.ENRR | Tertiary enrollment (gross) | % | Higher ↑ | Δ pp | |
SE.ADT.LITR.ZS | Adult literacy (15+) | % | Higher ↑ | Δ pp | If missing, show latest only |
IT.NET.USER.ZS | Individuals using the Internet | % of population | Higher ↑ | Δ pp | |
IT.CEL.SETS.P2 | Mobile cellular subscriptions | per 100 people | Higher ↑ | Δ | Connectivity proxy |
EN.ATM.CO2E.PC | CO₂ emissions | t/person | Lower ↓ | % | Environmental pressure proxy |
6) Persona indices
Composite scores (0–100) built from per-indicator percentiles (polarity applied). Equal weights across pillars and within pillars. A score is published if ≥60% of indicators are present; otherwise it is flagged low_coverage.
Job Seeker
Employment health
- SL.UEM.TOTL.ZS: Unemployment, total (Lower ↓)
- SL.UEM.1524.ZS: Youth unemployment (optional) (Lower ↓)
Participation & skills
- SL.TLF.CACT.ZS: Labor force participation (Higher ↑)
- SE.TER.ENRR: Tertiary enrollment (Higher ↑)
Momentum
- NY.GDP.MKTP.KD.ZG: Real GDP growth (Higher ↑)
- NE.EXP.GNFS.ZS: Exports % GDP (Higher ↑)
Digital access
- IT.NET.USER.ZS: Internet users % (Higher ↑)
- IT.CEL.SETS.P2: Mobile subs per 100 (Higher ↑)
Entrepreneur
Regulatory & legal
- IC.LGL.CRED.XQ: Strength of legal rights (Higher ↑)
- IC.BUS.NDNS.ZS: New business density (Higher ↑)
Access to finance
- FS.AST.PRVT.GD.ZS: Credit to private sector % GDP (Higher ↑)
- FB.AST.NPER.ZS: NPLs % of total (if present) (Lower ↓)
Infrastructure & power
- EG.ELC.ACCS.ZS: Access to electricity (Higher ↑)
- EG.ELC.RNEW.ZS: Renewable electricity output (Higher ↑)
Innovation & high-tech trade
- TX.VAL.TECH.MF.ZS: High-tech exports share (Higher ↑)
- IP.JRN.ARTC.SC: Sci/tech journal articles (Higher ↑; log before percentile)
Digital Nomad
Connectivity
- IT.NET.USER.ZS: Internet users % (Higher ↑)
- IT.NET.BBND.P2: Fixed broadband per 100 (if present) (Higher ↑)
- IT.CEL.SETS.P2: Mobile subs per 100 (Higher ↑)
Affordability & stability
- PA.NUS.PPPC.RF: Price level ratio (Lower ↓)
- FP.CPI.TOTL.ZG: Inflation % (Lower ↓)
Livability & safety
- SP.DYN.LE00.IN: Life expectancy (Higher ↑)
- EN.ATM.PM25.MC.M3: PM2.5 exposure (Lower ↓)
- SH.STA.HOMIC.ZS: Homicide rate (if present) (Lower ↓)
Expat Family
Health
- SP.DYN.LE00.IN: Life expectancy (Higher ↑)
- SH.XPD.CHEX.PC.CD: Health expend. per capita (Higher ↑; log before percentile)
- SH.IMM.MEAS.ZS: Measles immunization (Higher ↑)
Education
- SE.SEC.ENRR: Secondary enrollment (Higher ↑)
- SE.TER.ENRR: Tertiary enrollment (Higher ↑)
- SE.ADT.LITR.ZS: Adult literacy (Higher ↑)
Safety & environment
- SH.STA.HOMIC.ZS: Homicide rate (Lower ↓)
- EN.ATM.PM25.MC.M3: PM2.5 exposure (Lower ↓)
- EN.ATM.CO2E.PC: CO₂ per capita (Lower ↓)
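The composite rule stated at the top of this section (equal weights within and across pillars, 60% coverage gate) could look roughly like the sketch below, using the Job Seeker pillars as the example. Two details are assumptions: a pillar with no available indicators is skipped rather than scored as zero, and a low-coverage persona returns no score at all. The function name persona_score and the percentile inputs are illustrative placeholders, not real data.

```python
JOB_SEEKER = {
    "Employment health": ["SL.UEM.TOTL.ZS", "SL.UEM.1524.ZS"],
    "Participation & skills": ["SL.TLF.CACT.ZS", "SE.TER.ENRR"],
    "Momentum": ["NY.GDP.MKTP.KD.ZG", "NE.EXP.GNFS.ZS"],
    "Digital access": ["IT.NET.USER.ZS", "IT.CEL.SETS.P2"],
}


def persona_score(pillars, percentiles, min_coverage=0.60):
    """Return (score, status) from polarity-adjusted percentiles per indicator.

    pillars: {pillar name: [indicator codes]}
    percentiles: {indicator code: 0-100 score, or None when missing}
    """
    all_codes = [c for codes in pillars.values() for c in codes]
    present = [c for c in all_codes if percentiles.get(c) is not None]
    if len(present) / len(all_codes) < min_coverage:
        return None, "low_coverage"
    pillar_means = []
    for codes in pillars.values():
        vals = [percentiles[c] for c in codes if percentiles.get(c) is not None]
        if vals:  # assumption: pillars with no data are skipped
            pillar_means.append(sum(vals) / len(vals))
    return sum(pillar_means) / len(pillar_means), "ok"


# Placeholder percentiles for a fictional country (6 of 8 indicators present).
example = {
    "SL.UEM.TOTL.ZS": 80.0, "SL.UEM.1524.ZS": None,
    "SL.TLF.CACT.ZS": 60.0, "SE.TER.ENRR": 40.0,
    "NY.GDP.MKTP.KD.ZG": 70.0, "NE.EXP.GNFS.ZS": None,
    "IT.NET.USER.ZS": 90.0, "IT.CEL.SETS.P2": 50.0,
}
print(persona_score(JOB_SEEKER, example))
```

Averaging within pillars first means a pillar with three indicators carries the same weight as a pillar with two, which matches the equal-weights-across-pillars statement.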
7) Country profiles & comparisons
- Profiles: latest value + year + unit + YoY + world percentile; 10–20y trends; benchmarks (World/Region/Income).
- Comparisons: side-by-side KPIs, percentile bars, trend overlays; ranks exclude aggregates by default.
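To illustrate the default of excluding aggregates from rankings, here is a minimal sketch. The AGGREGATES set holds only the two codes this page mentions elsewhere (WLD, EUU); a real list would come from the stored country metadata. The function name rank_countries and the country codes in the example are hypothetical.

```python
AGGREGATES = {"WLD", "EUU"}  # illustrative subset only, not the full list


def rank_countries(values, higher_is_better=True, include_aggregates=False):
    """Rank {code: value} best-first; returns [(rank, code, value)].

    Missing values (None) are dropped; aggregates are excluded by default.
    """
    rows = [
        (code, v) for code, v in values.items()
        if v is not None and (include_aggregates or code not in AGGREGATES)
    ]
    rows.sort(key=lambda cv: cv[1], reverse=higher_is_better)
    return [(i + 1, code, v) for i, (code, v) in enumerate(rows)]


# Placeholder values for fictional country codes plus one aggregate.
toy = {"AAA": 3.0, "WLD": 5.0, "BBB": 7.0, "CCC": None}
print(rank_countries(toy))                          # countries only
print(rank_countries(toy, higher_is_better=False))  # lower-is-better indicator
```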
8) Opportunity mapping
- Level percentile L ∈ [0,100] and Trend percentile T ∈ [0,100] (YoY or 5y CAGR).
- Score: geometric mean O = √(L · T) with a small volatility penalty; eligibility requires coverage ≥70%, a latest observation no more than 2 years old, and inflation within bounds.
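A sketch of the opportunity score follows. The geometric mean is as stated above; the form of the volatility penalty and the inflation bounds are not specified in this section, so the subtractive penalty, its weight, and the bounds used below are purely hypothetical placeholders.

```python
import math


def opportunity_score(level_pct, trend_pct, volatility=0.0, penalty_weight=0.1):
    """Geometric mean of level and trend percentiles minus an assumed penalty."""
    base = math.sqrt(level_pct * trend_pct)
    # Assumption: a linear penalty, floored at zero; the real form may differ.
    return max(0.0, base - penalty_weight * volatility)


def is_eligible(coverage, years_since_latest, inflation_pct,
                inflation_bounds=(-5.0, 50.0)):
    """Eligibility gate; the inflation bounds are placeholder values."""
    lo, hi = inflation_bounds
    return (coverage >= 0.70
            and years_since_latest <= 2
            and lo <= inflation_pct <= hi)


print(opportunity_score(64, 36))                 # no penalty
print(opportunity_score(64, 36, volatility=10))  # with the assumed penalty
print(is_eligible(0.8, 1, 3.0))
```

A geometric mean rewards balance: a country with a high level percentile but a trend percentile of zero scores zero, whereas an arithmetic mean would still give it half credit.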
9) Forecasts (projections)
- Scope: smooth, well-covered annual series (GDP pc, inflation, unemployment, internet users, life expectancy, CO₂ pc).
- Transforms: log for levels; logit for bounded %.
- Models: random walk with drift (RWD), ARIMA(0,1,1) with drift, or ETS; chosen via rolling-origin CV (sMAPE/MASE).
- Uncertainty: 80/95% intervals; forecasts shown as dashed and labeled “Projection” with as_of_year.
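Model selection by rolling-origin cross-validation can be sketched in a few lines. To stay dependency-free, this example implements only a random walk with drift and compares it against a plain random walk without drift, which is a stand-in and not one of the listed models; ARIMA and ETS would need a forecasting library. Only one-step-ahead sMAPE is scored, and all function names are hypothetical.

```python
def rwd_forecast(train, h=1):
    """Random walk with drift: last value plus h times the mean past step."""
    drift = (train[-1] - train[0]) / (len(train) - 1)
    return train[-1] + h * drift


def rw_forecast(train, h=1):
    """Plain random walk (stand-in comparison model): repeat the last value."""
    return train[-1]


def smape(actuals, forecasts):
    """Symmetric MAPE in percent; a 0/0 term counts as zero error."""
    terms = []
    for a, f in zip(actuals, forecasts):
        denom = abs(a) + abs(f)
        terms.append(0.0 if denom == 0 else 2.0 * abs(f - a) / denom)
    return 100.0 * sum(terms) / len(terms)


def rolling_origin_smape(series, forecaster, min_train=5):
    """Refit at each origin and score the one-step-ahead forecast."""
    actuals, forecasts = [], []
    for origin in range(min_train, len(series)):
        forecasts.append(forecaster(series[:origin]))
        actuals.append(series[origin])
    return smape(actuals, forecasts)


def pick_model(series, candidates, min_train=5):
    """Return (best model name, {name: sMAPE}) with the lowest CV error."""
    scores = {name: rolling_origin_smape(series, fn, min_train)
              for name, fn in candidates.items()}
    return min(scores, key=scores.get), scores


series = [float(i) for i in range(1, 11)]  # toy linear trend
best, scores = pick_model(series, {"rwd": rwd_forecast, "rw": rw_forecast})
print(best, scores)
```

In the pipeline described above, the series would first be log- or logit-transformed, and forecasts would be back-transformed before display.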
10) Alerts
- Triggers: threshold/percentile crossings, large YoY/log-growth, trend breaks, anomalies (robust z), forecast breaches.
- Noise control: hysteresis (2 consecutive observations), cooldown windows, recency & coverage checks.
- Audit: each alert stores indicator code, year, value, percentile, and the rule that fired.
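The noise-control rules can be illustrated with a threshold trigger that applies hysteresis and a cooldown. This is a sketch under assumptions: the cooldown is counted in observations, its length (3) is a placeholder, and the function name threshold_alerts is hypothetical.

```python
def threshold_alerts(values, threshold, consecutive=2, cooldown=3):
    """Return the indices at which a threshold alert fires.

    Hysteresis: fire only once `consecutive` observations in a row exceed
    the threshold. Cooldown: after firing, stay silent for the next
    `cooldown` observations even if the condition still holds.
    """
    fired = []
    run = 0           # current streak of observations above the threshold
    last_fire = None  # index of the most recent alert
    for i, v in enumerate(values):
        run = run + 1 if (v is not None and v > threshold) else 0
        in_cooldown = last_fire is not None and i - last_fire <= cooldown
        if run >= consecutive and not in_cooldown:
            fired.append(i)
            last_fire = i
    return fired


obs = [1, 5, 1, 5, 5, 5, 5, 5, 5]
print(threshold_alerts(obs, threshold=4))
```

In this toy series the single spike at index 1 is ignored, the alert fires at index 4 after two consecutive breaches, and the next alert is held back until the cooldown has elapsed.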
11) LLM grounding
- Evidence bundle: compact JSON of numbers/years/units/sources from Mongo; the model only narrates from this evidence.
- Discipline: every numeric claim includes year + unit + source code (e.g., NY.GDP.PCAP.KD, 2023, WDI); no on-the-fly calculations.
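One way to picture the evidence bundle is as a list of small JSON objects that are validated before being handed to the model. The field names, helper names, and the value below are assumptions for illustration (the number is a placeholder, not a real observation); only the citation format mirrors the example given above.

```python
import json

REQUIRED = ("indicator", "year", "unit", "source", "value")


def make_evidence(indicator, country, year, value, unit, source="WDI"):
    """One evidence item: a number with everything needed to cite it."""
    return {"indicator": indicator, "country": country, "year": year,
            "value": value, "unit": unit, "source": source}


def validate_bundle(bundle):
    """Return a list of problems; an empty list means every item is citable."""
    problems = []
    for i, item in enumerate(bundle):
        missing = [k for k in REQUIRED if item.get(k) is None]
        if missing:
            problems.append(f"item {i}: missing {', '.join(missing)}")
    return problems


def cite(item):
    """Citation string in the form: source code, year, source."""
    return f"{item['indicator']}, {item['year']}, {item['source']}"


# Placeholder value for a fictional country code; not real data.
item = make_evidence("NY.GDP.PCAP.KD", "ZZZ", 2023, 1234.5, "constant USD/person")
bundle = [item]
print(json.dumps(bundle, indent=2))
print(cite(item), validate_bundle(bundle))
```

Rejecting a bundle with missing year, unit, or source before narration is one simple way to enforce the rule that the model never states an uncited number.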
12) Reproducibility & versioning
- Each release references a single dump_as_of and pipeline commit.
- Derived tables (features, country_views, persona_scores, opportunities, forecasts, alerts) are rebuilt end-to-end.
- Any page/report can be reproduced by (dump_as_of, country, year).
13) Limitations & ethics
- Some indicators have gaps or long lags; we surface the latest year explicitly.
- Method changes may affect comparability across time; versioning exposes changes.
- Aggregates (e.g., WLD/EUU) are benchmarks; rankings default to countries only.
- Forecasts are scenarios with uncertainty; they should complement expert judgement.
14) Contact
Questions or feedback? Email support@sufoniq.com.