Methodology

Last updated: N/A • Data source: World Bank (WDI & related)

Reproducible • Grounded • LLM-safe

1) Data ingestion & scope

API: World Bank v2 JSON endpoints (/sources, /indicators, /country, /country/all/indicator/{code}).
Coverage: All indicators, countries & aggregates; years 1960–2025 where available.
Storage: MongoDB (worldbank_raw), 1 document per indicator × country × year; key {indicator|country|year}.
Versioning: Every dump stamped with dump_as_of (UTC); derived tables reference the same snapshot.
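
The storage key and snapshot stamp described above can be sketched as follows. This is a minimal illustration, not the production ingester: the function name to_raw_doc, the output field names, and the sample values are assumptions, and the input shape follows the World Bank v2 JSON observation format as we understand it.

```python
def to_raw_doc(record, dump_as_of):
    """Flatten one World Bank v2 observation into a storage document.

    `record` is one element of the data array returned by
    /country/all/indicator/{code}?format=json. Output field names are
    hypothetical; only the {indicator|country|year} key comes from the text.
    """
    indicator = record["indicator"]["id"]
    # Fall back to the 2-letter id if the ISO3 code is empty.
    country = record.get("countryiso3code") or record["country"]["id"]
    year = int(record["date"])  # annual series carry a 4-digit year string
    return {
        "_id": f"{indicator}|{country}|{year}",
        "indicator": indicator,
        "country": country,
        "year": year,
        "value": record["value"],  # None when no observation is reported
        "dump_as_of": dump_as_of,  # same UTC stamp for the whole dump
    }


# Illustrative record; the value is made up, not real data.
sample = {
    "indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
    "country": {"id": "KE", "value": "Kenya"},
    "countryiso3code": "KEN",
    "date": "2020",
    "value": 1000.0,
}
doc = to_raw_doc(sample, "2024-01-01T00:00:00Z")
print(doc["_id"])
```

Using the composite key as the document id makes re-ingesting the same observation an overwrite rather than a duplicate, which is one plausible reason for the key format.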

2) Data quality & governance

Metadata: We keep id, name, source_id, unit, notes/definitions per indicator.
Aggregates: World/regions/income groups are tagged and excluded from country-only rankings by default.
Missing values: Shown as N/A. Short gaps (≤2 years) may be linearly interpolated and flagged as filled.
Outliers: Winsorized per indicator-year across countries (default P5–P95).
Units: We never mix unit families (e.g., current USD vs constant USD vs PPP).
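
The gap-filling and winsorization rules above could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, only interior gaps are filled (leading and trailing gaps stay missing), and the percentile uses linear interpolation between order statistics, which is one of several common conventions.

```python
def fill_short_gaps(values, max_gap=2):
    """Linearly interpolate interior runs of None up to max_gap long.

    Returns (filled_values, filled_flags) so filled points can be flagged.
    """
    out = list(values)
    flags = [False] * len(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < n and values[i] is None:
            i += 1
        end = i  # index of the first observed value after the gap, or n
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue  # leading, trailing, or long gap: leave as missing
        left, right = values[start - 1], values[end]
        for k in range(start, end):
            out[k] = left + (right - left) * (k - start + 1) / (gap + 1)
            flags[k] = True
    return out, flags


def percentile(sorted_vals, q):
    """Percentile q (0..100) of a sorted list, linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def winsorize(values, lower=5, upper=95):
    """Clip one indicator-year cross-section to its [P5, P95] range."""
    present = sorted(v for v in values if v is not None)
    if not present:
        return list(values)
    lo, hi = percentile(present, lower), percentile(present, upper)
    return [None if v is None else min(max(v, lo), hi) for v in values]


filled, flags = fill_short_gaps([1.0, None, None, 4.0])
print(filled, flags)
```

Note that fill_short_gaps runs along one country's time series, while winsorize runs across countries for a single indicator and year.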

3) Transformations & features

Per-capita: x_pc = x / POP; % of GDP: x_%GDP = 100 * x / GDP.
Log transform (levels): y = ln(x + ε). Logit (bounded %): z = ln(p/(1−p)), with p = x/100.
Growth: YoY% = 100 * (x_t / x_{t−1} − 1); log-growth = ln(x_t + ε) − ln(x_{t−1} + ε); CAGR = (x_1 / x_0)^(1/n) − 1, where x_0 and x_1 are the start and end values n years apart.
Rolling: mean/std (e.g., 3y) and OLS slope on last 5y.
Polarity: indicators marked higher- or lower-is-better (e.g., unemployment, CO₂ are lower-better).
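
The formulas in this section translate directly into small functions. The sketch below assumes a value for ε (the text does not give one) and uses hypothetical function names. The OLS slope regresses values on the time index 0..n−1, so its unit is "value per year" for annual data.

```python
import math

EPS = 1e-9  # assumed; the methodology does not state the value of ε


def yoy_pct(curr, prev):
    """Year-over-year change in percent: 100 * (x_t / x_{t-1} - 1)."""
    return 100 * (curr / prev - 1)


def log_growth(curr, prev, eps=EPS):
    """Log-growth: ln(x_t + eps) - ln(x_{t-1} + eps)."""
    return math.log(curr + eps) - math.log(prev + eps)


def cagr(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


def logit_pct(x):
    """Logit of a bounded percentage x in (0, 100)."""
    p = x / 100
    return math.log(p / (1 - p))


def ols_slope(values):
    """Least-squares slope of values against the index 0..n-1."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


print(yoy_pct(110.0, 100.0), cagr(100.0, 121.0, 2), ols_slope([2, 4, 6, 8, 10]))
```

For the "OLS slope on last 5y" feature, one would pass the five most recent annual values, for example ols_slope(series[-5:]).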

4) Normalization & ranks

Percentile (world): empirical percentile per year across countries (after winsorization), scaled 0–100.
Robust z-score: (x − median) / (1.4826 * MAD).
Polarity handling: for lower-better we invert percentile (100 − s). Raw values are never inverted.
Optional scopes: Region and Income-group percentiles for benchmarking.
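
A minimal sketch of the percentile, polarity inversion, and robust z-score follows. The mid-rank treatment of ties in percentile_rank is an assumption (the text only says "empirical percentile"), and all function names are hypothetical.

```python
import statistics


def percentile_rank(x, population):
    """Empirical percentile of x within population, scaled 0-100.

    Uses the mid-rank convention: values below x count fully,
    ties count half. This tie rule is an assumption.
    """
    below = sum(1 for v in population if v < x)
    equal = sum(1 for v in population if v == x)
    return 100 * (below + 0.5 * equal) / len(population)


def polarity_adjusted(score, higher_is_better):
    """Invert the percentile (100 - s) for lower-is-better indicators."""
    return score if higher_is_better else 100 - score


def robust_z(x, population):
    """(x - median) / (1.4826 * MAD); None when MAD is zero."""
    med = statistics.median(population)
    mad = statistics.median(abs(v - med) for v in population)
    if mad == 0:
        return None
    return (x - med) / (1.4826 * mad)


values = [10.0, 20.0, 30.0, 40.0]
s = percentile_rank(30.0, values)
print(s, polarity_adjusted(s, higher_is_better=False))
```

Only the score passes through polarity_adjusted; the raw value is left as reported, matching the rule that raw values are never inverted.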

5) Headline KPIs (final)

Code	Label	Unit	Polarity	YoY	Notes
NY.GDP.PCAP.KD	GDP per capita (constant USD)	USD/person	Higher ↑	%	Real, chained USD
FP.CPI.TOTL.ZG	Inflation, consumer prices	% (annual)	Lower ↓	Δ pp	Rate; clip extremes in charts
NE.EXP.GNFS.ZS	Exports of goods & services	% of GDP	Higher ↑	Δ pp	Openness proxy
SL.UEM.TOTL.ZS	Unemployment, total	% of labor force	Lower ↓	Δ pp	Youth unemployment tracked in personas
SL.TLF.CACT.ZS	Labor force participation	% of 15+	Higher ↑	Δ pp
SP.POP.TOTL	Population	persons	—	%	Level; show YoY %
SP.DYN.LE00.IN	Life expectancy at birth	years	Higher ↑	Δ years
SE.TER.ENRR	Tertiary enrollment (gross)	%	Higher ↑	Δ pp
SE.ADT.LITR.ZS	Adult literacy (15+)	%	Higher ↑	Δ pp	If missing, show latest only
IT.NET.USER.ZS	Individuals using the Internet	% of population	Higher ↑	Δ pp
IT.CEL.SETS.P2	Mobile cellular subscriptions	per 100 people	Higher ↑	Δ	Connectivity proxy
EN.ATM.CO2E.PC	CO₂ emissions	t/person	Lower ↓	%	Environmental pressure proxy

6) Persona indices

Composite scores (0–100) are built from per-indicator percentiles (polarity applied), with equal weights across pillars and within pillars. A score is published if ≥60% of its indicators are present; otherwise it is flagged low_coverage.
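
The aggregation can be sketched as a mean of pillar means. Two details below are assumptions rather than stated rules: missing indicators are skipped when averaging within a pillar, and the score is withheld (None) when coverage falls below the threshold. The function name and the example percentiles are hypothetical.

```python
def persona_score(pillars, percentiles, min_coverage=0.6):
    """Equal-weight composite of polarity-adjusted percentiles.

    pillars:     {pillar_name: [indicator_code, ...]}
    percentiles: {indicator_code: score 0-100, or None if missing}
    """
    all_codes = [c for codes in pillars.values() for c in codes]
    present = [c for c in all_codes if percentiles.get(c) is not None]
    coverage = len(present) / len(all_codes)

    pillar_means = []
    for codes in pillars.values():
        vals = [percentiles[c] for c in codes if percentiles.get(c) is not None]
        if vals:  # a pillar with no data contributes nothing
            pillar_means.append(sum(vals) / len(vals))

    low_coverage = coverage < min_coverage
    score = None
    if pillar_means and not low_coverage:
        score = sum(pillar_means) / len(pillar_means)
    return {"score": score, "coverage": coverage, "low_coverage": low_coverage}


# Made-up percentiles for two Job Seeker pillars.
pillars = {
    "Employment health": ["SL.UEM.TOTL.ZS", "SL.UEM.1524.ZS"],
    "Digital access": ["IT.NET.USER.ZS", "IT.CEL.SETS.P2"],
}
percentiles = {
    "SL.UEM.TOTL.ZS": 60.0,
    "SL.UEM.1524.ZS": None,
    "IT.NET.USER.ZS": 80.0,
    "IT.CEL.SETS.P2": 40.0,
}
print(persona_score(pillars, percentiles))
```

Averaging within pillars first keeps a pillar with many indicators from outweighing a pillar with few.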

Job Seeker

Employment health

SL.UEM.TOTL.ZS — Unemployment, total (Lower ↓)
SL.UEM.1524.ZS — Youth unemployment (optional) (Lower ↓)

Participation & skills

SL.TLF.CACT.ZS — Labor force participation (Higher ↑)
SE.TER.ENRR — Tertiary enrollment (Higher ↑)

Momentum

NY.GDP.MKTP.KD.ZG — Real GDP growth (Higher ↑)
NE.EXP.GNFS.ZS — Exports % GDP (Higher ↑)

Digital access

IT.NET.USER.ZS — Internet users % (Higher ↑)
IT.CEL.SETS.P2 — Mobile subs per 100 (Higher ↑)

Entrepreneur

Regulatory & legal

IC.LGL.CRED.XQ — Strength of legal rights (Higher ↑)
IC.BUS.NDNS.ZS — New business density (Higher ↑)

Access to finance

FS.AST.PRVT.GD.ZS — Credit to private sector % GDP (Higher ↑)
FB.AST.NPER.ZS — NPLs % of total gross loans (if present) (Lower ↓)

Infrastructure & power

EG.ELC.ACCS.ZS — Access to electricity (Higher ↑)
EG.ELC.RNEW.ZS — Renewable electricity output (Higher ↑)

Innovation & high-tech trade

TX.VAL.TECH.MF.ZS — High-tech exports share (Higher ↑)
IP.JRN.ARTC.SC — Sci/tech journal articles (Higher ↑ (log before percentile))

Digital Nomad

Connectivity

IT.NET.USER.ZS — Internet users % (Higher ↑)
IT.NET.BBND.P2 — Fixed broadband per 100 (if present) (Higher ↑)
IT.CEL.SETS.P2 — Mobile subs per 100 (Higher ↑)

Affordability & stability

PA.NUS.PPPC.RF — Price level ratio (Lower ↓)
FP.CPI.TOTL.ZG — Inflation % (Lower ↓)

Livability & safety

SP.DYN.LE00.IN — Life expectancy (Higher ↑)
EN.ATM.PM25.MC.M3 — PM2.5 exposure (Lower ↓)
SH.STA.HOMIC.ZS — Homicide rate (if present) (Lower ↓)

Expat Family

Health

SP.DYN.LE00.IN — Life expectancy (Higher ↑)
SH.XPD.CHEX.PC.CD — Health expend. per capita (Higher ↑ (log before percentile))
SH.IMM.MEAS.ZS — Measles immunization (Higher ↑)

Education

SE.SEC.ENRR — Secondary enrollment (Higher ↑)
SE.TER.ENRR — Tertiary enrollment (Higher ↑)
SE.ADT.LITR.ZS — Adult literacy (Higher ↑)

Safety & environment

SH.STA.HOMIC.ZS — Homicide rate (Lower ↓)
EN.ATM.PM25.MC.M3 — PM2.5 exposure (Lower ↓)
EN.ATM.CO2E.PC — CO₂ per capita (Lower ↓)

7) Country profiles & comparisons

Profiles: latest value + year + unit + YoY + world percentile; 10–20y trends; benchmarks (World/Region/Income).
Comparisons: side-by-side KPIs, percentile bars, trend overlays; ranks exclude aggregates by default.

8) Opportunity mapping

Level percentile L ∈ [0,100] and Trend percentile T ∈ [0,100] (YoY or 5y CAGR).
Score: geometric mean O = √(L · T), less a small volatility penalty. Eligibility: coverage ≥70%, latest observation no more than 2 years old, inflation within bounds.
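
As an illustration, the score and its filters might be computed as below. The form of the volatility penalty is not specified in this methodology, so the subtraction used here (penalty_weight times a volatility measure, floored at zero) is purely an assumption, as are the function and parameter names. The inflation bounds are also unspecified and are passed in as a precomputed flag.

```python
import math


def opportunity_score(level_pct, trend_pct, volatility=0.0, penalty_weight=1.0):
    """Geometric mean of level and trend percentiles, minus a penalty.

    The linear penalty is a placeholder assumption, not the documented rule.
    """
    base = math.sqrt(level_pct * trend_pct)
    return max(0.0, base - penalty_weight * volatility)


def is_eligible(coverage, latest_year, current_year, inflation_in_bounds):
    """Coverage >= 70%, latest observation <= 2 years old, inflation OK."""
    return (
        coverage >= 0.70
        and current_year - latest_year <= 2
        and inflation_in_bounds
    )


print(opportunity_score(64.0, 36.0))
print(is_eligible(0.8, 2023, 2024, True))
```

The geometric mean rewards balance: a country at L = 64 and T = 36 scores 48, below the arithmetic mean of 50, and a zero on either axis yields a zero overall.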

9) Forecasts (projections)

Scope: smooth, well-covered annual series (GDP pc, inflation, unemployment, internet users, life expectancy, CO₂ pc).
Transforms: log for levels; logit for bounded %.
Models: random walk with drift (RWD), ARIMA(0,1,1) with drift, or ETS; the model is chosen via rolling-origin cross-validation (sMAPE/MASE).
Uncertainty: 80/95% intervals; forecasts shown as dashed and labeled “Projection” with as_of_year.
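
To make the simplest candidate model and the two selection metrics concrete, here is a sketch of a random-walk-with-drift point forecast on the log scale, plus sMAPE and MASE. It omits the prediction intervals and the ARIMA/ETS candidates. Function names are hypothetical, and the sMAPE variant shown (0 to 200 scale) is one of several in use.

```python
import math


def rwd_forecast(history, horizon, log_scale=True):
    """Random walk with drift: last value plus h times the average step.

    With log_scale=True the drift is estimated on ln(x) and the
    forecast is back-transformed, so levels must be positive.
    """
    y = [math.log(v) for v in history] if log_scale else list(history)
    drift = (y[-1] - y[0]) / (len(y) - 1)
    path = [y[-1] + h * drift for h in range(1, horizon + 1)]
    return [math.exp(v) for v in path] if log_scale else path


def smape(actual, forecast):
    """Symmetric MAPE in percent; undefined if a pair is (0, 0)."""
    terms = [2 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)]
    return 100 * sum(terms) / len(terms)


def mase(train, actual, forecast):
    """MAE scaled by the in-sample one-step naive forecast MAE."""
    scale = sum(abs(b - a) for a, b in zip(train, train[1:])) / (len(train) - 1)
    mae = sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


print(rwd_forecast([100.0, 110.0, 121.0], horizon=2))
```

In rolling-origin cross-validation, each candidate would be refit on successively longer prefixes of the series and scored on the next observations with these metrics. A MASE below 1 means the model beat the naive forecast on average.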

10) Alerts

Triggers: threshold/percentile crossings, large YoY/log-growth, trend breaks, anomalies (robust z), forecast breaches.
Noise control: hysteresis (2 consecutive observations), cooldown windows, recency & coverage checks.
Audit: each alert stores indicator code, year, value, percentile, and the rule that fired.
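
The hysteresis and cooldown logic for a simple threshold rule can be sketched like this. The function name, the rule text format, and the choice to measure cooldown in years are assumptions. A real audit record would also carry the percentile, which is omitted here.

```python
def threshold_alerts(indicator, series, threshold, consecutive=2, cooldown=0):
    """Fire when value > threshold for `consecutive` observations in a row.

    series: list of (year, value) pairs sorted by year; value may be None.
    cooldown: minimum number of years between two alerts.
    """
    alerts = []
    streak = 0
    last_fired = None
    for year, value in series:
        if value is not None and value > threshold:
            streak += 1
        else:
            streak = 0  # a miss or a gap resets the run
        if streak == consecutive:  # fire once per run, not on every extension
            if last_fired is None or year - last_fired > cooldown:
                alerts.append({
                    "indicator": indicator,
                    "year": year,
                    "value": value,
                    "rule": f"value > {threshold} for {consecutive} consecutive observations",
                })
                last_fired = year
    return alerts


# Made-up inflation readings: the single spike in 2019 does not fire.
series = [(2018, 4.0), (2019, 11.0), (2020, 9.0), (2021, 12.0), (2022, 13.0), (2023, 14.0)]
for alert in threshold_alerts("FP.CPI.TOTL.ZG", series, threshold=10.0):
    print(alert)
```

Requiring two consecutive breaches trades one year of latency for fewer alerts caused by one-off spikes or later data revisions.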

11) LLM grounding

Evidence bundle: compact JSON of numbers/years/units/sources from Mongo; the model only narrates from this evidence.
Discipline: every numeric claim includes year + unit + source code (e.g., NY.GDP.PCAP.KD, 2023, WDI); no on-the-fly calculations.
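
One way to enforce the "year + unit + source code" discipline is to validate the evidence bundle before it reaches the model. The sketch below is illustrative only: the bundle layout, the field names, and the function name are assumptions, and the GDP value is made up.

```python
import json

REQUIRED_FIELDS = ("code", "year", "unit", "value", "source")


def build_evidence_bundle(country, facts, dump_as_of):
    """Return compact JSON; reject any fact missing a required field."""
    for fact in facts:
        missing = [f for f in REQUIRED_FIELDS if fact.get(f) is None]
        if missing:
            raise ValueError(f"fact {fact.get('code')!r} is missing {missing}")
    bundle = {"country": country, "dump_as_of": dump_as_of, "facts": facts}
    return json.dumps(bundle, separators=(",", ":"), sort_keys=True)


facts = [
    {
        "code": "NY.GDP.PCAP.KD",
        "year": 2023,
        "unit": "USD/person",
        "value": 1234.5,  # made-up number for illustration
        "source": "WDI",
    }
]
print(build_evidence_bundle("KEN", facts, "2024-01-01T00:00:00Z"))
```

Because derived figures such as YoY changes or percentiles are precomputed and placed in the bundle, the model never has to do arithmetic itself.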

12) Reproducibility & versioning

Each release references a single dump_as_of and pipeline commit.
Derived tables (features, country_views, persona_scores, opportunities, forecasts, alerts) are rebuilt end-to-end.
Any page or report can be reproduced from the tuple (dump_as_of, country, year).

13) Limitations & ethics

Some indicators have gaps or long lags; we surface the latest year explicitly.
Method changes may affect comparability across time; versioning exposes changes.
Aggregates (e.g., WLD/EUU) are benchmarks; rankings default to countries only.
Forecasts are scenarios with uncertainty; they should complement expert judgement.

14) Contact

Questions or feedback? Email support@sufoniq.com.