Documentation

How ForecastRank Works

A deep dive into the models, metrics, and data pipeline behind objective weather forecast verification.

Composite Skill Scores · 5 Variable Groups · 6 NWP Models · 4,000+ Stations

Ranking System

ForecastRank ranks models using a composite skill score — a single number that measures how much better (or worse) a model performs compared to a naive persistence forecast.

What is Composite Skill?

The composite skill score is computed by aggregating error metrics (RMSE, MAE) across all lead hours and stations within a time period, then comparing against a persistence baseline. A score above 0 means the model beats persistence; below 0 means it does not.

- Adds Value (score > 0): The model outperforms a naive persistence forecast. Example: +0.312
- Strong Skill (score > 0.5): Consistently strong improvement over persistence. Example: +0.721
- Below Persistence (score < 0): The model performs worse than simply repeating the last observation. Example: −0.184

Persistence baseline: A persistence forecast assumes conditions at the current observation time will continue unchanged. It is the simplest possible forecast and sets the floor for skill — any useful model must beat it.
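As an illustrative sketch only (the exact aggregation weights across lead hours and stations are not specified here), a skill score of this form is commonly computed as one minus the ratio of the model's aggregated error to the persistence error:

```python
def skill_score(model_rmse: float, persistence_rmse: float) -> float:
    """Skill relative to persistence: 1 - model_error / baseline_error.
    Positive -> the model beats persistence; negative -> worse than persistence."""
    return 1.0 - model_rmse / persistence_rmse

# A model with an aggregated RMSE of 1.8 K against a persistence RMSE of 2.6 K:
print(round(skill_score(1.8, 2.6), 3))  # 0.308
```

A model exactly as good as persistence scores 0; halving the persistence error would score 0.5.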

Variable Groups

Rankings are broken down across five meteorological variable groups. Each group aggregates the relevant observed variables into a single composite skill score for that dimension.

- Temperature (temp): Surface air temperature (2m). Driven by solar radiation, advection, and boundary-layer mixing.
- Dewpoint (dewpoint): 2m dewpoint, a direct measure of low-level moisture content. Critical for comfort indices and convective initiation.
- Wind Speed (wind_speed): Surface wind speed. Errors here compound significantly for energy and aviation applications.
- Precip Amount (precip_amount): Quantitative precipitation, i.e. how much rain or snow fell. One of the hardest variables to forecast accurately.
- Precip Frequency (precip_freq): Categorical detection of precipitation events (wet/dry). Scored separately from amount to isolate timing errors.

Reading the Breakdown Table

The variable group breakdown table displays composite skill scores for each model × variable combination. Scores are color-coded by magnitude.

Example:

Model     | Temperature | Dewpoint | Wind Speed | Precip Amount | Precip Freq
ECMWF IFS | 0.721       | 0.312    | 0.198      | −0.042        | 0.001
HRRR      | 0.481       | 0.390    | 0.205      | −0.612        | No Data

Legend: Adds Value · Moderate Skill (> 0) · Weak < Persistence · Strong < Persistence (< −0.5) · No Data
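The color bands can be reproduced with a small threshold function. The cut-offs below are taken from the score categories described earlier (> 0.5 for the strongest band, < −0.5 for the weakest), and `None` stands in for missing data; these are assumptions, not an exact specification of the site's coloring:

```python
def skill_category(score):
    """Map a composite skill score to a display band (assumed cut-offs)."""
    if score is None:
        return "No Data"
    if score > 0.5:
        return "Strong Skill"
    if score > 0:
        return "Adds Value"
    if score < -0.5:
        return "Strong < Persistence"
    return "Weak < Persistence"  # covers -0.5 <= score <= 0

print(skill_category(0.721))   # Strong Skill
print(skill_category(-0.042))  # Weak < Persistence
```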

Weather Models

We verify the world's most advanced numerical weather prediction (NWP) models.

HRRR

High-Resolution Rapid Refresh

A NOAA model that updates hourly and specializes in short-term mesoscale phenomena such as thunderstorms and severe weather.

Coverage: CONUS and Alaska
Update Cycle: Hourly
Spatial Res.: 3 km
Temporal Res.: 1-hourly
Horizon: 48h (00/06/12/18Z) · 18h (intermediate cycles)

NBM

National Blend of Models

A calibrated blend of GFS, HRRR, ECMWF, and other models, designed to reduce systematic biases across CONUS, Hawaii, and Guam.

Coverage: CONUS, Hawaii, and Guam
Update Cycle: Hourly
Spatial Res.: 2.5 km
Temporal Res.: 1-hourly / 3-hourly
Horizon: 1–36h (1h) · 36–192h (3h) · 192–264h (6h)

GFS

Global Forecast System

NCEP's primary global model. Covers dozens of atmospheric and land-soil variables from surface temperature and winds to ozone concentration.

Coverage: Global
Update Cycle: 4×/day (00/06/12/18Z)
Spatial Res.: 13 km
Temporal Res.: 1-hourly / 3-hourly
Horizon: 1–120h (1h) · 120–384h (3h)

ECMWF IFS

Integrated Forecasting System

The public-tier version of ECMWF's operational system — widely regarded as the gold standard for global medium-range forecasting. A higher-resolution paid tier exists at 9 km.

Coverage: Global
Update Cycle: 2×/day (00/12Z)
Spatial Res.: 28 km (0.25°)
Temporal Res.: 3-hourly
Horizon: Up to 15 days

ECMWF AIFS (AI)

AI Forecast System

ECMWF's data-driven ML model using deep learning to predict atmospheric variables at competitive accuracy with a fraction of the compute cost of IFS.

Coverage: Global
Update Cycle: 2×/day (00/12Z)
Spatial Res.: 28 km (0.25°)
Temporal Res.: 6-hourly
Horizon: Up to 15 days

GraphCast GFS (AI)

NCEP ML Weather Prediction

An experimental NCEP system built on Google DeepMind's pre-trained GraphCast architecture for medium-range global forecasts.

Coverage: Global
Update Cycle: 4×/day (00/06/12/18Z)
Spatial Res.: 28 km (0.25°)
Temporal Res.: 6-hourly
Horizon: Up to 16 days

Verification Metrics

Industry-standard metrics used to evaluate model performance.

RMSE (Root Mean Square Error)

sqrt(mean((Forecast - Observation)²))

Measures the average magnitude of error, penalizing larger deviations more heavily than MAE. Lower is better.

MAE (Mean Absolute Error)

mean(abs(Forecast - Observation))

The linear average of all absolute errors. It provides a straightforward measure of how much, on average, the forecast differs from the actual value.

Bias (Mean Error)

mean(Forecast - Observation)

Indicates systematic over-forecasting (positive) or under-forecasting (negative). A bias of 0 is ideal.
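The three formulas above translate directly to code. A plain-Python sketch over paired forecast and observation lists:

```python
import math

def rmse(forecast, observed):
    """Root mean square error: penalizes large deviations quadratically."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))

def mae(forecast, observed):
    """Mean absolute error: the linear average of absolute errors."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)

def bias(forecast, observed):
    """Mean error: positive = over-forecasting, negative = under-forecasting."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)

fcst = [21.0, 19.5, 18.0]  # forecast temperatures (°C)
obs  = [20.0, 20.0, 18.0]  # matched observations (°C)
print(mae(fcst, obs))            # 0.5
print(round(bias(fcst, obs), 3)) # 0.167 (slight warm bias)
```

Note how RMSE (≈ 0.645 here) exceeds MAE on the same pairs whenever the errors are unequal.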

Precipitation Categorical Scores

All six metrics are derived from a 2×2 contingency table built by classifying every forecast–observation pair as wet or dry using a threshold of 1.0 mm. Each cell of the table has a name used in the formulas below.

             | Forecast Wet    | Forecast Dry
Observed Wet | Hit (a)         | Miss (b)
Observed Dry | False Alarm (c) | Correct Rejection (d)
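Building those four counts from paired precipitation values might look like the following sketch (whether the 1.0 mm threshold is inclusive is an assumption here, as are the names):

```python
def contingency_counts(forecast_mm, observed_mm, threshold=1.0):
    """Classify each forecast/observation pair as wet (>= threshold) or dry
    and tally hits (a), misses (b), false alarms (c), correct rejections (d)."""
    a = b = c = d = 0
    for f, o in zip(forecast_mm, observed_mm):
        fcst_wet, obs_wet = f >= threshold, o >= threshold
        if obs_wet and fcst_wet:
            a += 1  # hit
        elif obs_wet:
            b += 1  # miss
        elif fcst_wet:
            c += 1  # false alarm
        else:
            d += 1  # correct rejection
    return a, b, c, d

print(contingency_counts([2.0, 0.0, 3.0, 0.5], [1.5, 2.0, 0.0, 0.0]))  # (1, 1, 1, 1)
```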
POD

Probability of Detection

↑ higher
a / (a + b)

What fraction of observed wet events did the model correctly predict? Also called the hit rate. A POD of 1 means no wet event was missed.

Range: 0 – 1 · Perfect: 1

FAR

False Alarm Ratio

↓ lower
c / (a + c)

Of all the times the model forecast rain, what fraction was wrong? A FAR of 0 means every wet forecast verified.

Range: 0 – 1 · Perfect: 0

CSI

Critical Success Index

↑ higher
a / (a + b + c)

Also called the Threat Score. Combines hits, misses, and false alarms into one number. Does not credit correct dry-day rejections, so it is a stricter measure than accuracy.

Range: 0 – 1 · Perfect: 1

ETS

Equitable Threat Score

↑ higher
(a − aᵣ) / (a + b + c − aᵣ)

CSI corrected for hits expected by chance (aᵣ = (a+b)(a+c) / N). Scores near 0 indicate no skill above random; negative scores are below random.

Range: −⅓ – 1 · Perfect: 1

HSS

Heidke Skill Score

↑ higher
2(ad − bc) / [(a+b)(b+d) + (a+c)(c+d)]

Measures the fractional improvement of the forecast over a random forecast. Accounts for both wet and dry correct predictions. A score of 0 means no skill over random.

Range: −∞ – 1 · Perfect: 1

Freq Bias

Frequency Bias

= 1.0
(a + c) / (a + b)

Ratio of how often the model predicted rain to how often it actually rained. > 1 means the model over-forecasts precipitation (wet bias); < 1 means under-forecasting (dry bias).

Range: 0 – ∞ · Perfect: 1.0
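All six categorical scores follow directly from the four counts. A minimal sketch (no guards against zero denominators, which would need handling in practice):

```python
def categorical_scores(a, b, c, d):
    """Compute the six categorical scores from 2x2 contingency counts:
    a = hits, b = misses, c = false alarms, d = correct rejections."""
    n = a + b + c + d
    a_r = (a + b) * (a + c) / n  # hits expected by chance
    return {
        "POD": a / (a + b),                      # hit rate
        "FAR": c / (a + c),                      # wrong wet forecasts
        "CSI": a / (a + b + c),                  # threat score
        "ETS": (a - a_r) / (a + b + c - a_r),    # chance-corrected CSI
        "HSS": 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d)),
        "FreqBias": (a + c) / (a + b),           # forecast wet / observed wet
    }

scores = categorical_scores(a=40, b=10, c=20, d=30)
print(scores["POD"], scores["ETS"], scores["FreqBias"])  # 0.8 0.25 1.2
```

With these counts the model catches 80% of wet events but forecasts rain 1.2× as often as it occurs, a wet bias.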

Our Methodology

How we turn billions of data points into actionable rankings.

Data Ingestion

We ingest real-time METAR observations from thousands of stations. These are our "ground truth" reference points.

Verification Pairing

For every station we identify the surrounding model grid points and apply bilinear interpolation. Before comparison, we correct for systematic elevation differences between the station's true altitude and the model's terrain height using a standard lapse-rate adjustment. Pairs are then matched by valid time.
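The two numerical steps in that pairing can be sketched as follows. This is illustrative only: the 6.5 K/km standard-atmosphere lapse rate and all function names are assumptions, not the exact pipeline.

```python
def bilinear(v00, v10, v01, v11, fx, fy):
    """Interpolate within a grid cell from its four corner values;
    fx, fy are the station's fractional position in [0, 1]."""
    top = v00 * (1 - fx) + v10 * fx
    bottom = v01 * (1 - fx) + v11 * fx
    return top * (1 - fy) + bottom * fy

def lapse_rate_correct(temp_c, model_elev_m, station_elev_m, lapse=6.5e-3):
    """Adjust a model temperature to station elevation using a standard
    lapse rate (6.5 K per km): temperature rises moving down-slope."""
    return temp_c + (model_elev_m - station_elev_m) * lapse

# Interpolated model 2m temperature at a station 200 m below the model terrain:
t = bilinear(12.0, 13.0, 11.5, 12.5, fx=0.5, fy=0.5)  # 12.25
print(lapse_rate_correct(t, model_elev_m=800.0, station_elev_m=600.0))  # ≈ 13.55
```

Because the station sits below the model's terrain height, the raw interpolated value would verify with a spurious cold bias; the correction removes that systematic component before scoring.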

Global Station Network

ForecastRank collects real-time observations from about 4,000 weather stations globally. Observations are updated hourly from multiple public real-time sources, including NOAA. Model error statistics are updated daily as new observation and model forecast data are consolidated. Note that observations may contain errors and may be missing due to data outages.

4,000+ METAR Stations · Elevation-corrected Pairing · 1.0 mm Precip Threshold · Daily Accuracy Updates