Documentation

How ForecastRank Works

A deep dive into the models, metrics, and data pipeline behind objective weather forecast verification.

Composite Skill Scores · 5 Variable Groups · 6 NWP Models · 4,000+ Stations

Ranking System

ForecastRank ranks models using a composite skill score — a single number that measures how much better (or worse) a model performs compared to a naive persistence forecast.

What is Composite Skill?

The composite skill score is computed by aggregating error metrics (RMSE, MAE) across all lead hours and stations within a time period, then comparing against a persistence baseline. A score above 0 means the model beats persistence; below 0 means it does not.

- Adds Value (score > 0): The model outperforms a naive persistence forecast. Example: +0.312
- Strong Skill (score > 0.5): Consistently strong improvement over persistence. Example: +0.721
- Below Persistence (score < 0): The model performs worse than simply repeating the last observation. Example: −0.184

Persistence baseline: A persistence forecast assumes conditions at the current observation time will continue unchanged. It is the simplest possible forecast and sets the floor for skill — any useful model must beat it.
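As an illustrative sketch only (the exact aggregation weights across lead hours and stations are not specified here), a skill score of this form is commonly computed as one minus the ratio of the model's aggregated error to the persistence error:

```python
def skill_score(model_rmse: float, persistence_rmse: float) -> float:
    """Skill relative to persistence: 1 - model_error / baseline_error.
    Positive -> the model beats persistence; negative -> worse than persistence."""
    return 1.0 - model_rmse / persistence_rmse

# A model with an aggregated RMSE of 1.8 K against a persistence RMSE of 2.6 K:
print(round(skill_score(1.8, 2.6), 3))  # 0.308
```

A model exactly as good as persistence scores 0; halving the persistence error would score 0.5.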

Variable Groups

Rankings are broken down across five meteorological variable groups. Each group aggregates the relevant observed variables into a single composite skill score for that dimension.

- Temperature (temp): Surface air temperature (2m). Driven by solar radiation, advection, and boundary-layer mixing.
- Dewpoint (dewpoint): 2m dewpoint, a direct measure of low-level moisture content. Critical for comfort indices and convective initiation.
- Wind Speed (wind_speed): Surface wind speed. Errors here compound significantly for energy and aviation applications.
- Precip Amount (precip_amount): Quantitative precipitation, i.e. how much rain or snow fell. One of the hardest variables to forecast accurately.
- Precip Frequency (precip_freq): Categorical detection of precipitation events (wet/dry). Scored separately from amount to isolate timing errors.

Reading the Breakdown Table

The variable group breakdown table displays composite skill scores for each model × variable combination. Scores are color-coded by magnitude.

Example:

Model     | Temperature | Dewpoint | Wind Speed | Precip Amount | Precip Freq
ECMWF IFS | 0.721       | 0.312    | 0.198      | −0.042        | 0.001
HRRR      | 0.481       | 0.390    | 0.205      | −0.612        | No Data

Legend: Adds Value · Moderate Skill (> 0) · Weak < Persistence · Strong < Persistence (< −0.5) · No Data
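The color bands can be reproduced with a small threshold function. The cut-offs below are taken from the score categories described earlier (> 0.5 for the strongest band, < −0.5 for the weakest), and `None` stands in for missing data; these are assumptions, not an exact specification of the site's coloring:

```python
def skill_category(score):
    """Map a composite skill score to a display band (assumed cut-offs)."""
    if score is None:
        return "No Data"
    if score > 0.5:
        return "Strong Skill"
    if score > 0:
        return "Adds Value"
    if score < -0.5:
        return "Strong < Persistence"
    return "Weak < Persistence"  # covers -0.5 <= score <= 0

print(skill_category(0.721))   # Strong Skill
print(skill_category(-0.042))  # Weak < Persistence
```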

Weather Models

We verify the world's most advanced numerical weather prediction (NWP) models.

HRRR

High-Resolution Rapid Refresh

A NOAA model that updates hourly and specializes in short-term mesoscale phenomena such as thunderstorms and severe weather.

Coverage: CONUS and Alaska
Update Cycle: Hourly
Spatial Res.: 3 km
Temporal Res.: 1-hourly
Horizon: 48h (00/06/12/18Z) · 18h (intermediate cycles)

NBM

National Blend of Models

A calibrated blend of GFS, HRRR, ECMWF, and other models, designed to reduce systematic biases across CONUS, Hawaii, and Guam.

Coverage: CONUS, Hawaii, and Guam
Update Cycle: Hourly
Spatial Res.: 2.5 km
Temporal Res.: 1-hourly / 3-hourly
Horizon: 1–36h (1h) · 36–192h (3h) · 192–264h (6h)

GFS

Global Forecast System

NCEP's primary global model. Covers dozens of atmospheric and land-soil variables from surface temperature and winds to ozone concentration.

Coverage: Global
Update Cycle: 4×/day (00/06/12/18Z)
Spatial Res.: 13 km
Temporal Res.: 1-hourly / 3-hourly
Horizon: 1–120h (1h) · 120–384h (3h)

ECMWF IFS

Integrated Forecasting System

The public-tier version of ECMWF's operational system — widely regarded as the gold standard for global medium-range forecasting. A higher-resolution paid tier exists at 9 km.

Coverage: Global
Update Cycle: 2×/day (00/12Z)
Spatial Res.: 28 km (0.25°)
Temporal Res.: 3-hourly
Horizon: Up to 15 days

ECMWF AIFS (AI)

AI Forecast System

ECMWF's data-driven ML model using deep learning to predict atmospheric variables at competitive accuracy with a fraction of the compute cost of IFS.

Coverage: Global
Update Cycle: 2×/day (00/12Z)
Spatial Res.: 28 km (0.25°)
Temporal Res.: 6-hourly
Horizon: Up to 15 days

GraphCast GFS (AI)

NCEP ML Weather Prediction

An experimental NCEP system built on Google DeepMind's pre-trained GraphCast architecture for medium-range global forecasts.

Coverage: Global
Update Cycle: 4×/day (00/06/12/18Z)
Spatial Res.: 28 km (0.25°)
Temporal Res.: 6-hourly
Horizon: Up to 16 days

Verification Metrics

Industry-standard metrics used to evaluate model performance.

RMSE (Root Mean Square Error)

sqrt(mean((Forecast - Observation)²))

Measures the average magnitude of error, penalizing larger deviations more heavily than MAE. Lower is better.

MAE (Mean Absolute Error)

mean(abs(Forecast - Observation))

The linear average of all absolute errors. It provides a straightforward measure of how much, on average, the forecast differs from the actual value.

Bias (Mean Error)

mean(Forecast - Observation)

Indicates systematic over-forecasting (positive) or under-forecasting (negative). A bias of 0 is ideal.
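The three formulas above translate directly to code. A plain-Python sketch over paired forecast and observation lists:

```python
import math

def rmse(forecast, observed):
    """Root mean square error: penalizes large deviations quadratically."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))

def mae(forecast, observed):
    """Mean absolute error: the linear average of absolute errors."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)

def bias(forecast, observed):
    """Mean error: positive = over-forecasting, negative = under-forecasting."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)

fcst = [21.0, 19.5, 18.0]  # forecast temperatures (°C)
obs  = [20.0, 20.0, 18.0]  # matched observations (°C)
print(mae(fcst, obs))            # 0.5
print(round(bias(fcst, obs), 3)) # 0.167 (slight warm bias)
```

Note how RMSE (≈ 0.645 here) exceeds MAE on the same pairs whenever the errors are unequal.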

Precipitation Categorical Scores

All six metrics are derived from a 2×2 contingency table built by classifying every forecast–observation pair as wet or dry using a threshold of 1.0 mm. Each cell of the table has a name used in the formulas below.

             | Forecast Wet    | Forecast Dry
Observed Wet | Hit (a)         | Miss (b)
Observed Dry | False Alarm (c) | Correct Rejection (d)
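Building those four counts from paired precipitation values might look like the following sketch (whether the 1.0 mm threshold is inclusive is an assumption here, as are the names):

```python
def contingency_counts(forecast_mm, observed_mm, threshold=1.0):
    """Classify each forecast/observation pair as wet (>= threshold) or dry
    and tally hits (a), misses (b), false alarms (c), correct rejections (d)."""
    a = b = c = d = 0
    for f, o in zip(forecast_mm, observed_mm):
        fcst_wet, obs_wet = f >= threshold, o >= threshold
        if obs_wet and fcst_wet:
            a += 1  # hit
        elif obs_wet:
            b += 1  # miss
        elif fcst_wet:
            c += 1  # false alarm
        else:
            d += 1  # correct rejection
    return a, b, c, d

print(contingency_counts([2.0, 0.0, 3.0, 0.5], [1.5, 2.0, 0.0, 0.0]))  # (1, 1, 1, 1)
```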
POD

Probability of Detection

↑ higher
a / (a + b)

What fraction of observed wet events did the model correctly predict? Also called the hit rate. A POD of 1 means no wet event was missed.

Range: 0 – 1 · Perfect: 1

FAR

False Alarm Ratio

↓ lower
c / (a + c)

Of all the times the model forecast rain, what fraction was wrong? A FAR of 0 means every wet forecast verified.

Range: 0 – 1 · Perfect: 0

CSI

Critical Success Index

↑ higher
a / (a + b + c)

Also called the Threat Score. Combines hits, misses, and false alarms into one number. Does not credit correct dry-day rejections, so it is a stricter measure than accuracy.

Range: 0 – 1 · Perfect: 1

ETS

Equitable Threat Score

↑ higher
(a − aᵣ) / (a + b + c − aᵣ)

CSI corrected for hits expected by chance (aᵣ = (a+b)(a+c) / N). Scores near 0 indicate no skill above random; negative scores are below random.

Range: −⅓ – 1 · Perfect: 1

HSS

Heidke Skill Score

↑ higher
2(ad − bc) / [(a+b)(b+d) + (a+c)(c+d)]

Measures the fractional improvement of the forecast over a random forecast. Accounts for both wet and dry correct predictions. A score of 0 means no skill over random.

Range: −∞ – 1 · Perfect: 1

Freq Bias

Frequency Bias

= 1.0
(a + c) / (a + b)

Ratio of how often the model predicted rain to how often it actually rained. > 1 means the model over-forecasts precipitation (wet bias); < 1 means under-forecasting (dry bias).

Range: 0 – ∞ · Perfect: 1.0
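All six categorical scores follow directly from the four counts. A minimal sketch (no guards against zero denominators, which would need handling in practice):

```python
def categorical_scores(a, b, c, d):
    """Compute the six categorical scores from 2x2 contingency counts:
    a = hits, b = misses, c = false alarms, d = correct rejections."""
    n = a + b + c + d
    a_r = (a + b) * (a + c) / n  # hits expected by chance
    return {
        "POD": a / (a + b),                      # hit rate
        "FAR": c / (a + c),                      # wrong wet forecasts
        "CSI": a / (a + b + c),                  # threat score
        "ETS": (a - a_r) / (a + b + c - a_r),    # chance-corrected CSI
        "HSS": 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d)),
        "FreqBias": (a + c) / (a + b),           # forecast wet / observed wet
    }

scores = categorical_scores(a=40, b=10, c=20, d=30)
print(scores["POD"], scores["ETS"], scores["FreqBias"])  # 0.8 0.25 1.2
```

With these counts the model catches 80% of wet events but forecasts rain 1.2× as often as it occurs, a wet bias.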

Our Methodology

How we turn billions of data points into actionable rankings.

Data Ingestion

We ingest real-time METAR observations from thousands of stations. These are our "ground truth" reference points.

Verification Pairing

For every station we identify the surrounding model grid points and apply bilinear interpolation. Before comparison, we correct for systematic elevation differences between the station's true altitude and the model's terrain height using a standard lapse-rate adjustment. Pairs are then matched by valid time.
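The two numerical steps in that pairing can be sketched as follows. This is illustrative only: the 6.5 K/km standard-atmosphere lapse rate and all function names are assumptions, not the exact pipeline.

```python
def bilinear(v00, v10, v01, v11, fx, fy):
    """Interpolate within a grid cell from its four corner values;
    fx, fy are the station's fractional position in [0, 1]."""
    top = v00 * (1 - fx) + v10 * fx
    bottom = v01 * (1 - fx) + v11 * fx
    return top * (1 - fy) + bottom * fy

def lapse_rate_correct(temp_c, model_elev_m, station_elev_m, lapse=6.5e-3):
    """Adjust a model temperature to station elevation using a standard
    lapse rate (6.5 K per km): temperature rises moving down-slope."""
    return temp_c + (model_elev_m - station_elev_m) * lapse

# Interpolated model 2m temperature at a station 200 m below the model terrain:
t = bilinear(12.0, 13.0, 11.5, 12.5, fx=0.5, fy=0.5)  # 12.25
print(lapse_rate_correct(t, model_elev_m=800.0, station_elev_m=600.0))  # ≈ 13.55
```

Because the station sits below the model's terrain height, the raw interpolated value would verify with a spurious cold bias; the correction removes that systematic component before scoring.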

Global Station Network

ForecastRank collects real-time observations from about 4,000 weather stations globally. Observations are updated hourly from multiple public real-time sources, including NOAA. Model error statistics are updated daily as new observation and model forecast data are consolidated. Note that observations may contain errors and may be missing due to data outages.

4,000+ METAR Stations · Elevation-corrected Pairing · 1.0 mm Precip Threshold · Daily Accuracy Updates