How ForecastRank Works
A deep dive into the models, metrics, and data pipeline behind objective weather forecast verification.
Ranking System
ForecastRank ranks models using a composite skill score — a single number that measures how much better (or worse) a model performs compared to a naive persistence forecast.
What is Composite Skill?
The composite skill score is computed by aggregating error metrics (RMSE, MAE) across all lead hours and stations within a time period, then comparing against a persistence baseline. A score above 0 means the model beats persistence; below 0 means it does not.
Score > 0. The model outperforms a naive persistence forecast.
Score > 0.5. Consistently strong improvement over persistence.
Score < 0. The model performs worse than simply repeating the last observation.
Persistence baseline: A persistence forecast assumes conditions at the current observation time will continue unchanged. It is the simplest possible forecast and sets the floor for skill — any useful model must beat it.
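To make the comparison concrete, here is a minimal sketch in Python. It assumes skill is defined as 1 − RMSE_model / RMSE_persistence; the exact aggregation ForecastRank applies across lead hours and stations is not specified here, so the function names and the single-station values are illustrative only.

```python
import numpy as np

def rmse(forecast: np.ndarray, observed: np.ndarray) -> float:
    """Root mean square error of paired forecast/observation values."""
    return float(np.sqrt(np.mean((forecast - observed) ** 2)))

# One station, three lead hours (illustrative values).
obs_at_init = 21.4                          # degC observed at initialization time
observed    = np.array([21.0, 19.5, 18.2])  # verifying obs at +1h, +2h, +3h
model       = np.array([20.8, 19.9, 18.0])  # model forecasts at the same hours

# Persistence: repeat the initialization-time observation at every lead hour.
persistence = np.full_like(observed, obs_at_init)

# Skill > 0 means the model beat persistence; < 0 means it did not.
skill = 1.0 - rmse(model, observed) / rmse(persistence, observed)
```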
Variable Groups
Rankings are broken down across five meteorological variable groups. Each group aggregates the relevant observed variables into a single composite skill score for that dimension.
Temperature
Surface air temperature (2m). Driven by solar radiation, advection, and boundary-layer mixing.
Dewpoint
2m dewpoint — a direct measure of low-level moisture content. Critical for comfort indices and convective initiation.
Wind Speed
Surface wind speed. Errors here compound significantly for energy and aviation applications.
Precip Amount
Quantitative precipitation — how much rain or snow fell. One of the hardest variables to forecast accurately.
Precip Frequency
Categorical detection of precipitation events (wet/dry). Scored separately from amount to isolate timing errors.
Reading the Breakdown Table
The variable group breakdown table displays composite skill scores for each model × variable combination. Scores are color-coded by magnitude.
| Model | Temperature | Dewpoint | Wind Speed | Precip Amount | Precip Freq |
|---|---|---|---|---|---|
| ECMWF IFS | 0.721 | 0.312 | 0.198 | −0.042 | 0.001 |
| HRRR | 0.481 | 0.390 | 0.205 | — | −0.612 |
Weather Models
We verify the world's most advanced numerical weather prediction (NWP) models.
HRRR
High-Resolution Rapid Refresh
A NOAA model providing hourly updates, specialized in short-term mesoscale events like thunderstorms and severe weather.
NBM
National Blend of Models
A calibrated blend of GFS, HRRR, ECMWF and others designed to reduce systematic biases across CONUS, Hawaii, and Guam.
GFS
Global Forecast System
NCEP's primary global model. Covers dozens of atmospheric and land-soil variables from surface temperature and winds to ozone concentration.
ECMWF IFS
Integrated Forecasting System
The public-tier version of ECMWF's operational system — widely regarded as the gold standard for global medium-range forecasting. A higher-resolution paid tier exists at 9 km.
ECMWF AIFS
Artificial Intelligence Forecasting System
ECMWF's data-driven machine-learning model, which uses deep learning to predict atmospheric variables with accuracy competitive with the IFS at a fraction of the compute cost.
GraphCast GFS
NCEP ML Weather Prediction
An experimental NCEP system built on Google DeepMind's pre-trained GraphCast architecture for medium-range global forecasts.
Verification Metrics
Industry-standard metrics used to evaluate model performance.
RMSE (Root Mean Square Error)
`sqrt(mean((Forecast - Observation)²))`
Measures the average magnitude of error, penalizing larger deviations more heavily than MAE. Lower is better.
MAE (Mean Absolute Error)
`mean(abs(Forecast - Observation))`
The linear average of all absolute errors. It provides a straightforward measure of how much, on average, the forecast differs from the actual value.
Bias (Mean Error)
`mean(Forecast - Observation)`
Indicates systematic over-forecasting (positive) or under-forecasting (negative). A bias of 0 is ideal.
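These three continuous metrics are straightforward to compute from paired arrays. A minimal sketch of the formulas above (the function name `error_metrics` is ours, not part of ForecastRank):

```python
import numpy as np

def error_metrics(forecast: np.ndarray, observed: np.ndarray) -> dict:
    """RMSE, MAE, and bias for paired forecast/observation values."""
    err = forecast - observed
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),  # penalizes large misses
        "mae":  float(np.mean(np.abs(err))),        # linear average error size
        "bias": float(np.mean(err)),                # + = over-forecast, - = under
    }
```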
Precipitation Categorical Scores
All six metrics are derived from a 2×2 contingency table built by classifying every forecast–observation pair as wet or dry using a threshold of 1.0 mm. Each cell of the table has a name used in the formulas below.
|  | Forecast Wet | Forecast Dry |
|---|---|---|
| Observed Wet | Hit (a) | Miss (b) |
| Observed Dry | False Alarm (c) | Correct Rejection (d) |
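Building the table is a matter of thresholding both sides of each pair and counting the four outcomes. A sketch, assuming a wet event means precipitation of at least 1.0 mm (whether the cutoff is inclusive is our assumption, and `contingency_counts` is our name):

```python
import numpy as np

WET_THRESHOLD_MM = 1.0  # wet/dry cutoff described above

def contingency_counts(forecast_mm, observed_mm, threshold=WET_THRESHOLD_MM):
    """Count hits (a), misses (b), false alarms (c), correct rejections (d)."""
    fc_wet = np.asarray(forecast_mm) >= threshold
    ob_wet = np.asarray(observed_mm) >= threshold
    a = int(np.sum(fc_wet & ob_wet))    # forecast wet, observed wet
    b = int(np.sum(~fc_wet & ob_wet))   # forecast dry, observed wet
    c = int(np.sum(fc_wet & ~ob_wet))   # forecast wet, observed dry
    d = int(np.sum(~fc_wet & ~ob_wet))  # forecast dry, observed dry
    return a, b, c, d
```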
Probability of Detection
`a / (a + b)`
What fraction of observed wet events did the model correctly predict? Also called the hit rate. A POD of 1 means no wet event was missed.
False Alarm Ratio
`c / (a + c)`
Of all the times the model forecast rain, what fraction was wrong? A FAR of 0 means every wet forecast verified.
Critical Success Index
`a / (a + b + c)`
Also called the Threat Score. Combines hits, misses, and false alarms into one number. Does not credit correct dry-day rejections, so it is a stricter measure than accuracy.
Equitable Threat Score
`(a − aᵣ) / (a + b + c − aᵣ)`
CSI corrected for hits expected by chance (aᵣ = (a+b)(a+c) / N). Scores near 0 indicate no skill above random; negative scores are below random.
Heidke Skill Score
`2(ad − bc) / [(a+b)(b+d) + (a+c)(c+d)]`
Measures the fractional improvement of the forecast over a random forecast. Accounts for both wet and dry correct predictions. A score of 0 means no skill over random.
Frequency Bias
`(a + c) / (a + b)`
Ratio of how often the model predicted rain to how often it actually rained. > 1 means the model over-forecasts precipitation (wet bias); < 1 means under-forecasting (dry bias).
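All six scores then follow directly from the four contingency counts. A sketch transcribing the formulas above (`categorical_scores` is our name, and it assumes non-degenerate counts; production code would guard against zero denominators):

```python
def categorical_scores(a: int, b: int, c: int, d: int) -> dict:
    """Six categorical scores from contingency counts (a=hits, b=misses,
    c=false alarms, d=correct rejections). Assumes no denominator is zero."""
    n = a + b + c + d
    a_r = (a + b) * (a + c) / n  # hits expected by chance
    return {
        "pod": a / (a + b),                    # probability of detection
        "far": c / (a + c),                    # false alarm ratio
        "csi": a / (a + b + c),                # critical success index
        "ets": (a - a_r) / (a + b + c - a_r),  # equitable threat score
        "hss": 2 * (a * d - b * c)
               / ((a + b) * (b + d) + (a + c) * (c + d)),  # Heidke skill score
        "freq_bias": (a + c) / (a + b),        # >1 wet bias, <1 dry bias
    }
```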
Our Methodology
How we turn billions of data points into actionable rankings.
Data Ingestion
We ingest real-time METAR observations from thousands of stations. These are our "ground truth" reference points.
Verification Pairing
For every station we identify the surrounding model grid points and apply bilinear interpolation. Before comparison, we correct for systematic elevation differences between the station's true altitude and the model's terrain height using a standard lapse-rate adjustment. Pairs are then matched by valid time.
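The sketch below illustrates those two steps in order. The unit-cell parameterization of the bilinear weights, the 6.5 K/km standard-atmosphere lapse rate, and the function names are our assumptions; ForecastRank's exact implementation may differ.

```python
def bilinear(fx: float, fy: float,
             q11: float, q21: float, q12: float, q22: float) -> float:
    """Interpolate a model field to a station at fractional position (fx, fy),
    each in [0, 1], within the cell formed by its four surrounding grid points."""
    return (q11 * (1 - fx) * (1 - fy) + q21 * fx * (1 - fy)
            + q12 * (1 - fx) * fy + q22 * fx * fy)

LAPSE_RATE_K_PER_M = 0.0065  # standard-atmosphere lapse rate (assumption)

def adjust_to_station_elevation(t_model_c: float, model_terrain_m: float,
                                station_elev_m: float) -> float:
    """Correct an interpolated temperature for the difference between the
    model's terrain height and the station's true altitude."""
    return t_model_c + LAPSE_RATE_K_PER_M * (model_terrain_m - station_elev_m)
```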
Global Station Network
ForecastRank collects real-time observations from about 4,000 weather stations globally. Observations are updated hourly from multiple public real-time sources, including NOAA. Model error statistics are updated daily as new observation and model forecast data are consolidated. Note that observations may contain errors and may be missing due to data outages.