This document explains (step by step) how raw external datasets are ingested, processed, and merged into the weekly tables (europe.csv, austria.csv) consumed by the modelling code and the Flask dashboard.
# generate region centroids and GeoJSON
python scripts/build_geojson.py
# pull every external feed, run all transforms,
# and write europe.csv / austria.csv under data/processed/
python scripts/build_csv.py
Note: Only the CORDEX NetCDF files must be downloaded manually (see section 2).
data/
├─ era5_land/ # monthly ERA5-Land .nc (21 GB for 2000-2025)
├─ rcp45/ # CORDEX RCP4.5 .nc (downloaded via wget)
├─ rcp85/ # CORDEX RCP8.5 .nc (downloaded via wget)
├─ europe.csv # weekly feature table (country-level)
└─ austria.csv # weekly feature table (NUTS-3 level)
All scripts write inside data/; nothing is stored outside the repository.
Source | Script(s) | Raw data | Coverage | Notes |
---|---|---|---|---|
ERA5-Land | scripts/era5.py | hourly NetCDF (t2m, tp) | 2000 - present | 0.25° grid; ~21 GB so far |
CORDEX | data/rcp45/wget.sh, data/rcp85/wget.sh, ccee/cordex.py | monthly NetCDF (tas) | 1971-2100 | EUR-11 domain (0.11°) |
Eurostat API | ccee/eurostat.py (called by build_csv.py) | JSON → CSV | varies | Population, density, weekly deaths |
EEA API | ccee/eea.py (called by build_csv.py) | hourly gridded CSV | 2013 - present | O3, NOx, PM10 |
Source: Copernicus Climate Data Store
To authorize the Python code to access the Climate Data Store on Windows, follow https://cds.climate.copernicus.eu/how-to-api (required only once).
The credentials file, whose name starts with a dot, can be created using Notepad: File > Save as > Type: All files > File name: .cdsapirc
Once you have completed the steps above, the ERA5 data can be downloaded using the functions inside ccee/era5.py. This script downloads one NetCDF per month into data/era5_land/. It is recommended that you execute this script before the first run of build_csv.py to ensure that all required data is available.
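Before the first run of build_csv.py it can help to check which monthly files are still missing from data/era5_land/. The following is a minimal sketch of such a check, not part of the repository: the function `missing_months` and the file-name pattern `era5_land_{year}_{month:02d}.nc` are hypothetical and must be adapted to the names that ccee/era5.py actually writes.

```python
from pathlib import Path


def missing_months(folder, start_year, end_year,
                   pattern="era5_land_{year}_{month:02d}.nc"):
    """Return the (year, month) pairs that have no file in `folder`.

    The default file-name pattern is an assumption made for this
    example; it is not taken from ccee/era5.py.
    """
    folder = Path(folder)
    missing = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            name = pattern.format(year=year, month=month)
            if not (folder / name).exists():
                missing.append((year, month))
    return missing
```

Calling `missing_months("data/era5_land", 2000, 2024)` returns an empty list once every month in that range has been downloaded.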
ccee/era5.py (triggered by build_csv.py) adds the following columns to the weekly tables:
Column | Definition | Units |
---|---|---|
temp_era5_q05 | 5th percentile of hourly temperature within the week | °C |
temp_era5_q50 | Median of hourly temperature within the week | °C |
temp_era5_q95 | 95th percentile of hourly temperature within the week | °C |
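The three quantile columns above can be illustrated with pandas. This is a sketch under stated assumptions, not the code in ccee/era5.py: it assumes the hourly temperatures have already been averaged per region into a long table with the hypothetical columns `time` and `t2m_c` (°C), and that `week` means the ISO week.

```python
import numpy as np
import pandas as pd


def weekly_quantiles(hourly: pd.DataFrame) -> pd.DataFrame:
    """Collapse hourly temperatures to one row per region and ISO week.

    Expects the columns NUTS_ID, time (datetime64) and t2m_c (°C);
    the last two names are assumptions made for this example.
    """
    iso = hourly["time"].dt.isocalendar()
    keyed = hourly.assign(
        year=iso["year"].astype(int),
        week=iso["week"].astype(int),
    )
    # One quantile triple per (NUTS_ID, year, week) group.
    q = keyed.groupby(["NUTS_ID", "year", "week"])["t2m_c"].quantile([0.05, 0.50, 0.95])
    wide = q.unstack()  # the quantile level becomes the columns 0.05, 0.5, 0.95
    wide.columns = ["temp_era5_q05", "temp_era5_q50", "temp_era5_q95"]
    return wide.reset_index()


# Synthetic example: one region, one full ISO week of hourly values.
times = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(168), unit="h")
demo = pd.DataFrame({"NUTS_ID": "AT130", "time": times,
                     "t2m_c": np.linspace(-5.0, 5.0, 168)})
print(weekly_quantiles(demo))
```

Using the ISO year together with the ISO week keeps days around New Year in the correct week, which matters when the result is later joined with weekly mortality counts.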
Source: ESGF Data Browser (LiU Node)
Before running scripts/build_csv.py you will need to have the CORDEX data downloaded. This is done via a WGET script that you can generate from the ESGF Data Browser.
Official tutorial link: https://cordex.org/wp-content/uploads/2023/08/How-to-download-CORDEX-data-from-the-ESGF.pdf
Step-by-step:
Select a dataset, e.g.:
cordex.output.EUR-11.SMHI.MPI-M-MPI-ESM-LR.rcp85.r2i1p1.RCA4.v1.mon.tas
bash ./data/wget-YYYYMMDDHHMMSS.sh -H
Tip: You will need a Linux-based system (e.g., Ubuntu) to execute WGET scripts.
ccee/cordex.py (triggered by build_csv.py) adds the following columns to the weekly tables:
Column | Definition | Units |
---|---|---|
temp_rcp45 | Median weekly temperature for RCP 4.5 scenario | °C |
temp_rcp85 | Median weekly temperature for RCP 8.5 scenario | °C |
Source: Eurostat
The original Eurostat data is pulled via the API. The raw data is as follows:
Variable | Eurostat ID | Units | Time step | Region level | Coverage |
---|---|---|---|---|---|
population_density | demo_r_d3dens | people / km2 | yearly | NUTS-3 | 2000 - present |
population | tps00001 (country) + demo_r_pjanaggr3 (NUTS-3) | people | yearly | NUTS-3 / country | 2014 - present |
mortality | demo_r_mwk3_t | deaths | weekly | NUTS-3 / country | 2000 - present |
Missing population (pre-2014) is imputed as population_density $\times$ area_km2. The area of each region is computed with the geopandas library from the polygons in the regions.geojson file.
mortality_rate is then mortality / population $\times$ 100,000 (deaths per 100,000 people).
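The imputation and the rate formula above can be written in a few lines of pandas. This is a minimal sketch, not the code in ccee/eurostat.py: the function name `add_population_and_rate` is hypothetical, and the input is assumed to already hold the columns population, population_density, area_km2 and mortality in one table.

```python
import pandas as pd


def add_population_and_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing population from density x area, then derive mortality_rate.

    Expects the columns population (people, may contain NaN),
    population_density (people / km2), area_km2 and mortality (weekly deaths).
    """
    out = df.copy()
    # Imputation: population = population_density * area_km2 where missing.
    estimate = out["population_density"] * out["area_km2"]
    out["population"] = out["population"].fillna(estimate)
    # Deaths per 100,000 people.
    out["mortality_rate"] = out["mortality"] / out["population"] * 100_000
    return out
```

For example, a region with a density of 100 people / km2, an area of 1,000 km2 and 20 weekly deaths gets an imputed population of 100,000 and a mortality_rate of 20.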
The output of ccee/eurostat.py (triggered by build_csv.py) adds the following columns to the weekly tables:
Column | Definition | Units |
---|---|---|
population | Total population in region | people |
population_density | People per square kilometer | people / km2 |
mortality | Total deaths in region per week | deaths |
mortality_rate | Deaths per 100,000 people per week | deaths / 100,000 people |
Source: European Air Quality Portal
Hourly measurements are averaged over each region and then over each week. The spatial coverage is variable because the EEA provides data per station: each station is assigned to a region, and the values of all stations within that region are averaged.
ccee/eea.py (triggered by build_csv.py) adds the following columns to the weekly tables:
Column | Definition | Units |
---|---|---|
O3 | Ozone concentration in the air | $\mu \text{g}\ m^{-3}$ |
NOx | Nitrogen oxides concentration in the air | $\mu \text{g}\ m^{-3}$ |
pm10 | Particulate matter concentration in the air | $\mu \text{g}\ m^{-3}$ |
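The two-stage averaging described for the air-quality data (first over the stations of a region, then over the hours of a week) might look as follows. This is a sketch under assumptions, not the code in ccee/eea.py: the function name `weekly_region_means` and the input columns `station_id` and `time` are hypothetical, each station is assumed to be already labelled with its NUTS_ID, and `week` is taken to be the ISO week.

```python
import pandas as pd


def weekly_region_means(obs: pd.DataFrame,
                        pollutants=("O3", "NOx", "pm10")) -> pd.DataFrame:
    """Average station readings per region and hour, then per ISO week.

    Expects the columns station_id, NUTS_ID, time (datetime64) and one
    column per pollutant; the input layout is an assumption of this example.
    """
    cols = list(pollutants)
    # Step 1: mean over all stations of a region at each timestamp.
    regional = obs.groupby(["NUTS_ID", "time"], as_index=False)[cols].mean()
    # Step 2: mean over all hours that fall into the same ISO week.
    iso = regional["time"].dt.isocalendar()
    regional["year"] = iso["year"].astype(int)
    regional["week"] = iso["week"].astype(int)
    return regional.groupby(["NUTS_ID", "year", "week"], as_index=False)[cols].mean()
```

Averaging in two steps gives every hour the same weight regardless of how many stations reported in that hour; pooling all readings of a week into a single mean would weight hours with many reporting stations more heavily.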
scripts/build_csv.py
flowchart LR
subgraph "Data sources"
ERA5[ERA5 .nc]
CORDEX[CORDEX .nc]
Eurostat[Eurostat API]
EEA[EEA API]
end
BuildCSV["build_csv.py"]
Europe[europe.csv]
Austria[austria.csv]
ERA5 --> BuildCSV
CORDEX --> BuildCSV
Eurostat --> BuildCSV
EEA --> BuildCSV
BuildCSV --> Europe
BuildCSV --> Austria
Read regions.geojson.
Call each ccee/* module to obtain a weekly DataFrame.
Merge the DataFrames on the keys (NUTS_ID, year, week).
File | Level | Rows (2025) | Size |
---|---|---|---|
data/processed/csv/europe.csv | country | ~350 000 | 10 MB |
data/processed/csv/austria.csv | NUTS-3 | ~200 000 | 7 MB |
Group | Columns |
---|---|
Keys | NUTS_ID, year, week |
ERA5 quantiles | temp_era5_q05, temp_era5_q50, temp_era5_q95 |
CORDEX | temp_rcp45, temp_rcp85 |
Population / mortality | population, population_density, mortality, mortality_rate |
Air-quality | O3, NOx, pm10 |
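Because every column group shares the keys NUTS_ID, year and week, the per-source weekly DataFrames can be combined with a chain of joins. This is a minimal sketch of that idea, not the code in scripts/build_csv.py: the function name `merge_weekly` is hypothetical, and the outer join is an assumption chosen so that weeks present in only one source are kept with missing values elsewhere.

```python
from functools import reduce

import pandas as pd

KEYS = ["NUTS_ID", "year", "week"]


def merge_weekly(frames):
    """Join a list of weekly DataFrames on the shared key columns.

    An outer join is used here as an assumption; the real pipeline
    may use a different join type.
    """
    return reduce(lambda left, right: left.merge(right, on=KEYS, how="outer"), frames)
```

With an outer join, a source that starts later (for example air quality from 2013) leaves NaN in its columns for earlier weeks instead of dropping those rows.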
Task | Frequency |
---|---|
Download new ERA5 month | monthly (cron) |
Refresh Eurostat & EEA API pulls | yearly |
Re-run build_csv.py | monthly |