How to work with Climate Adaptation Digital Twin data on Earth Data Hub. High resolution fields on a single level or surface: climatological analysis of temperature in Germany¶
This notebook provides guidance on how to access and use the https://cacheb.dcms.destine.eu/d1-climate-dt/ScenarioMIP-SSP3-7.0-IFS-NEMO-0001-high-sfc-v0.zarr dataset. Access to this dataset is restricted to authorized users and goes through the Data Cache Management service.
To access the data you need to instruct your tools (e.g. Xarray, Zarr...) to use the Data Cache Management service. This can be done by running the code snippets below.
First, make sure you have an account on the Destination Earth platform. Then run the following cell, filling in your Destination Earth username and password when asked:
%%capture cap
%run ../cacheb/cacheb-authentication.py
from pathlib import Path
with open(Path.home() / ".netrc", "a") as fp:
    fp.write(cap.stdout)
⚠ NOTE: the generated password is valid for a limited period of time and needs to be regenerated and reconfigured periodically by running the cells above.
What you will learn:¶
- how to access the dataset
- how to select and reduce the data
- how to plot the results
In this notebook we set two goals:
Our first goal is to plot the mean 2 metre temperature in January 2025 over Central Europe (Germany).
Our second goal is to compute the 2 metre temperature climatology (monthly means and standard deviations) in Berlin for the 2020-2035 reference period.
Working with EDH data¶
Datasets on EDH are typically very large and remotely hosted. Typical use implies a selection of the data followed by one or more reduction steps, performed in a local or distributed Dask environment.
The structure of a workflow that uses EDH data typically looks like this:
- data access
- data selection
- (optional) data reduction
- data download
- further operations and visualization
Xarray and Dask work together following a lazy principle. This means that when you access and manipulate a Zarr store, the data is not immediately downloaded and loaded into memory. Instead, Dask constructs a task graph that represents the operations to be performed.
A smart user will first reduce the amount of data that needs to be downloaded, and only then explicitly call compute() on it. Once the compute() operation is complete, the data is loaded into memory and available for subsequent fast processing.
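In code, the whole pattern condenses to a few lines. The sketch below is purely illustrative: ds is the dataset opened in section 1, and the selection bounds simply mirror the ones used later in this notebook.
# Minimal sketch of the lazy select-reduce-compute pattern (ds is opened in section 1 below)
subset = ds.t2m.sel(latitude=slice(47, 55), longitude=slice(5, 16))  # lazy: nothing downloaded
monthly_mean = subset.sel(time="2025-01").mean("time")               # still lazy
result = monthly_mean.compute()                                      # triggers download and computation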
1. Data access¶
To access the data, only the dataset metadata must be downloaded. Xarray does this automatically when you access a Zarr dataset:
import xarray as xr
ds = xr.open_dataset(
"https://cacheb.dcms.destine.eu/d1-climate-dt/ScenarioMIP-SSP3-7.0-IFS-NEMO-0001-high-sfc-v0.zarr",
chunks={},
engine="zarr",
storage_options={"client_kwargs": {"trust_env": True}},
)
ds
<xarray.Dataset> Size: 188TB
Dimensions:    (time: 175320, latitude: 4096, longitude: 8193)
Coordinates:
  * latitude   (latitude) float64 33kB -90.0 -89.96 -89.91 ... 89.91 89.96 90.0
  * longitude  (longitude) float64 66kB -180.0 -180.0 -179.9 ... 180.0 180.0
    step       timedelta64[ns] 8B ...
    surface    float64 8B ...
  * time       (time) datetime64[ns] 1MB 2020-01-01 ... 2039-12-31T23:00:00
Data variables:
    d2m        (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
    sd         (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
    ssr        (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
    str        (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
    t2m        (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
    tprate     (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
    u10        (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
    v10        (time, latitude, longitude) float32 24TB dask.array<chunksize=(48, 512, 512), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_edition:            2
    GRIB_subCentre:          1003
    history:                 2024-06-06T16:50 GRIB to CDM+CF via cfgrib-0.9.1...
    institution:             European Centre for Medium-Range Weather Forecasts
⚠ At this point, no data has been downloaded yet, nor loaded in memory.
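If you want to confirm this, you can inspect the object: the variables are backed by lazy Dask arrays, and only their metadata has been read. The few lines below are an optional check, using the chunk sizes visible in the repr above.
# These inspections touch only metadata: t2m is still a lazy dask array.
print(type(ds.t2m.data))                  # <class 'dask.array.core.Array'>
print(ds.t2m.chunksizes["time"][0],       # 48 hourly steps per chunk ...
      ds.t2m.chunksizes["latitude"][0],   # ... on 512 x 512 spatial tiles
      ds.t2m.chunksizes["longitude"][0])
print(f"{ds.t2m.nbytes / 1e12:.0f} TB uncompressed if fully loaded")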
2. Data selection¶
First, we perform a geographical selection corresponding to the Germany area. This greatly reduces the amount of data that will be downloaded from the DCMS. We also convert the temperature from K to °C. For the moment, this is a lazy operation.
xr.set_options(keep_attrs=True)
t2m = ds.t2m.astype("float32") - 273.15
t2m.attrs["units"] = "°C"
t2m_germany = t2m.sel(**{"latitude": slice(47, 55), "longitude": slice(5, 16)})
t2m_germany
<xarray.DataArray 't2m' (time: 175320, latitude: 182, longitude: 251)> Size: 32GB
dask.array<getitem, shape=(175320, 182, 251), dtype=float32, chunksize=(48, 182, 251), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 1kB 47.01 47.05 47.1 ... 54.88 54.92 54.97
  * longitude  (longitude) float64 2kB 5.01 5.054 5.098 ... 15.91 15.95 16.0
    step       timedelta64[ns] 8B ...
    surface    float64 8B ...
  * time       (time) datetime64[ns] 1MB 2020-01-01 ... 2039-12-31T23:00:00
Attributes: (12/19)
    GRIB_NV:                         0
    GRIB_cfName:                     air_temperature
    GRIB_cfVarName:                  t2m
    GRIB_dataType:                   fc
    GRIB_gridDefinitionDescription:  150
    GRIB_gridType:                   healpix
    ...                              ...
    GRIB_typeOfLevel:                heightAboveGround
    GRIB_units:                      K
    last_restart_dim_updated:        175320
    long_name:                       2 metre temperature
    standard_name:                   air_temperature
    units:                           °C
Second, we further select January 2025. This is again a lazy operation:
t2m_germany_january_2025 = t2m_germany.sel(time="2025-01")
t2m_germany_january_2025
<xarray.DataArray 't2m' (time: 744, latitude: 182, longitude: 251)> Size: 136MB
dask.array<getitem, shape=(744, 182, 251), dtype=float32, chunksize=(48, 182, 251), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 1kB 47.01 47.05 47.1 ... 54.88 54.92 54.97
  * longitude  (longitude) float64 2kB 5.01 5.054 5.098 ... 15.91 15.95 16.0
    step       timedelta64[ns] 8B ...
    surface    float64 8B ...
  * time       (time) datetime64[ns] 6kB 2025-01-01 ... 2025-01-31T23:00:00
Attributes: (12/19)
    GRIB_NV:                         0
    GRIB_cfName:                     air_temperature
    GRIB_cfVarName:                  t2m
    GRIB_dataType:                   fc
    GRIB_gridDefinitionDescription:  150
    GRIB_gridType:                   healpix
    ...                              ...
    GRIB_typeOfLevel:                heightAboveGround
    GRIB_units:                      K
    last_restart_dim_updated:        175320
    long_name:                       2 metre temperature
    standard_name:                   air_temperature
    units:                           °C
Due to the chunked structure of the DataArray, xarray must download every chunk that includes a portion of the selected data.
To estimate the size of the download, we can use the costing.py module. This must be done before we apply any reduction operation.
import costing
costing.estimate_download_size(t2m, t2m_germany_january_2025)
estimated_needed_chunks: 16
estimated_memory_size: 0.805 GB
estimated_download_size: 0.081 GB
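These figures are consistent with the chunking shown in the repr: January 2025 spans 744 hourly steps, i.e. 16 chunks of 48 steps along the time dimension, and each uncompressed chunk holds 48 × 512 × 512 float32 values (about 50 MB); the download estimate is smaller because the chunks are compressed in the store. A quick sanity check, assuming (consistently with the 16-chunk figure) that the German box fits within a single 512 × 512 spatial chunk:
import math
# Back-of-the-envelope check of the estimate above, using the chunk shape (48, 512, 512) from the repr.
time_chunks = math.ceil(744 / 48)                                # 16 chunks along time
chunk_bytes = 48 * 512 * 512 * 4                                 # bytes per uncompressed float32 chunk
print(time_chunks, f"{time_chunks * chunk_bytes / 1e9:.3f} GB")  # 16, 0.805 GB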
3. Data reduction¶
Before downloading the data, we can perform further lazy operations. Here we compute the 2 metre temperature monthly average:
t2m_germany_january_2025_monthly_mean = t2m_germany_january_2025.mean(dim="time")
t2m_germany_january_2025_monthly_mean
<xarray.DataArray 't2m' (latitude: 182, longitude: 251)> Size: 183kB
dask.array<mean_agg-aggregate, shape=(182, 251), dtype=float32, chunksize=(182, 251), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 1kB 47.01 47.05 47.1 ... 54.88 54.92 54.97
  * longitude  (longitude) float64 2kB 5.01 5.054 5.098 ... 15.91 15.95 16.0
    step       timedelta64[ns] 8B ...
    surface    float64 8B ...
Attributes: (12/19)
    GRIB_NV:                         0
    GRIB_cfName:                     air_temperature
    GRIB_cfVarName:                  t2m
    GRIB_dataType:                   fc
    GRIB_gridDefinitionDescription:  150
    GRIB_gridType:                   healpix
    ...                              ...
    GRIB_typeOfLevel:                heightAboveGround
    GRIB_units:                      K
    last_restart_dim_updated:        175320
    long_name:                       2 metre temperature
    standard_name:                   air_temperature
    units:                           °C
4. Data download¶
This is the phase where we explicitly trigger the download of the data. To do so, we call compute() on the previously averaged temperature. The result will be small enough to easily fit into memory. Remember to assign the return value of compute() to a new variable, so that the data is kept in memory.
We can measure the time it takes:
%%time
t2m_germany_january_2025_monthly_mean_computed = t2m_germany_january_2025_monthly_mean.compute()
CPU times: user 3.55 s, sys: 1.92 s, total: 5.48 s
Wall time: 3.63 s
The data was very small, so this didn't take long!
5. Visualization¶
Finally, we can plot the monthly mean of January 2025 on a map:
import display
import matplotlib.pyplot as plt
from cartopy import crs
display.map(
t2m_germany_january_2025_monthly_mean_computed,
projection=crs.Miller(),
vmax=None,
cmap="YlOrRd",
title="Mean Surface Temperature, Jan 2025"
);
2 metre temperature climatology in Berlin, 2020-2035¶
The power of EDH is best shown when working with time series. We will now show how fast it is to compute the 2 metre temperature climatology (monthly mean and standard deviation) in Berlin over the 2020-2035 reference period.
With legacy data distribution systems you would need to download the entire global temperature field for the reference period, extract the Berlin data and do your calculations. Alternatively, the data provider might do a crop-and-merge operation for you, but chances are that this is very slow (internally the data provider still accesses the entire global field, you just don't see it!). Thanks to Earth Data Hub this is no longer needed: you only download the relevant chunks.
Here, we select the closest data to Berlin:
t2m_Berlin_2020_2035 = t2m.sel(**{"latitude": 52.5, "longitude": 13.4}, method="nearest").sel(time=slice("2020", "2035"))
t2m_Berlin_2020_2035
<xarray.DataArray 't2m' (time: 140256)> Size: 561kB
dask.array<getitem, shape=(140256,), dtype=float32, chunksize=(48,), chunktype=numpy.ndarray>
Coordinates:
    latitude   float64 8B 52.51
    longitude  float64 8B 13.4
    step       timedelta64[ns] 8B ...
    surface    float64 8B ...
  * time       (time) datetime64[ns] 1MB 2020-01-01 ... 2035-12-31T23:00:00
Attributes: (12/19)
    GRIB_NV:                         0
    GRIB_cfName:                     air_temperature
    GRIB_cfVarName:                  t2m
    GRIB_dataType:                   fc
    GRIB_gridDefinitionDescription:  150
    GRIB_gridType:                   healpix
    ...                              ...
    GRIB_typeOfLevel:                heightAboveGround
    GRIB_units:                      K
    last_restart_dim_updated:        175320
    long_name:                       2 metre temperature
    standard_name:                   air_temperature
    units:                           °C
This is already small enough to be computed. First, we estimate the cost of the download with the costing.py module:
costing.estimate_download_size(t2m, t2m_Berlin_2020_2035)
estimated_needed_chunks: 2922
estimated_memory_size: 147.069 GB
estimated_download_size: 14.707 GB
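The contrast between the tiny in-memory size of the time series (about half a megabyte) and the large download is explained by the chunking: every 48-hour block of the series lives in a different 48 × 512 × 512 chunk of the store, so thousands of chunks must be fetched. The figures above follow directly from the numbers in the repr:
# Why a ~0.5 MB time series requires fetching ~147 GB of (uncompressed) chunks:
hours = 140256                                         # hourly steps in 2020-2035, from the repr above
chunks = hours // 48                                   # one chunk every 48 hours -> 2922 chunks
chunk_bytes = 48 * 512 * 512 * 4                       # bytes per uncompressed float32 chunk
print(chunks, f"{chunks * chunk_bytes / 1e9:.3f} GB")  # 2922, 147.069 GB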
We explicitly trigger the download of the data. Remember to assign the return value of compute() to a new variable, so that the data is kept in memory:
%%time
t2m_Berlin_2020_2035_computed = t2m_Berlin_2020_2035.compute()
CPU times: user 6min 39s, sys: 4min 12s, total: 10min 52s
Wall time: 4min 14s
Now that the data is loaded in memory we can easily compute the climatology:
t2m_Berlin_climatology_mean = t2m_Berlin_2020_2035_computed.groupby("time.month").mean(dim="time")
t2m_Berlin_climatology_std = t2m_Berlin_2020_2035_computed.groupby("time.month").std(dim="time")
We can finally plot the climatology in Berlin for the 2020-2035 reference period:
plt.figure(figsize=(10, 5))
t2m_Berlin_climatology_mean.plot(label="Mean", color="#3498db")
plt.errorbar(
t2m_Berlin_climatology_mean.month,
t2m_Berlin_climatology_mean,
yerr=t2m_Berlin_climatology_std,
fmt="o",
label="Standard Deviation",
color="#a9a9a9"
)
plt.title("Surface Temperature climatology in Berlin (DE), 2020-2035")
plt.xticks(t2m_Berlin_climatology_mean.month)
plt.xlabel("Month")
plt.ylabel("Surface Temperature [°C]")
plt.legend()
plt.grid(alpha=0.3)
plt.show()