How to work with Hybrid gridded demographic data for the world, 1950-2020, on Earth Data Hub¶
Earth Data Hub (EDH) provides an innovative access to earth related data. This notebook will provide you guidance on how to access and use the https://data.earthdatahub.destine.eu/derived-GPWv4-Histsoc/demographics-hybrid-1950-2020-15-min-v0
dataset on Earth Data Hub.
This dataset combines the NASA SEDAC Gridded Population of the World version 4 (GPWv4) with the ISIMIP Histsoc gridded population data and the United Nations World Population Program (WPP) demographic modelling data. The data is supplied in 5-years population bands with 0.25° spatial resolution. For pre-2000 population data, the ISIMIP Histsoc data was upscaled from it's native 0.5˚ resolution to 0.25° resolution. The dataset grid is explicitly designed to match with the one of ERA5 single levels climate reanalysis dataset.
Goal of this tutorial¶
In this tutorial our goal is to compute and visualize the European population distribution in 2020.
What you will learn:¶
- how to access the data
- select and reduce the data
- plot the results
In order to access datasets on Earth Data Hub you need to instruct your tools (xarray, Zarr, etc.) to use EDH personal access token when downloading the data.
To obtain a personal access token you first need to register to the Destination Earth platform platform. Then, you can go to Earth Data Hub account settings where you can find your default personal access token or create others. After retrieving your personal access token, please cut and paste it below: ⤵
PAT = "your_personal_access_token"
# e.g. PAT="edh_pat_44bbb7e9192a4c6bb47ddf07d07564eee5d17de8dfc48f7118f88e3bc4a4157f8fe2403f5aa0a2d53441b6922ea9a33a"
Working with EDH data¶
Datasets on EDH are often very large and remotely hosted. Typical use imply a selection of the data followed by one or more reduction steps to be performed in a local or distributed Dask environment.
The structure of a workflow that uses EDH data tipically looks like this:
- data access
- data selection
- (optional) data reduction
- data download
- further operations and visualization
Xarray and Dask work together following a lazy principle. This means that when you access and manipulate a Zarr store the data is in not immediately downloaded and loaded in memory. Instead, Dask constructs a task graph that represents the operations to be performed.
A smart user will first reduce the amount of data that needs to be downloaded and explicitly call compute()
on it. Once the compute()
operation is complete the data is loaded into memory and available for subsequent fast processing.
1. Data access¶
To preview the data, only the dataset metadata must be downloaded. Xarray does this automatically when you access a Zarr dataset:
import xarray as xr
ds = xr.open_dataset(
f"https://edh:{PAT}@data.earthdatahub.destine.eu/derived-GPWv4-Histsoc/demographics-hybrid-1950-2020-15-min-v0",
storage_options={"client_kwargs":{"trust_env":True}},
chunks={},
engine="zarr",
)
ds
<xarray.Dataset> Size: 4GB Dimensions: (age_band_lower_bound: 14, latitude: 720, longitude: 1440, year: 71) Coordinates: * age_band_lower_bound (age_band_lower_bound) int64 112B 0 5 10 ... 55 60 65 * latitude (latitude) float64 6kB 90.0 89.75 ... -89.5 -89.75 * longitude (longitude) float64 12kB 0.0 0.25 0.5 ... 359.5 359.8 * year (year) int64 568B 1950 1951 1952 ... 2018 2019 2020 Data variables: demographic_totals (latitude, longitude, age_band_lower_bound, year) float32 4GB dask.array<chunksize=(64, 64, 14, 71), meta=np.ndarray>
The longitude of the dataset is set from 0 to 360 degrees, which makes it uncomfortable to deal with European data when using xarray:
ds.demographic_totals.sel(age_band_lower_bound=0, year=2020).plot(vmax=50_000, cmap="YlGnBu_r")
<matplotlib.collections.QuadMesh at 0x7f745ec685d0>
We can roll the longitude to a -180 to 180 extent:
xr.set_options(keep_attrs=True)
ds = ds.assign_coords(longitude=(((ds.longitude + 180) % 360) - 180))
ds = ds.roll(longitude=int(len(ds.longitude) / 2), roll_coords=True)
ds
<xarray.Dataset> Size: 4GB Dimensions: (age_band_lower_bound: 14, latitude: 720, longitude: 1440, year: 71) Coordinates: * age_band_lower_bound (age_band_lower_bound) int64 112B 0 5 10 ... 55 60 65 * latitude (latitude) float64 6kB 90.0 89.75 ... -89.5 -89.75 * year (year) int64 568B 1950 1951 1952 ... 2018 2019 2020 * longitude (longitude) float64 12kB -180.0 -179.8 ... 179.5 179.8 Data variables: demographic_totals (latitude, longitude, age_band_lower_bound, year) float32 4GB dask.array<chunksize=(64, 64, 14, 71), meta=np.ndarray>
ds.demographic_totals.sel(age_band_lower_bound=0, year=2020).plot(vmax=50_000, cmap="YlGnBu_r")
<matplotlib.collections.QuadMesh at 0x7f745c1cb350>
2. Data selection¶
First, we perform a geographical selection corresponding to the European area:
europe_population = ds.demographic_totals.sel(latitude=slice(70, 35), longitude=slice(-10,41))
Next, we select year 2020
europe_population_2020 = europe_population.sel(year = 2020)
europe_population_2020
<xarray.DataArray 'demographic_totals' (latitude: 141, longitude: 205, age_band_lower_bound: 14)> Size: 2MB dask.array<getitem, shape=(141, 205, 14), dtype=float32, chunksize=(64, 64, 14), chunktype=numpy.ndarray> Coordinates: * age_band_lower_bound (age_band_lower_bound) int64 112B 0 5 10 ... 55 60 65 * latitude (latitude) float64 1kB 70.0 69.75 69.5 ... 35.25 35.0 year int64 8B 2020 * longitude (longitude) float64 2kB -10.0 -9.75 ... 40.75 41.0
This sequence of selections greatly reduces the amount of data that will be downloaded. For the moment, the bahaviour is still lazy.
To estimate the size of the download, we can use the costing.py
module. This estimate must be done before we apply any reduction operation.
import costing
costing.estimate_download_size(ds.demographic_totals, europe_population_2020)
estimated_needed_chunks: 12 estimated_memory_size: 0.195 GB estimated_download_size: 0.02 GB
3. Data reduction¶
For the selected year, we compute the sum of all population bands:
europe_total_population_2020 = europe_population_2020.sum("age_band_lower_bound")
europe_total_population_2020
<xarray.DataArray 'demographic_totals' (latitude: 141, longitude: 205)> Size: 116kB dask.array<sum-aggregate, shape=(141, 205), dtype=float32, chunksize=(64, 64), chunktype=numpy.ndarray> Coordinates: * latitude (latitude) float64 1kB 70.0 69.75 69.5 69.25 ... 35.5 35.25 35.0 year int64 8B 2020 * longitude (longitude) float64 2kB -10.0 -9.75 -9.5 ... 40.5 40.75 41.0
4. Data downalod¶
This is the phase where we explicitly trigger the download of the data. Remember to assign the return of the compute()
function to a new variable, so that the data is kept in memory.
%%time
europe_total_population_2020_computed = europe_total_population_2020.compute();
CPU times: user 839 ms, sys: 184 ms, total: 1.02 s Wall time: 875 ms
5. Visualization¶
Finally, we can easily plot the total population in Europe for the year 2020:
europe_total_population_2020_computed.plot(vmax=300_000, cmap="YlGnBu_r")
<matplotlib.collections.QuadMesh at 0x7f7447fda2d0>