ILAMB intake Catalog

01 Dec 2022 - Nathan Collier

We are pleased to announce that the reference datasets we have reprocessed, which can be downloaded in bulk via ilamb-fetch, are now also available as an intake catalog. Intake is a lightweight set of Python tools for loading and sharing data in data science projects. It lets you write Python code that references the ILAMB datasets by name; intake then manages the download, using a cached copy if one is available on your system.

To use this catalog, first install intake. Then load the catalog in your Python script by pointing to the remote file on GitHub.

import intake
cat = intake.open_catalog("https://raw.githubusercontent.com/nocollier/intake-ilamb/main/ilamb.yaml")

You can see a list of all the data source entries by typing print(list(cat)), or you can treat the catalog as a dictionary. Here I start to reference an entry by opening a square bracket and typing gpp, after which I hit the tab key. This shows all the sources that start with these characters, allowing simple searches by variable name. When possible, we have used the CMOR names of variables.

cat['gpp<TAB>
            gpp | FLUXCOM    
            gpp | FLUXNET2015
            gpp | WECANN     
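
Tab completion only helps in an interactive session. In a script, a similar search can be sketched by filtering the key names yourself. The helper below is hypothetical (it is not part of intake or ILAMB) and assumes that catalog keys follow the 'variable | SOURCE' pattern seen in the three entries above. Unlike tab completion, it matches the variable name exactly rather than by prefix. The list example_keys stands in for list(cat) so that the sketch runs without a network connection.

```python
def sources_for_variable(keys, variable):
    """Return the source names of all keys shaped like 'variable | SOURCE'.

    Hypothetical helper; assumes the key pattern shown in the listing above.
    Keys without a '|' separator are skipped.
    """
    sources = []
    for key in keys:
        name, sep, source = key.partition("|")
        if sep and name.strip() == variable:
            sources.append(source.strip())
    return sources


# Stand-in for list(cat); the real catalog contains many more entries.
example_keys = ["gpp | FLUXCOM", "gpp | FLUXNET2015", "gpp | WECANN"]
print(sources_for_variable(example_keys, "gpp"))
```

With the catalog loaded, the call would be sources_for_variable(list(cat), "gpp").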

By selecting a key relating to the dataset we wish to use, we get back an intake data source object (the variable src below). The source objects have instructions embedded in them for automatically downloading the data when you use the read() method.

src = cat['gpp | FLUXCOM']
%time gpp = src.read()  # 9.84 [s]
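
Note that %time is an IPython/Jupyter magic and is not available in a plain Python script. A minimal sketch of the same measurement using only the standard library follows. The helper name timed_call is made up for illustration, and the lambda is a stand-in for src.read so that the block runs without the catalog.

```python
import time


def timed_call(func):
    """Call func() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Stand-in workload; with the catalog loaded one might instead write:
#   gpp, elapsed = timed_call(src.read)
result, elapsed = timed_call(lambda: sum(range(1000)))
print(f"result={result}, elapsed={elapsed:.6f} [s]")
```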

The first time you read a dataset on a given system, the data is downloaded; the read() above took almost 10 seconds. On subsequent calls, intake uses its cache and the data is read much faster from your local system. Each data source in this catalog returns an xarray dataset that you can then use in your analysis scripts.

print(gpp)
<xarray.Dataset>
Dimensions:      (time: 408, nb: 2, lat: 360, lon: 720)
Coordinates:
  * time         (time) object 1980-01-16 12:00:00 ... 2013-12-16 12:00:00
  * lat          (lat) float64 89.75 89.25 88.75 88.25 ... -88.75 -89.25 -89.75
  * lon          (lon) float64 -179.8 -179.2 -178.8 -178.2 ... 178.8 179.2 179.8
Dimensions without coordinates: nb
Data variables:
    time_bounds  (time, nb) object 1980-01-01 00:00:00 ... 2014-01-01 00:00:00
    gpp          (time, lat, lon) float32 9.969e+36 9.969e+36 ... 9.969e+36
Attributes:
    title:         FLUXCOM (RS+METEO) Global Land Carbon Fluxes using CRUNCEP...
    version:       1
    institutions:  Department Biogeochemical Integration, Max Planck Institut...
    source:        Data generated by Artificial Neural Networks and forced wi...
    history:       \n2020-08-25: downloaded source from ftp://ftp.bgc-jena.mp...
    references:    \n@ARTICLE{Jung2017,\n  author = {Jung, M., M. Reichstein,...
    comments:      \ntime_period: 1980-01 through 2013-12; temporal_resolutio...
    convention:    CF-1.8

This allows you to write analysis scripts using ILAMB data that will run anywhere with an internet connection, without the need to set up and download the data separately.