Tuesday, August 30th, 2022 (about 2 years ago)
TLDR: Pint-Xarray supports unit-aware operations by wrapping pint arrays, so your code can automatically track the physical units that your data represents:
1distance = xr.DataArray(10).pint.quantify("metres") 2time = xr.DataArray(4).pint.quantify("seconds") 3 4distance / time 5
Out: <xarray.DataArray ()> <Quantity(2.5, 'meter / second')>
All quantities in science have units, whether explicitly or implicitly. (And even dimensionless quantities like ratios still technically have units.)
Getting our units right is finicky, and can very easily go unnoticed in our code.
Even worse, the consequences of getting units wrong can be huge!
The most famous example of a units error has to be NASA's $125 million Mars Climate Orbiter, which in 1999 burned up in the Martian atmosphere instead of successfully entering orbit around Mars. A trajectory course correction had gone wrong, and the error was eventually traced back to a units mismatch: the engineers at Lockheed Martin expressed impulse in pound-force seconds, whereas the engineers at JPL assumed the impulse value their part of the software received was in SI newton seconds.
Newspaper cartoon depicting the incongruence in the units used by NASA and Lockheed Martin scientists that led to the Mars Climate Orbiter disaster.
We should take stories like this seriously: If we can automatically track units we can potentially eliminate a whole class of possible errors in our scientific work...
There are a few packages for handling units in python (notably unyt and astropy.units), but for technical reasons we began units integration in Xarray with pint. These various packages work by providing a numerical array type that acts similarly to a NumPy array, and is intended to plug in and replace the raw NumPy array (a so-called "duck array type").
Pint provides the Quantity
object, which is a normal numpy array combined with a pint.Unit
:
1q = np.array([6, 7]) * pint.Unit('metres') 2print(repr(q)) 3
Out: <Quantity([6 7], 'meter')>
Pint Quantities act like NumPy arrays, except that the units are carried around with the arrays, propagated through operations, and checked during operations involving multiple quantities.
Thanks to the tireless work of Xarray core developer Justus Magin, you can now enjoy this automatic unit-handling in Xarray!
Once you create a unit-aware Xarray object (see below for how) you can see the units of the data variables displayed as part of the printable representation. You also immediately get the key benefits of Pint:
Units are propagated through arithmetic, and new quantities are built using the units of the inputs:
1distance = xr.DataArray(10).pint.quantify("metres") 2time = xr.DataArray(4).pint.quantify("seconds") 3 4distance / time 5
Out: <xarray.DataArray ()> <Quantity(2.5, 'meter / second')>
Dimensionally inconsistent units are caught automatically:
1apples = xr.DataArray(10).pint.quantify("kg") 2oranges = xr.DataArray(200).pint.quantify("cm^3") 3 4apples + oranges 5
Out: DimensionalityError: Cannot convert from 'kilogram' ([mass]) to 'centimeter ** 3' ([length] ** 3)
Unit conversions become simple:
1walk = xr.DataArray(500).pint.quantify('miles') 2 3walk.pint.to('parsecs') 4
Out: <xarray.DataArray ()> <Quantity(2.6077643524162074e-11, 'parsec')>
With these features, you can build code that automatically propagates units and converts them where necessary to stay consistent. For example, the problem of the NASA orbiter could have been prevented by explicitly converting to the correct units at the start
1def jpl_trajectory_code(impulse): 2 3 # Defensively check units first 4 impulse = impulse.pint.to("Newton * seconds") 5 6 # This function we called here will only compute the correct result if supplied input in units of Newton-seconds, 7 # but that's fine because we already converted the values to be in the correct units! 8 propagated_position = some_rocket_science(impulse) 9 10 return propagated_position 11
Note: We are adding new features to make specifying the units of parameters of existing library functions more slick.
In the abstract, tracking units like this is useful in the same way that labelling dimensions with Xarray is useful: it helps us avoid errors by relieving us of the burden of remembering arbitrary information about our data.
The easiest way to create a unit-aware Xarray object is to use the helper package we made: pint-xarray.
Once you import pint_xarray
you can access unit-related functionality via .pint
on any DataArray
or Dataset
(this works via Xarray's accessor interface).
Above we have seen examples of quantifying explicitly, where we specify the units in the call to .quantify()
.
We can do this for multiple variables too, and we can also pass pint.Unit
instances:
1ds = xr.Dataset({'a': 2, 'b': 10}) 2 3ds.pint.quantify({'a': 'kg', 4 'b': pint.Unit('moles')}) 5
Out: <xarray.Dataset> Dimensions: () Data variables: a int64 [kg] 2 b int64 [mol] 10
Alternatively, we can quantify from the object's .attrs
, automatically reading the metadata which xarray objects carry around.
If nothing is passed to .quantify()
, it will attempt to parse the .attrs['units']
entry for each data variable.
This means that for scientific datasets which are stored as files with units in their attributes (which netCDF and Zarr can do for example), using Pint with Xarray becomes as simple as:
1import pint_xarray 2 3ds = open_dataset(filepath).pint.quantify() 4
To convert our pint arrays back into NumPy arrays, we can use .dequantify
.
This will strip the units from the arrays and replace them into the .attrs['units']
of each variable.
This is useful when we want to save our data back to a file, as it means that the current units will be preserved in the attributes of a netCDF file (or Zarr store etc.), as long as we just do ds.pint.dequantify().to_netcdf(...)
.
So Xarray can wrap Dask arrays, and now it can wrap Pint quantities… Can we use both together? Yes!
You can get a unit-aware, Dask-backed array either by .pint.quantify()
-ing a chunked array, or you can .pint.chunk()
a quantified array.
(If you have Dask installed, then open_dataset(f, chunks={}).pint.quantify()
will already give you a Dask-backed, quantified array.)
From there you can .compute()
the Dask-backed objects as normal, and the units will be retained.
(Under the hood we now have an xarray.DataArray
wrapping a pint.Quantity
, which wraps a dask.array.Array
, which wraps a numpy.ndarray
.
This "multi-nested duck array" approach can be generalised to include other array libraries (e.g. scipy.sparse
), but requires coordination between the maintainers of the libraries involved.)
We would love to be able to promote Xarray indexes to Pint Quantities, as that would allow you to select data subsets in a unit-aware manner like:
1da = xr.DataArray(name='a', data=[0, 1, 2], dims='x', coords={'x': [1000, 2000, 3000]}) 2da = da.pint.quantify({'a': 'Pa', 'x': 'm'}) 3 4da.pint.sel(x=2 * 'km') 5
Unfortunately this will not possible until the ongoing work to extend Xarray to support explicit indexes is complete.
In the meantime pint-xarray offers a workaround. If you tell .quantify
the units you wish an index to have, it will store those in .attrs["units"]
instead.
1time = xr.DataArray([0.1, 0.2, 0.3], dims='time') 2distance = xr.DataArray(name='distance', 3 data=[10, 20, 25], 4 dims=['time'], 5 coords={'time': time}) 6distance = distance.pint.quantify({'distance': 'metres', 7 'time': 'seconds'}) 8print(distance.coords['time'].attrs) 9
Out: {'units': <Unit('second')>}
This allows us to provide conveniently wrapped versions of common xarray methods like .sel
, so that you can still select subsets of data in a unit-aware fashion like this:
1distance.pint.sel(time=200 * pint.Unit('milliseconds')) 2
Out: <xarray.DataArray 'distance' ()> <Quantity(20, 'meter')> Coordinates: time float64 200.0
Observe how the .pint.sel
operation has first converted 200 milliseconds to 0.2 seconds, before finding the distance value that occurs at a time position of 0.2 seconds.
This wrapping is currently necessary for any operation which needs to be aware of the units of a dimension coordinate of the dataarray, or any xarray operation which relies on an external library (such as calling scipy
in .integrate
).
Different fields tend to have different niche conventions about how certain units are defined. By default, Pint doesn't understand all the unusual units and conventions we use in geosciences. But Pint is customisable, and with the help of cf-xarray we can teach it about these geoscience-specific units.
If we import cf_xarray.units
(before import pint_xarray
) then we can quantify
example climate data from the Pangeo Project's CMIP6 catalog:
1import xarray as xr 2import cf_xarray.units 3import pint_xarray 4 5ds = xr.open_dataset('gs://cmip6/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r2i1p1f1/Amon/sfcWind/gn/v20200226/', engine='zarr') 6ds = ds.pint.quantify() 7 8squared_wind = ds['sfcWind'] ** 2 9squared_wind.pint.units 10
Out: <Unit('meter ** 2 / second ** 2')>
Here (thanks to cf_xarray
) pint has successfully interpreted the CF-style units 'm s-1'
, then automatically changed them when we squared the wind speed.
We can complete our real-world example by plotting the data in its new units:
1import cartopy.crs as ccrs 2import matplotlib.pyplot as plt 3 4p = squared_wind.isel(time="2014-01").plot( 5 subplot_kws=dict(projection=ccrs.Orthographic(-80, 35), facecolor="gray"), 6 transform=ccrs.PlateCarree(), 7) 8p.axes.set_global() 9p.axes.coastlines() 10plt.show() 11
where xarray.plot
has detected the Pint units automatically.
Please have a go! You will need xarray (v2022.03.0+), pint (0.18+), and pint-xarray (0.3+).
Please also tell us about any bugs you find, or documentation suggestions you have on the Xarray or pint-xarray issue trackers. If you have usage questions you can raise them there, on the Xarray discussions page, or on the Pangeo Discourse forum.
The work here to allow Xarray to wrap Pint objects is part of a broader effort to generalise Xarray to handle a wide variety of data types (so-called "duck array wrapping"). Along with the incoming support for flexible indexes, we are excited for all the new features that this will enable for Xarray users!
Tom Nicholas' work on this project was by the Gordon and Betty Moore Foundation Data-Driven Discovery program, via a grant to the Columbia Climate Data Science Lab.