Thursday, December 19th, 2024
xarray.DataTree has been released in v2024.10.0, and the prototype xarray-contrib/datatree repository archived, after collaboration between the xarray team and the NASA ESDIS project. 🤝
The DataTree concept allows for organizing heterogeneous collections of scientific data in the same way that a nested directory structure facilitates organizing large numbers of files on disk. It does so in a way that preserves common structure between data in the collections, such as aligned arrays and common coordinates.
For those familiar with netCDF4/Zarr groups, a DataTree can also be thought of as an in-memory representation of a file's group structure. Xarray users have been asking for a way to handle multiple netCDF4 groups since at least 2016!
DataTree enables xarray to be used for various new use cases, including:
The new high-level container class xarray.DataTree acts like a tree of linked xarray.Dataset objects, with alignment enforced between arrays in parent and child nodes, but not between those in sibling nodes. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores.
For more details please see the high-level description, the dedicated page on hierarchical data, and the section on IO with groups in the xarray documentation.
If you previously used the datatree.DataTree prototype in the xarray-contrib/datatree repository, that has now been archived and will no longer be supported. Instead we encourage you to migrate to the implementation of DataTree that you can import from xarray, following the migration guide.
This was a big feature addition! For a decade there have been 3 core public xarray data structures; now there are 4: Variable, DataArray, Dataset, and now DataTree.
Datatree represents arguably one of the largest new features added to xarray in 10 years - the migration of the existing prototype alone added >10k lines of code across 80 pull requests, and the resulting datatree implementation now contains contributions from at least 25 people.
We also had to resolve some really gnarly design questions to make it work in a way we were happy with.
DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps, and there are some lessons to be learned from the story.
In March 2021, the xarray team submitted a funding proposal to the Chan-Zuckerberg Initiative to develop "TreeDataset", citing bioscience use cases such as microscopy image pyramids. Unfortunately, whilst we've been lucky to receive CZI funding before, on this occasion we didn't win money to work on the datatree idea.
In the absence of dedicated funding for datatree, Tom then used some time whilst at the Climate Data Science Lab at Columbia University to take an initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the xarray-contrib/datatree repository, and steadily gained a small community of intrepid users. It was driven partly by the use case of climate model intercomparison datasets.
A separate repository was chosen for speed of iteration, and to be able to more easily make changes without worrying as much about backwards compatibility as code in xarray's main repo does. However, the separate repo meant that the prototype datatree library was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals.
The prototype then sat there for 2 years, until the NASA ESDIS team approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support.
Amazingly NASA were able to offer the time of 3 engineers: Owen (NASA EOSDIS Evolution and Development 3 (EED-3) contract), Matt (NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC)), and Eni (Goddard Earth Sciences Data and Information Services Center (GES DISC)). So starting in late 2023 the NASA trio worked on migrating the prototype datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core team).
This second stage of development allowed us to reduce the bus factor on the datatree code and sanity check the original approach, and it gave us a chance to make some significant improvements to the design without backwards-compatibility concerns (for example enabling the new "coordinate inheritance" feature).
This development story is different from the more typical scientific grant funding model - how did that work out for us?
The scientific grant model for funding software expects you to present a full idea in a proposal, wait 6-12 months to hopefully get funding for it, then implement the whole thing during the grant period. In contrast datatree evolved over a gradual process of moving from ideas to hacky prototype to robust implementation, with big time gaps for user feedback and experimentation. The migration was completed by developer-users who actually wanted the feature, rather than grant awardees working in service of a separate and maybe-only-theoretical userbase.
Overall, while the migration effort took longer than anticipated, we think it worked out quite well!
This contributing model is more similar to how open-source software has historically been supported by industry, but perhaps because xarray is primarily developed and used by the scientific community, we tend to default to more grant-based funding models.
Overall we think this type of collaboration could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, please reach out to us.
Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. Simply do from xarray import DataTree or call open_datatree(...) on a netCDF4 file / Zarr store containing multiple groups.
Be aware that as xarray.DataTree is still new there will likely be some bugs lurking or places that performance could be improved, as well as as-yet unimplemented features (as there always are)!
A number of other people also contributed to datatree in various ways - particular shoutout to Alfonso Ladino and Etienne Schalk for their dedicated attendance at many of the weekly migration meetings!