xarray logo

Flexible Indexes: Exciting new ways to slice and dice your data!

Tuesday, September 2nd, 2025 (11 days ago)



TL;DR: Over the last few years we've gradually refactored Xarray internals to make coordinate-based data selection and alignment customizable. As a result, Xarray>=2025.6 enables more efficient handling of both traditional and more exotic data structures. In this post we highlight a few examples that take advantage of this new superpower! See the Gallery of Custom Index Examples for more!

Summary schematic from our 2025 SciPy Presentation highlighting new custom Indexes and usecases. Link to full slide deck

Indexing basics#

First thing's first, what is an Index and why is it helpful?

Examples of indexes are all around you and are a fundamental way to organize and simplify access to information. In the United States, if you want a book about Natural Sciences, you can go to your local library branch and head straight to section 500. Or if you're in the mood for a classic novel go to section 800. Connecting thematic labels with numbers ({'Natural Sciences': 500, 'Literature': 800}) is a classic indexing system that's been around for hundreds of years (Dewey Decimal System, 1876). The need for an index becomes critical as the size of data grows - just imagine the time it would take to find a specific novel amongst a million uncategorized books!

The same efficiencies arise in computing. Consider a simple 1D dataset consisting of measurement values M=[10.0, 20.0, 30.0, 40.0, 50.0, 60.0] at six coordinate positions X=[1, 2, 4, 8, 16, 32]. What was our measurement at X = 8? To answer this in code, we could either do a brute-force linear search (or binary search if sorted) through the coordinates array, or we could build a more efficient data structure designed for fast searches --- an Index. A common convenient index is a key:value mapping or "hash table" between the coordinate values and their integer positions i=[0, 1, 2, 3, 4, 5]. Once we identify the index i=3 for our coordinate of interest (X[3] = 8) we use it to lookup our measurement value M[3] = 40.0.

pandas.Index#

Xarray's label-based selection allows a more expressive and simple syntax in which you don't have to think about the index: da.sel(x=8). To accomplish this, Xarray has historically relied on pandas.Index behind the scenes, which is still used by default:

1x = np.array([1, 2, 4, 8, 16, 32])
2m = np.arange(10, 70, 10.0)
3da = xr.DataArray(m, coords={'X': x}, name='M')
4da
5
Loading data...
1da.sel(x=8)
2# 40
3

Alternatives to pandas.Index#

There are many different indexing schemes and ways to generate an index. pandas.Index's approach is roughly similar to running a loop over all coordinate values to create an index:coordinate mapping, optionally identifying duplicates and sorting along the way. But, you might recognize that our example coordinates above can in fact be represented by a function X(i) = 2**i where i is the integer position! Given that function we can quickly get measurement values at any coordinate: M(X=8) = M[log2(8)] = M[3] = 40. Xarray now has a CoordinateTransformIndex to handle this type of on-demand calculation of coordinates!

xarray.RangeIndex#

A simple special case of CoordinateTransformIndex is a RangeIndex where coordinates can be defined by a start, stop, and uniform step size. pandas.RangeIndex only supports integers, whereas Xarray handles floating-point values. Coordinate look-up is performed on-the-fly rather than loading all values into memory up-front when creating a Dataset, which is critical for the example below that has a coordinate array of 7 terabytes!

1from xarray.indexes import RangeIndex
2
3index = RangeIndex.arange(0.0, 1000.0, 1e-9, dim='x') # 7TB coordinate array!
4ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index))
5ds
6
Loading data...

Selection with slices preserves the RangeIndex and does not require loading all the coordinates into memory.

sliced = ds.isel(x=slice(1_000, 50_000, 100))
sliced.x
Loading data...

Third-party custom Indexes#

In addition to a few new built-in indexes, xarray.Index provides an API that allows dealing with coordinate data and metadata in a highly customizable way for the most common Xarray operations such as selalignconcatstack. This is a powerful extension mechanism that is very important for supporting a multitude of domain-specific data structures. Here are a few examples below.

rasterix.RasterIndex#

Earlier we mentioned that coordinates may have a functional representation. For 2D raster images, this function often takes the form of an Affine Transform. The rasterix library extends Xarray with a RasterIndex which computes coordinates for geospatial images such as GeoTiffs via Affine Transform.

Below is a simple example of slicing a large mosaic of GeoTiffs without ever loading the coordinates into memory:

1# 811816322401 values!
2import rasterix
3
4#26475 GeoTiffs represented by a GDAL VRT
5da = (
6    xr.open_dataarray(
7        "https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt",
8        engine="rasterio",
9        parse_coordinates=False,
10    )
11    .squeeze()
12    .pipe(rasterix.assign_index)
13)
14da
15
Loading data...

After the slicing operation, a new Affine is defined. For example, notice how the origin changes below from (-180.0, 84.0) -> (-122.4, -47.1), while the spacing is unchanged (0.000277).

1print('Original geotransform:\n', da.xindexes['x'].transform())
2da_sliced = da.sel(x=slice(-122.4, -120.0), y=slice(-47.1,-49.0))
3print('Sliced geotransform:\n', da_sliced.xindexes['x'].transform())
4
1# Original geotransform:
2Affine(0.0002777777777777778, 0.0, -180.0001389,
3       0.0, -0.0002777777777777778, 84.0001389)
4
5# Sliced geotransform:
6Affine(0.0002777777777777778, 0.0, -122.40013889999999,
7       0.0, -0.0002777777777777778, -47.09986109999999)
8

xproj.CRSIndex#

We often think about metadata providing context for measurement values but metadata is also critical for coordinates! In particular, to align two different datasets we must ask if the coordinates are in the same coordinate system?

There are currently over 7000 commonly used Coordinate Reference Systems (CRS) for geospatial data in the authoritative EPSG database! And of course an infinite number of custom-defined CRSs. xproj.CRSIndex gives Xarray objects an automatic awareness of the coordinate reference system so that operations like xr.align() raises an informative error when there is a CRS mismatch:

1from xproj import CRSIndex
2lons1 = np.arange(-125, -120, 1)
3lons2 = np.arange(-122, -118, 1)
4ds1 = xr.Dataset(coords={'longitude': lons1}).proj.assign_crs(crs=4267)
5ds2 = xr.Dataset(coords={'longitude': lons2}).proj.assign_crs(crs=4326)
6ds1 + ds2
7
1MergeError: conflicting values/indexes on objects to be combined for coordinate 'crs'
2

xvec.GeometryIndex#

A "vector data cube" is an n-D array that has at least one dimension indexed by an array of vector geometries. With the xvec.GeometryIndex, Xarray objects gain functionality equivalent to geopandas' GeoDataFrames! For example, large vector cubes can take advantage of an R-tree spatial index for efficiently selecting vector geometries within a given bounding box. Below is a short code snippet which automatically uses R-tree selection behind the scenes:

1import xvec
2import geopandas as gpd
3from geodatasets import get_path
4
5# Dataset that contains demographic data indexed by U.S. counties
6counties = gpd.read_file(get_path("geoda.natregimes"))
7
8cube = xr.Dataset(
9    data_vars=dict(
10        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
11        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
12    ),
13    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
14).xvec.set_geom_indexes("county", crs=counties.crs)
15cube
16
Loading data...
1# Efficient selection using shapely.STRtree
2from shapely.geometry import box
3
4subset = cube.xvec.query(
5    "county",
6    box(-125.4, 40, -120.0, 50),
7    predicate="intersects",
8)
9
10subset['population'].xvec.plot(col='year');
11

Even more examples!#

Be sure to check out the Gallery of Custom Index Examples for more detailed examples of all the indexes mentioned in this post and more!

What's next?#

While we're extremely excited about what can already be accomplished with the new indexing capabilities, there are plenty of exciting ideas for future work. In a follow-up blog post, we will also illustrate how Xarray's internals interact with the xarray.Index API and how it can be leveraged in order to customize the behavior of some of the most common Xarray operations like indexing and alignment.

We believe the new flexible indexing machinery will increase usage of Xarray across scientific domains and are actively working on examples that hopefully will appeal to astronomers and biologists!

Have an idea for your own custom index? Check out this section of the Xarray documentation and please advertise what you're working on in our gallery of examples.

Acknowledgments#

This work would not have been possible without technical input from the Xarray core team and community! Several developers received essential funding from a CZI Essential Open Source Software for Science (EOSS) grant as well as NASA's Open Source Tools, Frameworks, and Libraries (OSTFL) grant 80NSSC22K0345.

Back to Blog

xarray logo

© 2025, Xarray core developers. Apache 2.0 Licensed.

419155a

TwitterGitHubYouTubeBlog RSS Feed

Xarray

DashboardBlogTeamCiting XarrayRoadmapBrand Assets
Powered by Vercel