Detailed code run¶
Here is a deep dive into the xagg functionality.
[1]:
import xagg as xa
import xarray as xr
import numpy as np
import geopandas as gpd
Intro¶
We’ll be aggregating a gridded dataset onto a set of shapefiles, using an extra set of weights. Specifically, we’ll use:
- gridded: end-of-century month-of-year average temperature projections from a climate model (CCSM4)
- shapefiles: US counties
- additional weights: global gridded population density (GPW, 30-min resolution)
This is a setup you might use when, for example, projecting the impact of temperature on some human variable (say, mortality) for which you have data at the US county level. Since your mortality data is likely at the county level, you need to aggregate the gridded climate model output to counties - i.e., what is the average temperature over each county? This code will calculate which pixels overlap each county - and by how much - allowing an area-averaged value for monthly temperature at the county level.
However, you also care about where people live - so you’d like to additionally weight your temperature estimate by a population density dataset. This code easily allows such additional weights. The resulting output is a temperature value for each month at each county, weighted by both the overlap of individual pixels and the population density in those pixels. (NB: GPWv4 just averages a political unit’s population over a pixel grid, so it might not be the best product in this particular use case, but is used as a sample here)
Let’s get started.
[2]:
# Load some climate data as an xarray dataset
ds = xr.open_dataset('../../data/climate_data/tas_Amon_CCSM4_rcp85_monthavg_20700101-20991231.nc')
[5]:
# Load US counties shapefile as a geopandas GeoDataFrame
gdf = gpd.read_file('../../data/geo_data/UScounties.shp')
[7]:
# Load global gridded population data from GPW
ds_pop = xr.open_dataset('../../data/pop_data/pop2000.nc')
NB: the GPW file above has been pre-processed by subsampling to ``raster=0`` (the 2000 population) and renaming the primary variable to ``pop`` for ease of use.
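That pre-processing is not part of this notebook, but it roughly amounts to the following sketch (the input filename, the index-based selection, and the original GPW variable name below are placeholders/assumptions, not the actual ones):
[ ]:
# Hypothetical sketch of the GPW pre-processing described above
# (filename and original variable name are placeholders)
ds_pop_raw = xr.open_dataset('gpw_population_density_30min.nc')
ds_pop_2000 = (ds_pop_raw
               .isel(raster=0)                          # raster=0: the year-2000 population layer
               .rename({'population_density': 'pop'}))  # rename the primary variable to 'pop'
ds_pop_2000.to_netcdf('../../data/pop_data/pop2000.nc')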
Calculating area weights between a raster grid and polygons¶
First, xagg has to figure out how much each pixel overlaps each polygon. This process requires a few steps:
Get everything in the right format.
Gridded data comes in all shapes and sizes. xagg is ready to deal with most common grid naming conventions - so no matter whether your lat and lon variables are called ‘Latitude’ and ‘Longitude’ or ‘y’ and ‘x’ or many options in between, as long as they’re in xarray Datasets or DataArrays, they’ll work. Behind the scenes, longitude values are also forced to -180:180 (from 0:360, if applicable), just to make sure everything is operating in the same coordinate system.
Build polygons for each pixel
To figure out how much each pixel overlaps each polygon, pixel polygons have to be constructed. If your gridded variable already has “lat_bnds” and “lon_bnds” (giving the vertices of each pixel) explicitly included in the xr.Dataset, then those are used. If none are found, “lat_bnds” and “lon_bnds” are constructed by assuming the vertices are halfway between the coordinates in degrees (see the sketch after this list). If an additional weighting is used, the weighting dataset and your gridded data have to be homogenized at this stage. By default, the weighting dataset is regridded to your gridded data using xesmf. Future versions will also allow regridding the gridded data to the weighting dataset here (it’s already accounted for in some of the functions, but not all). To avoid creating gigantic geodataframes with pixel polygons, the dataset is by default subset to a bounding box around the shapefiles first. In the aggregating code below, this subsetting is taken into account, and the input ds into xa.aggregate is matched to the original source grid on which the overlaps were calculated.
Calculate area overlaps between each pixel and each polygon
Now, the overlap between each pixel and each polygon is calculated. Using geopandas’ excellent polygon boolean operations and area calculations, the intersection between the raster grid and the polygon is calculated. For each polygon, the coordinates of each pixel that intersects it are saved, as is the relative area of that overlap (as an example, if you had a county the size and shape of one pixel, but located half in one pixel and half in another, those two pixels would be saved, and their relative area would be 0.5 each). Areas are calculated using the WGS84 geoid.
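Here is the sketch referenced in step 2: a rough, plain-numpy illustration of how pixel bounds can be constructed from coordinate midpoints when no “lat_bnds”/“lon_bnds” are present. This is illustrative only, not xagg’s internal code, and it assumes a 1-D lat coordinate:
[ ]:
# Illustrative only: build cell edges assuming vertices lie halfway between coordinate centers
lat = ds['lat'].values
lat_edges = np.concatenate([
    [lat[0] - (lat[1] - lat[0]) / 2],      # extrapolate the first edge
    (lat[:-1] + lat[1:]) / 2,              # midpoints between neighboring centers
    [lat[-1] + (lat[-1] - lat[-2]) / 2],   # extrapolate the last edge
])
lat_bnds = np.stack([lat_edges[:-1], lat_edges[1:]], axis=1)  # shape (nlat, 2): lower/upper bound per pixel

# Longitudes are wrapped to -180:180 before any of this, roughly:
# ds = ds.assign_coords(lon=(((ds['lon'] + 180) % 360) - 180)).sortby('lon')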
[8]:
# Calculate overlaps
weightmap = xa.pixel_overlaps(ds,gdf,weights=ds_pop.pop)
creating polygons for each pixel...
regridding weights to data grid...
/Users/kevinschwarzwald/opt/anaconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataarray.py:746: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
return key in self.data
/Users/kevinschwarzwald/opt/anaconda3/envs/test/lib/python3.9/site-packages/xesmf/frontend.py:466: FutureWarning: ``output_sizes`` should be given in the ``dask_gufunc_kwargs`` parameter. It will be removed as direct parameter in a future version.
dr_out = xr.apply_ufunc(
calculating overlaps between pixels and output polygons...
success!
Aggregating gridded data to the polygons using the area weights (and other weights) calculated above¶
Now that we know which pixels overlap which polygons and by how much (and what the value of the population weight for each pixel is), it’s time to aggregate data to the polygon level. xagg will assume that all variables in the original ds that have lat and lon coordinates should be aggregated. These variables may have extra dimensions (3-D variables (i.e. lon x lat x time) are supported; 4-D etc. should be supported but haven’t been tested yet - the biggest issue may be in exporting).
Since we included an additional weighting grid, this dataset is included in weightmap from above and is seamlessly integrated into the weighting scheme.
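Conceptually, the aggregated value for a given polygon (and month) is an average over the overlapping pixels, weighted by both the overlap area and the additional (population) weight. A minimal sketch of that calculation, using made-up numbers rather than values from this dataset:
[ ]:
# Illustrative values for one polygon overlapped by three pixels
tas_pix = np.array([264.0, 265.5, 263.2])   # pixel temperatures (K)
overlap = np.array([0.60, 0.30, 0.10])      # relative overlap area of each pixel with the polygon
pop     = np.array([12.0, 3.0, 0.5])        # additional weight (e.g. population density)

w = overlap * pop          # combined weight per pixel
w = w / w.sum()            # normalize so the weights sum to 1

tas_poly = (w * tas_pix).sum()   # area- and population-weighted average for this polygon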
[9]:
# Aggregate
aggregated = xa.aggregate(ds,weightmap)
adjusting grid... (this may happen because only a subset of pixels were used for aggregation for efficiency - i.e. [subset_bbox=True] in xa.pixel_overlaps())
grid adjustment successful
aggregating tas...
all variables aggregated to polygons!
Converting aggregated data¶
Now that the data is aggregated, we want it in a usable format.
Supported formats for converting include:
- xarray Dataset (using .to_dataset())
  - Grid dimensions from the original dataset are replaced with a single dimension for polygons - by default called “poly_idx” (change this with the loc_dim=... option). Aggregated variables keep their non-grid dimensions unchanged, with their grid dimension replaced as above.
  - All original fields from the geodataframe are kept as poly_idx x 1 variables.
- pandas Dataframe (using .to_dataframe())
  - All original fields from the geodataframe are kept; the aggregated variables are added as separate columns. If the aggregated variables have a 3rd dimension, they are reshaped wide, with procedurally generated column names (just [var]0, [var]1, … for now).

(the “raw” form of the geodataframe used to create these can also be directly accessed through aggregated.agg)
[10]:
# Example as a dataset
ds_out = aggregated.to_dataset()
ds_out
[10]:
<xarray.Dataset>
Dimensions:     (month: 12, pix_idx: 3141)
Coordinates:
  * pix_idx     (pix_idx) int64 0 1 2 3 4 5 6 ... 3135 3136 3137 3138 3139 3140
  * month       (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    NAME        (pix_idx) object 'Lake of the Woods' 'Ferry' ... 'Broomfield'
    STATE_NAME  (pix_idx) object 'Minnesota' 'Washington' ... 'Colorado'
    STATE_FIPS  (pix_idx) object '27' '53' '53' '53' ... '02' '02' '02' '08'
    CNTY_FIPS   (pix_idx) object '077' '019' '065' '047' ... '240' '068' '014'
    FIPS        (pix_idx) object '27077' '53019' '53065' ... '02068' '08014'
    tas         (pix_idx, month) float64 264.0 268.9 274.0 ... 283.5 276.4 270.4
[11]:
# Example as a dataframe
df_out = aggregated.to_dataframe()
df_out
[11]:
 | NAME | STATE_NAME | STATE_FIPS | CNTY_FIPS | FIPS | tas0 | tas1 | tas2 | tas3 | tas4 | tas5 | tas6 | tas7 | tas8 | tas9 | tas10 | tas11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Lake of the Woods | Minnesota | 27 | 077 | 27077 | 263.978006 | 268.887868 | 274.012152 | 283.158717 | 290.630598 | 297.850779 | 302.038199 | 300.327744 | 293.465816 | 283.815233 | 275.141634 | 266.054430 |
1 | Ferry | Washington | 53 | 019 | 53019 | 271.780440 | 275.618485 | 276.934183 | 279.826777 | 286.621100 | 293.757010 | 299.056368 | 297.131708 | 289.844308 | 281.633456 | 276.714475 | 272.242004 |
2 | Stevens | Washington | 53 | 065 | 53065 | 273.217250 | 276.940380 | 278.414225 | 281.319652 | 287.817911 | 294.926457 | 300.903109 | 299.304529 | 292.245363 | 283.273956 | 278.063277 | 273.666181 |
3 | Okanogan | Washington | 53 | 047 | 53047 | 271.831071 | 275.586124 | 276.689357 | 279.324166 | 285.771338 | 292.635899 | 297.756402 | 295.956748 | 289.177685 | 281.440422 | 276.654779 | 272.275232 |
4 | Pend Oreille | Washington | 53 | 051 | 53051 | 272.092353 | 275.888818 | 277.346070 | 280.446389 | 287.268406 | 294.357705 | 299.851527 | 297.965815 | 290.622763 | 282.058301 | 276.996473 | 272.484589 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3136 | Skagway-Hoonah-Angoon | Alaska | 02 | 232 | 02232 | 273.605147 | 275.477240 | 276.792992 | 279.194054 | 283.764519 | 288.635583 | 290.038044 | 289.689574 | 286.058608 | 280.958010 | 276.904109 | 274.048888 |
3137 | Yukon-Koyukuk | Alaska | 02 | 290 | 02290 | 264.534558 | 264.088869 | 267.621423 | 273.426228 | 281.649435 | 289.319370 | 288.936030 | 286.209616 | 280.807791 | 273.683875 | 266.722855 | 265.538685 |
3138 | Southeast Fairbanks | Alaska | 02 | 240 | 02240 | 263.919168 | 263.899079 | 266.771514 | 272.144709 | 279.739283 | 287.625174 | 287.933732 | 285.436821 | 279.768225 | 272.377963 | 265.702026 | 264.414302 |
3139 | Denali | Alaska | 02 | 068 | 02068 | 265.049599 | 264.794849 | 268.193156 | 273.612534 | 281.097223 | 288.917064 | 288.898311 | 286.233612 | 280.504391 | 273.142605 | 266.534916 | 265.575153 |
3140 | Broomfield | Colorado | 08 | 014 | 08014 | 270.803864 | 273.430206 | 275.955505 | 280.790070 | 287.303619 | 292.830048 | 297.615662 | 297.646820 | 292.368988 | 283.544708 | 276.383606 | 270.444855 |
3141 rows × 17 columns
Exporting aggregated data¶
For reproducibility and code simplicity, you will likely want to save your aggregated data. In addition, many researchers use multiple languages or software packages as part of their workflow; for example, STATA or R for regression analysis, or QGIS for spatial analysis, and need to be able to transfer their work to these other environments.
xagg has built-in export functions that allow the export of aggregated data to:
- NetCDF
- csv (for use in STATA, R)
- shapefile (for use in GIS applications)
Export to netCDF¶
The netCDF export functionality saves all aggregated variables by replacing the grid dimensions (lat, lon) with a single location dimension (called poly_idx, but this can be changed with the loc_dim= argument).
Other dimensions (e.g. time) are kept as they were originally in the grid variable.
Fields in the inputted polygons (e.g., FIPS codes for the US Counties shapefile used here) are saved as additional variables. Attributes from the original xarray structure are kept.
[ ]:
# Export to netcdf
aggregated.to_netcdf('file_out.nc')
Export to .csv¶
The .csv output functionality saves files in a polygon (rows) vs. variables (columns) format. Each aggregated variable and each field in the original inputted polygons are saved as columns. Named attributes in the inputted netcdf file are not included.
Currently .csvs are only saved “wide” - i.e., a lat x lon x time variable tas, aggregated to location x time, would be reshaped wide so that each timestep is saved in its own column, named tas0, tas1, and so forth. (A sketch of reshaping this back to long format follows the export cell below.)
[ ]:
# Export to csv
aggregated.to_csv('file_out.csv')
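If you need the data long (one row per county-month) in pandas rather than wide, one option is pandas.wide_to_long after reading the .csv back in. A sketch, assuming the export above has been run and that FIPS uniquely identifies each county:
[ ]:
import pandas as pd

# Read the exported file back in, keeping FIPS codes as strings to preserve leading zeros
df_wide = pd.read_csv('file_out.csv', dtype={'FIPS': str})

# Reshape tas0 ... tas11 to a single 'tas' column; 'month' will run from 0 to 11
df_long = pd.wide_to_long(df_wide, stubnames='tas', i='FIPS', j='month').reset_index()
df_long[['FIPS', 'NAME', 'month', 'tas']].head()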
Export to shapefile¶
The shapefile export functionality keeps the geometry of the originally input polygons, and adds the aggregated variables as fields.
Similar to the .csv export above, if aggregated variables have a dimension beyond their location dimension (e.g., time), each step in that dimension is saved in a separate field, named after the variable and the integer index along that dimension (e.g., tas0, tas1, etc. for a variable tas).
Named attributes in the inputted netcdf file are not included.
[ ]:
# Export to shapefile
aggregated.to_shp('file_out.shp')
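As a quick sanity check (assuming the cell above has been run), the exported shapefile can be read back in with geopandas; plotting one of the monthly fields requires matplotlib:
[ ]:
# Read the exported shapefile back in and map the first monthly field
gdf_out = gpd.read_file('file_out.shp')
gdf_out.plot(column='tas0', legend=True)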