Hands-on exercise, step 3: aggregating the data#

Aggregating the weather data#

Tools exist in both R and Python to aggregate gridded data according to a shapefile. In particular, xagg in Python and stagg in R handle partial grid-cell overlap and arbitrary weighting grids, and we recommend using one of these tools irrespective of the language being used elsewhere.

Setting up the environment#

The approach uses xarray for gridded data, geopandas to work with shapefiles, and xagg to aggregate gridded data onto shapefiles.

Install xagg according to the instructions from GitHub.

You will need to download a shapefile for the US counties. We use one from the ESRI community website: https://community.esri.com/ccqpr47374/attachments/ccqpr47374/enterprise-portal/4124/1/UScounties.zip

Save its contents to a folder geo_data in the data directory. The code below assumes it is run from a working directory alongside the data directory (note the ../data/ relative paths).
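If you prefer to script the download, here is a minimal Python sketch, assuming the URL above is still live and the directory layout described above:

import urllib.request
import zipfile

# Download the county shapefile archive and unpack it into the
# geo_data folder described above
url = ('https://community.esri.com/ccqpr47374/attachments/ccqpr47374/'
       'enterprise-portal/4124/1/UScounties.zip')
urllib.request.urlretrieve(url, 'UScounties.zip')
with zipfile.ZipFile('UScounties.zip') as zf:
    zf.extractall('../data/geo_data')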

Begin your code as follows:

import xarray as xr
import geopandas as gpd
import xagg as xa
import numpy as np

The approach uses raster to work with gridded data, tigris to work with shapefiles, and stagg to aggregate gridded data onto shapefiles.

Install stagg according to the instructions from GitHub.

Begin your code as follows:

library(stagg)
library(raster)
library(tigris)
library(dplyr)

Now, load the data from the previous steps and the county shapefile:

# Load temperature data using xarray
ds_tas = xr.open_dataset(
    '../data/climate_data/tas_day_BEST_historical_station_19800101-19891231.nc')

# Load population data using xarray 
ds_pop = xr.open_dataset('../data/pcount/usap90ag.nc')

# Load county shapefiles using geopandas
gdf_counties = gpd.read_file('../data/geo_data/UScounties.shp')
## Load data and expand to global grid
global.extent <- extent(-180, 180, -90, 90)

rr.tas <- stack("../data/climate_data/tas_day_BEST_historical_station_19800101-19891231.nc")
rr.tas.padded <- extend(rr.tas, global.extent)
rr.pop <- raster("../data/pcount/usap90ag.nc")
rr.pop.padded <- extend(rr.pop, global.extent)
rr.pop.padded[is.na(rr.pop.padded)] <- 0

## Load counties
counties <- tigris::counties()
counties$FIPS <- paste0(counties$STATEFP, counties$COUNTYFP)

Transforming the data#

Next, we need to construct any nonlinear transformations of the data.

For our econometric model, we want temperature in both linear and quadratic form, centered around \(20^\circ\) C: \(T-20^\circ C\) and \(T^2 - (20^\circ C)^2\).

ds_tas['tas_adj'] = ds_tas.tas-20
ds_tas['tas_sq'] = ds_tas.tas**2 - 20**2

# xagg aggregates every gridded variable in ds_tas; however, we don't need
# every variable currently in the dataset. Let's drop "tas" (the unadjusted
# temperature) and "land_mask", which is included but not needed for our further analysis.
ds_tas = ds_tas.drop_vars('tas')
ds_tas = ds_tas.drop_vars('land_mask')

With stagg, we will let the library handle this with the staggregate_polynomial function (below).

Create map of pixels onto polygons#

We need to create a weight map of pixels to polygons. For each polygon, we need to know which pixels overlap it and by how much.

xagg and stagg do this by creating a polygon for each pixel in the gridded dataset, taking the intersection between each county polygon and the pixel polygons, and calculating the area of overlap between the county polygon and each pixel that touches it.
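To make the geometry concrete, here is a toy sketch of that overlap computation for a single pixel and a single polygon, using shapely directly rather than xagg or stagg; the coordinates are invented for illustration.

from shapely.geometry import box

pixel = box(0.0, 0.0, 1.0, 1.0)    # a 1-degree grid cell
county = box(0.5, 0.25, 2.0, 2.0)  # a made-up county polygon

# Fraction of the pixel that falls inside the county: 0.375 here
overlap = pixel.intersection(county)
print(overlap.area / pixel.area)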

weightmap = xa.pixel_overlaps(ds_tas, gdf_counties, weights=ds_pop.Population, subset_bbox=False)
grid.weights <- secondary_weights(secondary_raster=rr.pop.padded, grid=rr.tas.padded)
county.weights <- overlay_weights(polygons=counties, polygon_id_col="FIPS", grid=rr.tas.padded, secondary_weights=grid.weights)

Aggregate values onto polygons#

For each county, we want to calculate the weighted average of the temperatures of the pixels that cover the county. Since we included population as a desired weight above, the weight for each pixel is a combination of how much of the pixel overlaps with the county and its population density. In other words, a pixel that covers more of a county and has a higher population density will be weighted more in that county’s temperature average than a pixel that covers less of a county and is more sparsely populated.
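As a toy numeric illustration of that combined weighting (the numbers are invented; the real weights come from the weightmap computed above):

import numpy as np

# Three hypothetical pixels touching one county
area_frac = np.array([0.8, 0.5, 0.1])      # fraction of each pixel inside the county
population = np.array([200., 50., 1000.])  # population weight of each pixel
tas = np.array([21.0, 19.0, 25.0])         # pixel temperatures

# Combined weight: overlap fraction times population, normalized to sum to 1
weights = area_frac * population
weights /= weights.sum()
print((weights * tas).sum())  # area- and population-weighted county mean, ~22.23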

The output of this step now gives, for each county, a 10-year time series of linear and quadratic temperature, properly area- and population-weighted.

Using the weight map calculated above, xagg now aggregates all the gridded variables in ds_tas (the tas_adj and tas_sq we calculated above) onto the county polygons.

The aggregated object returned by xa.aggregate (below) is specific to the xagg package. We need to convert it into a usable form, for example with aggregated.to_dataset() or aggregated.to_csv(); see the xagg docs for more info. Here we use aggregated.to_dataset(), which produces an xarray dataset and is convenient for providing a standardized output.

aggregated = xa.aggregate(ds_tas, weightmap)

# Aggregate the result to the annual level by summing the daily
# transformed temperatures within each year
ds = aggregated.to_dataset()
ds2 = ds.groupby(ds.time.dt.year).sum()
## Shift axis of climate data to conform to stagg expectations
rr.tas.padded2 <- shift(rr.tas.padded, dx = 360)

## Calculate aggregation
county.tas <- staggregate_polynomial(data=rr.tas.padded2, daily_agg="none", time_agg='year', overlay_weights=county.weights, degree=2)

Export this as a .csv file to be used elsewhere#

Finally, we need to export these data to be used elsewhere. For this tutorial, we want to allow a variety of tools, so we’ll export the aggregated data to a .csv (comma-separated values) file, which can easily be read by R, Stata, and most other programming tools.

The result from xagg is shaped ‘wide’: every row is a county, and every column is a timestep of the variables. stagg, on the other hand, aggregates the results to the annual level and provides each county-year as a row. We also want to output these data in a standard form, so that the next step does not depend on the language and library used for this step.

Use aggregated.to_dataset() or aggregated.to_dataframe(), depending on whether you’d like to continue using it in xarray or pandas. The code below provides a standard format for the next step.

ds2['STATE_FIPS'] = ds2.STATE_FIPS.astype(str)
ds2['CNTY_FIPS'] = ds2.CNTY_FIPS.astype(str)

ds2['FIPS'] = xr.apply_ufunc(np.char.add, ds2.STATE_FIPS, ds2.CNTY_FIPS)
ds2['FIPS'] = ds2.FIPS.isel(year=0).drop('year')

# Re-index the dataset by county FIPS code, then export to CSV
ds2 = ds2.swap_dims({'poly_idx': 'FIPS'}).drop('poly_idx')
ds2.to_dataframe().to_csv("../data/climate_data/agg_vars.csv")

The polynomial calculation used by stagg assumes a baseline temperature of \(0^\circ\) C. We will now adjust this to a baseline of \(20^\circ\) C.
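Concretely, staggregate_polynomial returns annual sums of the daily values, so re-centering only requires subtracting the baseline once per day: \(\sum_d (T_d - 20) = \sum_d T_d - 20N\) and \(\sum_d (T_d^2 - 20^2) = \sum_d T_d^2 - 400N\), where \(N\) is the number of days in the year. The code below computes these \(20N\) and \(400N\) terms per year and subtracts them from the stagg output.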

## Only write out valid entries
county.tas.valid <- subset(county.tas, !is.na(order_1))

## Remove 20 deg per day
daysperyear <- table(substring(names(rr.tas), 2, 5))
county.tas.base <- data.frame(year=as.numeric(names(daysperyear)), order_1=20*as.numeric(daysperyear), order_2=(20^2)*as.numeric(daysperyear))

county.tas.final <- county.tas.valid %>% left_join(county.tas.base, by='year', suffix=c('', '.base'))

county.tas.final$tas_adj <- county.tas.final$order_1 - county.tas.final$order_1.base
county.tas.final$tas_sq <- county.tas.final$order_2 - county.tas.final$order_2.base
names(county.tas.final)[2] <- 'FIPS'

write.csv(county.tas.final, '../data/climate_data/agg_vars.csv', row.names=F)
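Whichever language you used, a quick read-back check confirms the standardized format; this sketch assumes pandas is installed and uses the column names produced above:

import pandas as pd

# Read FIPS as a string to preserve any leading zeros; each row
# should be one county-year
df = pd.read_csv('../data/climate_data/agg_vars.csv', dtype={'FIPS': str})
print(df[['FIPS', 'year', 'tas_adj', 'tas_sq']].head())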