Hands-On Exercise, Step 1: Preparing the Weather Data¶
Downloading the data¶
In our example, we will use Berkeley Earth (BEST) data. BEST data is available as national timeseries (area-weighted, so not usually useful even for studying national-level data) and as a 1-degree grid.
Go to the BEST archive, http://berkeleyearth.org/archive/data/
Scroll down to the Gridded Data (ignore the Time Series Data which is listed first).
Under Gridded Data, find “Daily Land (Experimental; 1880 – Recent)”.
Download the Average Temperature (TAVG) data in 1-degree gridded form for 1980 - 1989.
Place this file (
Complete_TAVG_Daily_LatLong1_1980.nc
) in thedata/climate_data
folder.
As a first pre-processing step, we will clip this file to the United States.
Loading the data¶
First, let’s load all of the data into memory. This code assumes that
it (the code) is stored in a file (call it preprocess_best.m
or
preprocess_best.py
) in a directory code/
, a sister to the data/
directory.
library(ncdf4)
nc <- nc_open("../data/climate_data/Complete_TAVG_Daily_LatLong1_1980.nc")
time <- seq(as.Date("1980-01-01"), length.out=nc$dim$time$len, by="1 day")
import numpy as np
import xarray as xr
import datetime as dt
ds = xr.open_dataset('../data/climate_data/Complete_TAVG_Daily_LatLong1_1980.nc')
# Create time variable, which wasn't auto-generated from the netcdf
# due to BEST's ambiguous timing
ds['time'] = (
('time'),dt.datetime(1980,1,1)+np.arange(0,ds.dims['time'])*dt.timedelta(days=1))
data_dir = '../data/climate_data/';
filename = 'Complete_TAVG_Daily_LatLong1_1980.nc';
clim_tmp = ncread([data_dir,filename],'climatology');
anom_tmp = ncread([data_dir,filename],'temperature');
doy_tmp = ncread([data_dir,filename],'day_of_year');
% Also loading the "months" variable - most datasets don't have it, but
% it will make more sophisticated projecting methods easier. This is
% particularly useful because of the gregorian calendar (which includes
% leap days, and therefore would otherwise add an extra step to
% determining which month each datastep is)
months = ncread([data_dir,filename],'month');
Constructing temperature levels¶
The variable temperature
in BEST is actually the temperature
anomaly; the actual temperature is formed by adding it to the
climatology
variable. Unfortunately, the climatology
variable only
accounts for days 1:365 of the year, and ignores leap days (which the
temperature
variable does not). This section doubles the climatology
for Feb 28th to also work on Feb 29th, and creates a tas
variable
that’s the climatology
+ temperature
.
doy <- ncvar_get(nc, 'day_of_year')
tas.anom <- ncvar_get(nc, 'temperature')
tas.clim <- ncvar_get(nc, 'climatology')
tas.clim.all <- tas.clim[,, doy]
tas <- tas.anom + tas.clim.all
import calendar
# Expand climatology to span all days
clim_tmp = xr.DataArray(dims=('time','latitude','longitude'),
coords={'time':ds.time,'latitude':ds.latitude,'longitude':ds.longitude},
data=np.zeros((ds.dims['time'],ds.dims['latitude'],ds.dims['longitude']))*np.nan)
# Sub in variables one year at a time
for yr in np.unique(ds.time.dt.year):
if calendar.isleap(yr):
clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)&(clim_tmp.time.dt.dayofyear<=59)}] = ds.climatology.values[0:59]
clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)&(clim_tmp.time.dt.dayofyear==60)}] = ds.climatology.values[59]
clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)&(clim_tmp.time.dt.dayofyear>60)}] = ds.climatology.values[59:365]
else:
clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)}] = ds.climatology.values
ds['climatology'] = clim_tmp
ds['tas'] = ds['temperature'] + ds['climatology']
ds = ds.drop(['temperature','climatology'])
% BEST data is given in terms of anomalies from the underlying
% climatology. So, the climatology has to be loaded first.
tas = anom_tmp + clim_tmp(:,:,doy_tmp);
clear clim_tmp, anom_tmp, doy_tmp
Subset it geographically¶
We can drop a lot of the global data. This can also happen earlier in the process.
Using the extreme points of the continental United States (see e.g., here) + 1 degree of wiggle room.
longitude <- ncvar_get(nc, 'longitude')
latitude <- ncvar_get(nc, 'latitude')
latlims = c(23, 51)
lonlims <- c(-126,-65)
tas2 <- tas[longitude >= lonlims[1] & longitude <= lonlims[2], latitude >= latlims[1] & latitude <= latlims[2],]
geo_lims = {'lat':[23,51],'lon':[-126,-65]}
ds = ds.sel(latitude=slice(*geo_lims['lat']),longitude=slice(*geo_lims['lon'])).load()
lat = ncread([data_dir,filename],'latitude');
lon = ncread([data_dir,filename],'longitude');
%% Subset to continental United States
lon_idxs = ((lon>=geo_lims(1)) & (lon<=geo_lims(2)));
lat_idxs = ((lat>=geo_lims(3)) & (lat<=geo_lims(4)));
tas = tas(lon_idxs,lat_idxs,:);
lon = lon(lon_idxs);
lat = lat(lat_idxs);
Save the result¶
We will write out the clipped, concatenated data to a single file for future processing. We make some changes in the process to conform to the standards used in CMIP5 datasets.
dimlon <- ncdim_def("lon", "degrees_east", longitude[longitude >= lonlims[1] & longitude <= lonlims[2]], longname='longitude')
dimlat <- ncdim_def("lat", "degrees_north", latitude[latitude >= latlims[1] & latitude <= latlims[2]], longname='latitude')
dimtime <- ncdim_def("time", "days since 1980-01-01 00:00:00", as.numeric(time - as.Date("1980-01-01")),
unlim=T, calendar="proleptic_gregorian")
vartas <- ncvar_def("tas", "C", list(dimlon, dimlat, dimtime), NA, longname="temperature")
ncnew <- nc_create("tas_day_BEST_historical_station_19800101-19891231.nc", vartas)
ncvar_put(ncnew, vartas, tas2)
nc_close(ncnew)
output_fn = '../data/climate_data/tas_day_BEST_historical_station_19800101-19891231.nc'
ds.attrs['origin_script']='preprocess_best.py'
# Rename to lat/lon to fit CMIP5 defaults
ds = ds.rename({'latitude':'lat','longitude':'lon'})
ds.to_netcdf(output_fn)
% Set output filename
fn_out = '../data/climate_data/tas_day_BEST_historical_station_19800101-19891231.nc';
% Write temperature data to netcdf
nccreate(fn_out,'tas','Dimensions',{'lon',size(tas,1),'lat',size(tas,2),'time',size(tas,3)})
ncwrite(fn_out,'tas',tas);
% These attributes aren't strictly necessary, but very helpful to have in
% the file
ncwriteatt(fn_out,'tas','long_name','near-surface air temperature');
ncwriteatt(fn_out,'tas','units','C');
% Write lat/lon data to netcdf file
nccreate(fn_out,'lat','Dimensions',{'lat',size(tas,2)})
nccreate(fn_out,'lon','Dimensions',{'lon',size(tas,1)})
ncwrite(fn_out,'lat',lat);
ncwrite(fn_out,'lon',lon);
ncwriteatt(fn_out,'lat','long_name','latitude')
ncwriteatt(fn_out,'lon','long_name','longitude')
ncwriteatt(fn_out,'lat','units','degrees')
ncwriteatt(fn_out,'lon','units','degrees')
% Write time dimension to netcdf file
time_idxs = 0:(size(tas,3)-1);
nccreate(fn_out,'time','Dimensions',{'time',size(tas,3)})
ncwrite(fn_out,'time',time_idxs)
ncwriteatt(fn_out,'time','units','days since 1980-01-01')
nccreate(fn_out,'month','Dimensions',{'time',size(tas,3)})
ncwrite(fn_out,'month',months)
ncwriteatt(fn_out,'month','units','month number')
% Write global attributes (again, not strictly necessary, but very useful);
% using the '/' name for global
ncwriteatt(fn_out,'/','variable','tas')
ncwriteatt(fn_out,'/','variable_long','near-surface air temperature')
ncwriteatt(fn_out,'/','source','Berkeley Earth Surface Temperature Project')
ncwriteatt(fn_out,'/','origin_script','preprocess_best.m')
ncwriteatt(fn_out,'/','calendar','gregorian')