Hands-On Exercise, Step 1: Preparing the Weather Data

Downloading the data

In our example, we will use Berkeley Earth (BEST) data. BEST data is available as national timeseries (area-weighted, so not usually useful even for studying national-level data) and as a 1-degree grid.

  1. Go to the BEST archive, http://berkeleyearth.org/archive/data/

  2. Scroll down to the Gridded Data (ignore the Time Series Data which is listed first).

  3. Under Gridded Data, find “Daily Land (Experimental; 1880 – Recent)”.

  4. Download the Average Temperature (TAVG) data in 1-degree gridded form for 1980 - 1989.

  5. Place this file (Complete_TAVG_Daily_LatLong1_1980.nc) in the data/climate_data folder.

As a first pre-processing step, we will clip this file to the United States.

Loading the data

First, let’s load all of the data into memory. This code assumes that it (the code) is stored in a file (call it preprocess_best.m or preprocess_best.py) in a directory code/, a sister to the data/ directory.


nc <- nc_open("../data/climate_data/Complete_TAVG_Daily_LatLong1_1980.nc")

time <- seq(as.Date("1980-01-01"), length.out=nc$dim$time$len, by="1 day")
import numpy as np
import xarray as xr
import datetime as dt

ds = xr.open_dataset('../data/climate_data/Complete_TAVG_Daily_LatLong1_1980.nc')

# Create time variable, which wasn't auto-generated from the netcdf 
# due to BEST's ambiguous timing
ds['time'] = (
data_dir = '../data/climate_data/';
filename = 'Complete_TAVG_Daily_LatLong1_1980.nc';
clim_tmp = ncread([data_dir,filename],'climatology');
anom_tmp = ncread([data_dir,filename],'temperature');
doy_tmp = ncread([data_dir,filename],'day_of_year');

% Also loading the "months" variable - most datasets don't have it, but
% it will make more sophisticated projecting methods easier. This is
% particularly useful because of the gregorian calendar (which includes
% leap days, and therefore would otherwise add an extra step to
% determining which month each datastep is)
months = ncread([data_dir,filename],'month');

Constructing temperature levels

The variable temperature in BEST is actually the temperature anomaly; the actual temperature is formed by adding it to the climatology variable. Unfortunately, the climatology variable only accounts for days 1:365 of the year, and ignores leap days (which the temperature variable does not). This section doubles the climatology for Feb 28th to also work on Feb 29th, and creates a tas variable that’s the climatology + temperature.

doy <- ncvar_get(nc, 'day_of_year')
tas.anom <- ncvar_get(nc, 'temperature')
tas.clim <- ncvar_get(nc, 'climatology')

tas.clim.all <- tas.clim[,, doy]
tas <- tas.anom + tas.clim.all
import calendar

# Expand climatology to span all days
clim_tmp = xr.DataArray(dims=('time','latitude','longitude'),

# Sub in variables one year at a time
for yr in np.unique(ds.time.dt.year):
    if calendar.isleap(yr):
        clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)&(clim_tmp.time.dt.dayofyear<=59)}] = ds.climatology.values[0:59]
        clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)&(clim_tmp.time.dt.dayofyear==60)}] = ds.climatology.values[59]
        clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)&(clim_tmp.time.dt.dayofyear>60)}] = ds.climatology.values[59:365]
        clim_tmp.loc[{'time':(clim_tmp.time.dt.year==yr)}] = ds.climatology.values

ds['climatology'] = clim_tmp

ds['tas'] = ds['temperature'] + ds['climatology']
ds = ds.drop(['temperature','climatology'])
% BEST data is given in terms of anomalies from the underlying
% climatology. So, the climatology has to be loaded first.
tas = anom_tmp + clim_tmp(:,:,doy_tmp); 

clear clim_tmp, anom_tmp, doy_tmp

Subset it geographically

We can drop a lot of the global data. This can also happen earlier in the process.

Using the extreme points of the continental United States (see e.g., here) + 1 degree of wiggle room.

longitude <- ncvar_get(nc, 'longitude')
latitude <- ncvar_get(nc, 'latitude')

latlims = c(23, 51)
lonlims <- c(-126,-65)

tas2 <- tas[longitude >= lonlims[1] & longitude <= lonlims[2], latitude >= latlims[1] & latitude <= latlims[2],]
geo_lims = {'lat':[23,51],'lon':[-126,-65]}

ds = ds.sel(latitude=slice(*geo_lims['lat']),longitude=slice(*geo_lims['lon'])).load()
lat = ncread([data_dir,filename],'latitude');
lon = ncread([data_dir,filename],'longitude');

%% Subset to continental United States
lon_idxs = ((lon>=geo_lims(1)) & (lon<=geo_lims(2)));
lat_idxs = ((lat>=geo_lims(3)) & (lat<=geo_lims(4)));

tas = tas(lon_idxs,lat_idxs,:);
lon = lon(lon_idxs);
lat = lat(lat_idxs);

Save the result

We will write out the clipped, concatenated data to a single file for future processing. We make some changes in the process to conform to the standards used in CMIP5 datasets.

dimlon <- ncdim_def("lon", "degrees_east", longitude[longitude >= lonlims[1] & longitude <= lonlims[2]], longname='longitude')
dimlat <- ncdim_def("lat", "degrees_north", latitude[latitude >= latlims[1] & latitude <= latlims[2]], longname='latitude')
dimtime <- ncdim_def("time", "days since 1980-01-01 00:00:00", as.numeric(time - as.Date("1980-01-01")),
                     unlim=T, calendar="proleptic_gregorian")
vartas <- ncvar_def("tas", "C", list(dimlon, dimlat, dimtime), NA, longname="temperature")

ncnew <- nc_create("tas_day_BEST_historical_station_19800101-19891231.nc", vartas)
ncvar_put(ncnew, vartas, tas2)
output_fn = '../data/climate_data/tas_day_BEST_historical_station_19800101-19891231.nc'


# Rename to lat/lon to fit CMIP5 defaults
ds = ds.rename({'latitude':'lat','longitude':'lon'})

% Set output filename
fn_out = '../data/climate_data/tas_day_BEST_historical_station_19800101-19891231.nc';

% Write temperature data to netcdf 
% These attributes aren't strictly necessary, but very helpful to have in
% the file
ncwriteatt(fn_out,'tas','long_name','near-surface air temperature');

% Write lat/lon data to netcdf file

% Write time dimension to netcdf file
time_idxs = 0:(size(tas,3)-1);

ncwriteatt(fn_out,'time','units','days since 1980-01-01')

ncwriteatt(fn_out,'month','units','month number')

% Write global attributes (again, not strictly necessary, but very useful);
% using the '/' name for global
ncwriteatt(fn_out,'/','variable_long','near-surface air temperature')
ncwriteatt(fn_out,'/','source','Berkeley Earth Surface Temperature Project')