Supplementary material 2/6

This notebook serves as supplementary material to the manuscript on low-level jets submitted to Wind Energy Science. The supplementary material consists of 6 notebooks:

  1. A very quick overview of the measurement data from the LiDARs.
  2. Overview of the ERA5 data and especially the procedure of aligning it with the observations.
  3. Some error statistics on wind speed from ERA5 and observations
  4. Main results on low-level jets
  5. Reconstruction of seasonal cycle using a machine learning package
  6. Spatial figures from ERA5 (data-intensive).

In this notebook (2/6), we provide an overview of the ERA5 data.

Matching ERA5 data to the observations

In the previous notebook, we've loaded the observation data. In this notebook, we will create a matching version of the ERA5 data. We do this in several steps:

  1. Extract only the platform locations from the ERA5 dataset, but keep the lower 500 m and keep the entire time range
  2. Create a second set of data, in which only the data at observation heights are retained.
  3. Create subsets of the two datasets above, matching the time stamps of the observation data.

After loading the data, we make some standard visualizations to see whether the data looks OK

In [1]:
# First load some packages

# Xarray is our netCDF working horse
import xarray as xr
import numpy as np

# Import own code to read ECN and ERA5 data (some functions from the SI notebooks are refactored to a code base)
import sys
sys.path.append('../../code/datasets')
import ecn
import era5

# and similar for IFS utilities and interpolation routine
sys.path.append('../../code/utils')
import ifs
import interpolate

# and a script to plot a wind rose
sys.path.append('../../code/visualization')
import windrose

# Load plotting library
import matplotlib.pyplot as plt
plt.xkcd()
Out[1]:
<matplotlib.rc_context at 0x2b6997610048>
In [3]:
# Load observation data to align the ERA5 data to.
obs = ecn.load_platform_data()

# This is a dictionary consisting of platform names and xarray objects of the corresponding netcdf files.
obs
BWF1
BWF2
EPL
HKNA
HKNB
HKZA
HKZB
K13
LEG
MMIJ
Out[3]:
{'BWF1': <xarray.Dataset>
 Dimensions:    (height: 10, time: 90250)
 Coordinates:
   * time       (time) datetime64[ns] 2015-06-11T18:20:00 2015-06-11T18:30:00 ...
   * height     (height) int32 30 40 60 80 100 120 140 160 180 200
 Data variables:
     wspd       (height, time) float64 11.66 11.25 11.31 11.43 11.37 10.31 ...
     wdir       (height, time) float64 41.72 42.42 46.64 43.48 43.48 42.07 ...
     latitude   float64 51.71
     longitude  float64 3.035
 Attributes:
     creation_date:                   09-Oct-2018 14:37:53
     Measurement Location:            Borssele Wind Farm Zone -- Lot 1
     Measurement Period:              2015-06-11 18:20 UTC --2017-02-27 11:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'BWF2': <xarray.Dataset>
 Dimensions:    (height: 10, time: 20996)
 Coordinates:
   * time       (time) datetime64[ns] 2016-02-12T16:20:00 2016-02-12T16:30:00 ...
   * height     (height) int32 30 40 60 80 100 120 140 160 180 200
 Data variables:
     wspd       (height, time) float64 7.91 7.852 8.203 7.969 8.086 8.203 ...
     wdir       (height, time) float64 67.5 66.45 68.55 66.09 69.61 67.15 ...
     latitude   float64 51.65
     longitude  float64 2.951
 Attributes:
     creation_date:                   09-Oct-2018 14:37:54
     Measurement Location:            Borssele Wind Farm Zone -- Lot 2
     Measurement Period:              2016-02-12 16:20 UTC --2016-07-07 11:30 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'EPL': <xarray.Dataset>
 Dimensions:    (height: 10, time: 83664)
 Coordinates:
   * time       (time) datetime64[ns] 2016-05-30 2016-05-30T00:10:00 ...
   * height     (height) int32 63 91 116 141 166 191 216 241 266 291
 Data variables:
     wspd       (height, time) float64 17.3 17.36 18.27 18.27 18.07 18.0 18.1 ...
     wdir       (height, time) float64 311.4 309.9 308.6 309.4 308.5 310.7 ...
     latitude   float64 52.0
     longitude  float64 3.276
 Attributes:
     creation_date:                   09-Oct-2018 14:38:13
     Measurement Location:            Europlatform
     Measurement Period:              2016-05-30 00:00 UTC --2017-12-31 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'HKNA': <xarray.Dataset>
 Dimensions:    (height: 10, time: 29520)
 Coordinates:
   * time       (time) datetime64[ns] 2017-04-10 2017-04-10T00:10:00 ...
   * height     (height) int32 30 40 60 80 100 120 140 160 180 200
 Data variables:
     wspd       (height, time) float64 9.023 9.434 9.727 9.258 9.258 8.789 ...
     wdir       (height, time) float64 299.2 293.2 296.0 294.6 298.5 295.7 ...
     latitude   float64 52.69
     longitude  float64 4.242
 Attributes:
     creation_date:                   09-Oct-2018 14:38:29
     Measurement Location:            Hollandse Kust Noord -- Site A
     Measurement Period:              2017-04-10 00:00 UTC --2017-10-31 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'HKNB': <xarray.Dataset>
 Dimensions:    (height: 10, time: 29520)
 Coordinates:
   * time       (time) datetime64[ns] 2017-04-10 2017-04-10T00:10:00 ...
   * height     (height) int32 30 40 60 80 100 120 140 160 180 200
 Data variables:
     wspd       (height, time) float64 9.434 9.258 9.434 8.73 9.199 9.141 ...
     wdir       (height, time) float64 296.0 301.3 302.3 295.3 298.1 297.4 ...
     latitude   float64 52.68
     longitude  float64 4.242
 Attributes:
     creation_date:                   09-Oct-2018 14:38:29
     Measurement Location:            Hollandse Kust Noord -- Site B
     Measurement Period:              2017-04-10 00:00 UTC --2017-10-31 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'HKZA': <xarray.Dataset>
 Dimensions:    (height: 10, time: 91296)
 Coordinates:
   * time       (time) datetime64[ns] 2016-06-05 2016-06-05T00:10:00 ...
   * height     (height) int32 30 40 60 80 100 120 140 160 180 200
 Data variables:
     wspd       (height, time) float64 8.965 8.789 nan nan nan nan nan nan ...
     wdir       (height, time) float64 14.77 19.69 nan nan nan nan nan nan ...
     latitude   float64 52.31
     longitude  float64 4.009
 Attributes:
     creation_date:                   09-Oct-2018 14:39:13
     Measurement Location:            Hollandse Kust Zuid -- Site A
     Measurement Period:              2016-06-05 00:00 UTC --2018-02-28 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'HKZB': <xarray.Dataset>
 Dimensions:    (height: 10, time: 91296)
 Coordinates:
   * time       (time) datetime64[ns] 2016-06-05 2016-06-05T00:10:00 ...
   * height     (height) int32 30 40 60 80 100 120 140 160 180 200
 Data variables:
     wspd       (height, time) float64 nan nan nan nan nan nan nan nan nan ...
     wdir       (height, time) float64 nan nan nan nan nan nan nan nan nan ...
     latitude   float64 52.29
     longitude  float64 4.009
 Attributes:
     creation_date:                   09-Oct-2018 14:39:13
     Measurement Location:            Hollandse Kust Zuid -- Site B
     Measurement Period:              2016-06-05 00:00 UTC --2018-02-28 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'K13': <xarray.Dataset>
 Dimensions:    (height: 10, time: 74304)
 Coordinates:
   * time       (time) datetime64[ns] 2016-11-01 2016-11-01T00:10:00 ...
   * height     (height) int32 63 91 116 141 166 191 216 241 266 291
 Data variables:
     wspd       (height, time) float64 nan nan nan nan nan nan nan nan nan ...
     wdir       (height, time) float64 nan nan nan nan nan nan nan nan nan ...
     latitude   float64 53.22
     longitude  float64 3.219
 Attributes:
     creation_date:                   09-Oct-2018 14:39:36
     Measurement Location:            K13a
     Measurement Period:              2016-11-01 00:00 UTC --2018-03-31 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300
     Wind Direction Instrument Info:  LiDAR - Zephir 300,
 'LEG': <xarray.Dataset>
 Dimensions:    (height: 10, time: 164232)
 Coordinates:
   * time       (time) datetime64[ns] 2014-11-17T12:00:00 2014-11-17T12:10:00 ...
   * height     (height) int32 63 91 116 141 166 191 216 241 266 291
 Data variables:
     wspd       (height, time) float64 nan nan nan nan nan 7.87 nan 8.02 7.0 ...
     wdir       (height, time) float64 nan nan nan nan nan 115.9 nan 127.7 ...
     latitude   float64 51.92
     longitude  float64 3.667
 Attributes:
     creation_date:                   09-Oct-2018 14:40:13
     Measurement Location:            LEG
     Measurement Period:              2014-11-17 12:00 UTC --2017-12-31 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - WINDCUBE v2
     Wind Direction Instrument Info:  LiDAR - WINDCUBE 300,
 'MMIJ': <xarray.Dataset>
 Dimensions:    (height: 12, time: 228960)
 Coordinates:
   * time       (time) datetime64[ns] 2011-11-02 2011-11-02T00:10:00 ...
   * height     (height) int32 27 58 90 115 140 165 190 215 240 265 290 315
 Data variables:
     wspd       (height, time) float64 nan nan nan nan nan nan nan nan nan ...
     wdir       (height, time) float64 230.6 234.7 231.4 229.0 235.6 236.8 ...
     latitude   float64 52.85
     longitude  float64 3.436
 Attributes:
     creation_date:                   09-Oct-2018 14:49:20
     Measurement Location:            IJmuiden
     Measurement Period:              2011-11-02 00:00 UTC --2016-03-09 23:50 UTC
     Wind Speed Instrument Info:      LiDAR - Zephir 300 | Mast - Thies First ...
     Wind Direction Instrument Info:  LiDAR - Zephir 300 | Mast - Thies First ...}

Step 1: Extract platform data

The ERA5 data spans 10 years (2008-2017) and a domain covering most of the Southern North Sea and surroundings. It takes 58 GB of disk space. They are stored one file per month.

In this step, we extract the wind speed data only for those coordinates that match our platform data. We then concatenate all monthly files, to get one file per platform. A separate routine is available to load the extracted platform data into memory. This is much more memory efficient, and the additional disk space is small (up to 10 MB per platform).

The code below loads the platform data into memory. If from_source is set to 'python' or 'nco', it will reproduce the intermediate files first.

Note that this step is quite memory intensive and should be run anywhere close to where the ERA5 data are stored.

The function for extracting the platform data is copied to the codebase in code/datasets/era5.py.

In [4]:
def extract_subset(lon, lat, variables, infile_path, outfile, levtype='ml', engine='nco'):
    """ Combine many large-domain/small-timespan ERA5 data files into a single file for a specific location.

    Description
    -----------
    I have implemented two routines to perform this step. The first is based on python/xarray. 
    However, this is quite slow and some spurious NaN's are introduced in the intermediate files. 
    Therefore, I wrote a second routine which calls NCO from the command line. 
    Basically, it does the following:

        module load nco
        for all yearmonths:
            ncks -v t,q -d longitude,3.436, -d latitude,52.85 era5_yearmonth_ml.nc tmp2.nc --mk_rec_dmn time

        ncpdq to unpack to other temp file
        ncrcat tmp???.nc output.nc
        rm tmp*

    It still quite slow, but at least the NaNs are gone. It is worthwhile to note that NCO raised a 
    warning that it doesn't account for different scaling parameters between the input files for concatenation. 
    Therefore, an intermediate step was added in which the variables were unpacked before concatenation, 
    and re-packed afterwards.

    An important difference between the python and nco routines is that for the latter, the processed 
    files retain their original dimensions, thus lat and lon are still coordinates/dimensions, even 
    though they contain only one value.

    Arguments
    ---------
    lon: float; longitude
    lat: float; latitude
    variables: list or string; variables to extract
        if engine == xarray, variables should be a list of variables, e.g. ['u','v']
        if egine == nco, variables should be a comma-separted string of variables, e.g. 'u,v'
    infile_path: location where the source files are stored, e.g. 'data/raw/'
    outfile: location where the intermediate file should be saved, e.g. 'data/interim/mmij.nc'
    levtype: string; 'ml', or 'sfc', as I used this suffix in the source files to distinguish them
    engine: string; 'nco' or 'xarray', which method to use to create the extracts.

    Updates
    -------
    - Merged nco and xarray routines; added 'engine' argument
    - Added option to specify levtype, to operate both on ml and sfc data
    - Removed option to subset levels for generality between levtypes and engines; do this on load instead

    Test
    ----
    Using two different methods to do the exact same thing gives a feel for the accurracy of the resulting 
    dataset. I tested this on the 100m u-wind component of the surface data, and found that the methods 
    yield the same results up to a precision of 0.001 m/s. (See test).

    # Test whether the two methods are consistent
    inpath = '../../data/external/ECMWF/ERA5/'
    outfile = '../../data/interim/ERA5platforms/testx.nc'
    era5.extract_subset(lon=3.5, lat=52.8, variables='u100,v100', infile_path=inpath, outfile=outfile, levtype='sfc', engine='nco')
    outfile = '../../data/interim/ERA5platforms/testy.nc'
    era5.extract_subset(lon=3.5, lat=52.8, variables=['u100','v100'], infile_path=inpath, outfile=outfile, levtype='sfc', engine='xarray')
    ds1 = xr.open_dataset('../../data/interim/ERA5platforms/testx.nc')
    ds2 = xr.open_dataset('../../data/interim/ERA5platforms/testy.nc')
    print(np.where(np.abs(ds1.u100-ds2.u100)>0.0008))
    print(np.abs(ds1.u100-ds2.u100).mean())
    """
    import xarray as xr
    import subprocess as sp    
    
    if engine=='xarray':
        with xr.open_mfdataset(infile_path+'era5_*_%s.nc'%levtype) as ds:
            subset = ds.sel(longitude=lon, latitude=lat, method='nearest')[variables]
            subset.to_netcdf(outfile)
    elif engine=='nco':
        # Add NCO to path (somehow python doesn't recognize module load nco)
        ncks = '/hpc/sw/nco-4.6.0-intel/bin/ncks '
        ncrcat = '/hpc/sw/nco-4.6.0-intel/bin/ncrcat '
        ncpdq = '/hpc/sw/nco-4.6.0-intel/bin/ncpdq '

        options = '-v %s -d longitude,%f -d latitude,%f -O --mk_rec_dmn time '%(variables, lon, lat)
        for year in range(2008, 2018):
            print(year)
            for month in range(1, 13):
                infile = infile_path+'era5_%04d%02d_%s.nc '%(year, month, levtype)
                tmp = 'tmp.nc '
                tmpmonth = 'tmpmonth%02d.nc'%month
                sp.run(ncks+options+infile+tmp, shell=True, check=True)
                # Unpack the temporary file before it gets concatenated
                sp.run(ncpdq+'-O -U '+tmp+tmpmonth, shell=True, check=True)
                sp.run('rm tmp.nc', shell=True, check=True)
            sp.run(ncrcat+'-O tmpmonth??.nc tmpyear%04d.nc'%year, shell=True, check=True)
            sp.run('rm tmpmonth*', shell=True, check=True)
        sp.run(ncrcat+'-O tmpyear????.nc tmp.nc', shell=True, check=True)
        sp.run('rm tmpyear*', shell=True, check=True)
        # Pack the final dataset
        sp.run(ncpdq+'-O tmp.nc '+outfile, shell=True, check=True)
        sp.run('rm tmp.nc', shell=True, check=True)

    return
In [5]:
# This actually calls the above function, 
# extracts the data for each platforms to intermediate files, 
# and then loads these small intermediate files into memory 

era = {}
for name, ds in obs.items():
    # Because this takes time, it is useful to test it only for one platform:
#     if name!='BWF2':
#         continue
    lon = ds.longitude.values
    lat = ds.latitude.values
    print(name, lon, lat)
    
    # In case the dataset is prepared with NCO, we need to squeeze the lat and lon dimensions
    era[name]= era5.read_platform_data(lon, lat, name, from_source=False).squeeze()
BWF1 3.034638888888889 51.70688888888889
BWF2 2.951416666666667 51.64630555555556
EPL 3.276 51.999
HKNA 4.242020474993242 52.68869454787769
HKNB 4.241912343238767 52.683319381733675
HKZA 4.008972222222222 52.30658333333333
HKZB 4.008583333333333 52.28908333333333
K13 3.218611111111111 53.2175
LEG 3.6670277777777778 51.91711111111111
MMIJ 3.435566666666667 52.84816666666667
In [8]:
# Data are loaded into a dictionary, similarly to the observations
# Example platform file:
era['MMIJ']
Out[8]:
<xarray.Dataset>
Dimensions:    (level: 14, time: 87672)
Coordinates:
    latitude   float32 52.8
  * level      (level) int32 137 136 135 134 133 132 131 130 129 128 127 126 ...
    longitude  float32 3.3
  * time       (time) datetime64[ns] 2008-01-01 2008-01-01T01:00:00 ...
Data variables:
    u          (time, level) float32 ...
    v          (time, level) float32 ...
Attributes:
    Conventions:               CF-1.6
    history:                   Mon Oct  1 10:41:10 2018: /hpc/sw/nco-4.6.0-in...
    NCO:                       "4.6.0"
    nco_openmp_thread_number:  1

Step 2: only retain observation heights

The observation data reach up to about 300 m, and the exact elevations of each record differs per platform. The ERA5 data, on the other hand, come at model levels, and we downloaded the lower 500 m.

In [9]:
# These are approximate heights of the model levels:
z = ifs.get_reference_levels(era['MMIJ'].level.values)
z
Out[9]:
array([ 10.  ,  30.96,  53.92,  79.04, 106.54, 136.62, 169.51, 205.44,
       244.69, 287.52, 334.24, 385.16, 440.61, 500.95])

Time-dependency of model level elevation

However, there are small deviations in the model level heights, dependent on the atmospheric circumstances (i.e. warmer air is 'thicker' than colder air). Therefore, we first reconstruct the true, time-dependent model level elevations.

In [10]:
# First, we need to extract the temperature, humidity and surface- and mean sea level perssure 
# These are stored in intermediate files and then loaded into memory, similar to the ERA5 wind data

from_source = False #'nco'

# Read q and t from model level files and sp and msl from surface data
qt = {}
spmsl = {}
for name, ds in obs.items():
    lon = ds.longitude.values
    lat = ds.latitude.values
    print(name)
    
    inpath = '../../data/external/ECMWF/ERA5/'
    outfile = '../../data/interim/ERA5platforms/'+name+'_qt.nc'

    # Extract location data to intermediate file
    if from_source=='nco':
        print('Extracting from source using NCO routine')
        era5.extract_subset(lon=lon, lat=lat, variables='q,t', infile_path=inpath, outfile=outfile, levtype='ml', engine='nco')

    # Load intermediate file into memory    
    qt[name]=xr.open_dataset(outfile).sel(level=slice(None, 124, -1)).squeeze()
    
    # Now for sfc variables
    outfile = '../../data/interim/ERA5platforms/'+name+'_spmsl.nc'
    # Extract location data to intermediate file
    if from_source=='nco':
        print('Extracting from source using NCO routine')
        era5.extract_subset(lon=lon, lat=lat, variables='sp,msl', infile_path=inpath, outfile=outfile, levtype='sfc', engine='nco')
    
    # Load intermediate file into memory    
    spmsl[name]=xr.open_dataset(outfile).squeeze()      
BWF1
BWF2
EPL
HKNA
HKNB
HKZA
HKZB
K13
LEG
MMIJ
In [11]:
# This procedure is based on the IFS documentation. Basically, it departs from the surface geopotential 
# and then adds the geopotential thickness of each layer based on its temperature and humidity. 
# My code for this is stored in the `ifs.py` and can be provided on request.

# Get reference profile elevations
levs = qt['MMIJ'].level.values
zref = ifs.get_reference_levels(levs)

for name, ml in qt.items():
    print(name)
    sfc = spmsl[name]

    # Compute nd real (geopotential) heights
    ph, pf = ifs.pressure_at_model_levels(levs, sfc.sp, z_axis=1)
    dp, dlnp, upper, alpha = ifs.pressure_funcs(ph, levs, z_axis=1)
    Tv = ifs.virtual_temperature(ml.t, ml.q)
    Z0 = ifs.surface_geopotential_hypso(sfc.sp, sfc.msl, Tv.sel(level=137))
    Z = ifs.geopotential_at_model_levels(dlnp, Tv.values, Z0.values, alpha, levs, z_axis=1, t_axis=0)
    agl = (Z-Z0.values)/9.81
    
    # Add z_nd and z_ref to each dataframe for easy reference
    era[name]['z_nd'] = xr.DataArray(agl, coords={'time': ml.time, 'level': ml.level}, dims=['time', 'level'])
    era[name]['z_ref'] = xr.DataArray(zref, coords={'level': ml.level}, dims=['level'])
    
    # We store the elevations in separate netcdf files to prevent overwriting the NCO-produced intermediate files
    era[name][['z_nd', 'z_ref']].to_netcdf('../../data/interim/ERA5platforms/'+name+'_z.nc')
BWF1
BWF2
EPL
HKNA
HKNB
HKZA
HKZB
K13
LEG
MMIJ
In [13]:
# See how the model files now include 1d and 2d height arrays
era['MMIJ']
Out[13]:
<xarray.Dataset>
Dimensions:    (level: 14, time: 87672)
Coordinates:
    latitude   float32 52.8
  * level      (level) int32 137 136 135 134 133 132 131 130 129 128 127 126 ...
    longitude  float32 3.3
  * time       (time) datetime64[ns] 2008-01-01 2008-01-01T01:00:00 ...
Data variables:
    u          (time, level) float32 ...
    v          (time, level) float32 ...
    z_nd       (time, level) float64 9.755 30.2 52.59 77.12 104.0 133.4 ...
    z_ref      (level) float64 10.0 30.96 53.92 79.04 106.5 136.6 169.5 ...
Attributes:
    Conventions:               CF-1.6
    history:                   Mon Oct  1 10:41:10 2018: /hpc/sw/nco-4.6.0-in...
    NCO:                       "4.6.0"
    nco_openmp_thread_number:  1

We can justify all the work above by illustrating the discrepancy between the true, time-dependent elevations and the reference heights

In [14]:
fig, ax = plt.subplots()
for name, ds in era.items():
    ax.plot(ds.z_ref, (ds.z_nd-ds.z_ref).mean(axis=0), label=name)
    
ax.set_xlabel('reference height (m)')
ax.set_ylabel('discrepancy (m)')
ax.legend()
Out[14]:
<matplotlib.legend.Legend at 0x2b69c8b7ecc0>

Alignment of ERA5 and observations

I use a routine that implements a cubic spline interpolator to obtain the ERA5 wind speeds at the exact observation levels of each platform. This routine is available on request. It also implements linear interpolation, and I checked whether the difference is small. An illustration of the results is shown below.

For those wondering why I don't use something like logarithmic interpolation: The logarithmic wind profile or the wind profile power law are based on a number of assumptions, e.g. they are only valid in the surface layer and for idealized (neutral) wind profiles. I'm interested in conditions that violate these assumptions (low-level jets) and hence these methods are not appropriate.

Over much deeper atmospheric layers, it might be more appropriate to use logarithmic interpolation because pressure increases logarithmically with height and one could argue that pressure is a more suitable coordinate, but over the small distances used here the pressure differences are so small that this is negligible.

In [27]:
z = ifs.get_reference_levels(era['MMIJ'].level.values)

era_intp = {}
for name, ds_obs in obs.items():
    print(name)
    ds_era = era[name]
    z_obs = ds_obs.height
    z_era = ds_era.z_nd
    
    # interpolate u:
    u = ds_era.u.values
    u_interp = interpolate.along_axis(u, z_era, z_obs, axis=1, method='cubic')
    da_u = xr.DataArray(u_interp, coords={'time': ds_era.time, 'height': z_obs}, dims=['time', 'height'])
    da_u.name = 'u'
    
    # interpolate v:
    v = ds_era.v.values
    v_interp = interpolate.along_axis(v, z_era, z_obs, axis=1, method='cubic')
    da_v = xr.DataArray(v_interp, coords={'time': ds_era.time, 'height': z_obs}, dims=['time', 'height'])
    da_v.name = 'v'
    
    # Merge u and v and add the aligned/interpolated dataset to a new dictionary
    era_intp[name] = xr.merge([da_u, da_v])
    
    # Save the interpolated data to intermediate netcdf files
    era_intp[name].to_netcdf('../../data/interim/ERA5platforms/'+name+'_interpolated.nc')
BWF1
BWF2
EPL
HKNA
HKNB
HKZA
HKZB
K13
LEG
MMIJ
In [18]:
# This is what the interpolated/aligned data looks like:
era_intp['MMIJ']
Out[18]:
<xarray.Dataset>
Dimensions:  (height: 12, time: 87672)
Coordinates:
  * time     (time) datetime64[ns] 2008-01-01 2008-01-01T01:00:00 ...
  * height   (height) int32 27 58 90 115 140 165 190 215 240 265 290 315
Data variables:
    u        (time, height) float64 -0.7127 -0.7002 -0.6498 -0.6265 -0.553 ...
    v        (time, height) float64 0.9755 1.042 1.053 1.053 1.069 1.111 ...
In [26]:
# Let's check 10 random (u-)wind profiles to see whether the interpolation worked correctly
fig, ax = plt.subplots()
sample = np.random.choice(era['MMIJ'].time, 10)
for t in sample:
    line, = ax.plot(era['MMIJ'].u.sel(time=t), era['MMIJ'].z_nd.sel(time=t))
    color = line.get_color()
    ax.plot(era_intp['MMIJ'].u.sel(time=t), era_intp['MMIJ'].height, 'o', c=color)
    
ax.legend(['Original ERA5 profile', 'Aligned with observations'])
plt.show()

Step 3: Only retain observation times

We use xarray's where to keep only those timestamps for which observations are available as well.

In [38]:
# This illustrates the subsetting
# We use the mean of the observed wind speed to collapse the height dimension 
# because otherwise xarray also tries to align the heights and we don't want that.
# The mean of the observations is only NaN if all heights are NaN,
# So a NaN at a single level does not lead to exclusion of that time step.
era['MMIJ'].where(np.isfinite(obs['MMIJ'].wspd.mean(dim='height')))
Out[38]:
<xarray.Dataset>
Dimensions:    (level: 14, time: 38160)
Coordinates:
  * time       (time) datetime64[ns] 2011-11-02 2011-11-02T01:00:00 ...
    latitude   float32 52.8
  * level      (level) int32 137 136 135 134 133 132 131 130 129 128 127 126 ...
    longitude  float32 3.3
Data variables:
    u          (time, level) float32 2.6106873 2.7238722 2.8007941 2.8876061 ...
    v          (time, level) float32 1.0047754 1.0252842 1.040096 1.0674411 ...
    z_nd       (time, level) float64 9.984 30.9 53.8 78.87 106.3 136.3 169.0 ...
    z_ref      (level, time) float64 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 ...
Attributes:
    Conventions:               CF-1.6
    history:                   Mon Oct  1 10:41:10 2018: /hpc/sw/nco-4.6.0-in...
    NCO:                       "4.6.0"
    nco_openmp_thread_number:  1
In [33]:
# Apply this to all data, save to new dictionaries

era_subs = {} # for ERA5 subsets up to 500 m
era_intp_subs = {} # for ERA5 subsets only at observations heights

for name, ds_obs in obs.items():
    era_subs[name] = era[name].where(np.isfinite(ds_obs.wspd.mean(dim='height')))
    era_intp_subs[name] = era_intp[name].where(np.isfinite(ds_obs.wspd.mean(dim='height')))    

A quick analysis of the ERA5 data.

We now have 4 dictionaries to work with:

  • era - the full datasets for each platform
  • era_intp - the datasets for each platform, retaining only the observation heights
  • era_subs - the datasets for each platform up to 500 m, but only at observation times
  • era_intp_subs - the datasets for each platform, retaining only the observation heights and times.
In [44]:
fig, ax = plt.subplots()
for name, ds in era.items():
    wspd = (ds.u**2+ds.v**2)**.5
    z_mean = ds.z_nd.mean(dim='time')
    ax.plot(wspd.mean(dim='time').values, z_mean, 'o-', label=name)
    
ax.legend(bbox_to_anchor=(1.05, 1))
ax.set_xlabel('mean wind speed (m/s)')
ax.set_ylabel('altitude (m)')
ax.set_title('Average wind profiles')
plt.show()

Note: Identical grid cells

  • I think some platform locations are identical, as in, they fall within the same grid cell. The observations may still span different times, but let's check this to be sure.
In [45]:
# Loop trough all platforms
for name1, ds1 in era.items():
    # Cross compare with all other platforms
    for name2, ds2 in era.items():
        if name1==name2:
            continue
        if ds1.latitude==ds2.latitude and ds1.longitude==ds2.longitude:
            print(name1, 'and', name2, 'fall within the same grid cell')        
BWF1 and BWF2 fall within the same grid cell
BWF2 and BWF1 fall within the same grid cell
HKNA and HKNB fall within the same grid cell
HKNB and HKNA fall within the same grid cell
HKZA and HKZB fall within the same grid cell
HKZB and HKZA fall within the same grid cell

Indeed, some of the locations share the same grid cell. Since the observation periods may span different values - and to maintain the same format in the model data as in the observation data, I will just continue to work with separate files for these platforms, even if this is somehow redundant.

Diurnal cycle bias

In [47]:
fig, ax = plt.subplots()
for name, ds in era.items():
    wspd = (ds.u**2+ds.v**2)**.5
    hourly_means = wspd.mean(dim='level').groupby('time.hour').mean(dim='time').values
    ax.plot(range(24), hourly_means, label=name)
    
ax.legend(bbox_to_anchor=(1.05, 1))
ax.set_xlabel('Hour of day')
ax.set_ylabel('Wind speed (m/s)')
ax.set_title('Time/height averaged diurnal cycle')
# fig.savefig('diurnal_cycle_bias', dpi=300, bbox_inches='tight')
plt.show()

There's a weird discontinuity in the diurnal cycle. Someone at the EMS pointed me towards this issue. Closer inspection reveals:

  • The issue is either at 7, 8 and 9 UTC, or in the hours following 10 UTC, but comparison with obs should verify this (after subsetting, see below).
  • The issue is stronger for those platforms closer to the coast or more towards the South, which makes sense because the diurnal cycle is more pronounced over land.
  • The issue is happens for each individual year, so it's not caused by a large outlier.
  • The issue is present at each level (up to 500m), but seems larger at higher altitudes. This rules out a physical process.
  • The issue happens for every month. This rules out the possibility that its a diurnal/baroclinic effect, since that would show a seasonal cycle.
  • Comparison with observations (Notebook 3/6) reveals that the bias is larger for the daytime hours, so 10, 11, 12 UTC etc are suffering from the issue, not 7, 8 and 9 UTC. For the nighttime hours, ERA5 has no bias or a small underestimation of up to 0.5 m/s (depending on the location), but after 10 UTC the bias suddenly increases to an underestimation of 0.5 to 1.0 m/s.

Cause

The error is related to the 4D-VAR analysis window. From the IFS documentation for the ERA5 model version, section 2.2.2:

In the CY37R2 operational configurations the assimilation window is 12-hours long, running from 09–21 UTC to produce the 12 UTC analysis and forecast products, and from 21–09UTC for the 00 UTC production (Haseler, 2004).

So, only two time windows are used to derive all hourly analysis fields. Thus, the analysis at 9 UTC is derived from 4D-VAR with observations between 21-9, whereas the analysis at 10 UTC is derived from 4D-VAR with observations between 9-21. That could explain the discontinuity. Apparently, the performance is better for the nighttime hours than for the daytime hours. Perhaps because the boundary layer height is lower during the night, so the effect of data-assimilation may be stronger. I suppose it would be cleaner to shift the time window for each analysis time, but this is probably too computationally demanding.

Proposed solution

ECMWF has confirmed this problem and the data-assimilation strategy, but they did not actually agree that this is causing the error. They recommend working with the forecast files instead.

In a separate notebook, I've checked whether the forecast data does not suffer from the discontinuity in the diurnal cycle. I found that the overall bias is smaller in the forecast data around noon, but there is also a discontinuity (now at 12, because then a new forecast is started). Moreover, the smaller bias comes at the cost of larger standard deviation of the error (and larger overall RMSE) and also the STDE increases with lead time, especially in the afternoon.

Based on these cons, plus the notion that the bias is actually very systematic, and small compared to day-to-day variations and overall errors, I think it is cleaner to proceed working with the analysis data. But we should probably stress this issue, as it is very relevant for all those people who will be working with ERA5 in the future (and based on the success of ERA-interim and the amount of projects and presentations I've seen already on ERA5, there are many). A bias correction should not be too difficult, but it will not affect the low-level jets, since it will change only the magnitude of the wind speed profile, and not the shape.

Seasonal cycle

In [48]:
fig, ax = plt.subplots()
for name, ds in era.items():
    wspd = (ds.u**2+ds.v**2)**.5
    monthly_means = wspd.mean(dim='level').groupby('time.month').mean(dim='time').values
    ax.plot(range(1,13), monthly_means, label=name)
    
ax.legend(bbox_to_anchor=(1.05, 1))
ax.set_xlabel('Month of year')
ax.set_ylabel('Wind speed (m/s)')
ax.set_title('Time/height averaged seasonal cycle')
plt.show()

Notes

  • There is a clear seasonal cycle, as expected.
  • There seems to be a discontinuity in may, and another small one in october.
  • The discontinuity in May could be a baroclinic effect?
  • October seems to be an outlier - could it be that the time period (10) years is still too short?

Let's plot the standard deviation to check this for MMIJ

In [49]:
fig, ax = plt.subplots()
for name, ds in era.items():
    if not name=='MMIJ':
        continue
    wspd = (ds.u**2+ds.v**2)**.5
    monthly_means = wspd.mean(dim='level').groupby('time.month').mean(dim='time').values
    std = wspd.mean(dim='level').groupby('time.month').std(dim='time').values
    l, = ax.plot(range(1,13), monthly_means, label=name)
    c = l.get_color()
    ax.fill_between(range(1,13), monthly_means+std, monthly_means-std, color=c, alpha=0.1)
    
ax.legend(bbox_to_anchor=(1.05, 1))
ax.set_xlabel('Month of year')
ax.set_ylabel('Wind speed (m/s)')
ax.set_title('Time/height averaged seasonal cycle')
plt.show()

The small discontinuity is dwarfed by the interannual variability, so this is probably just a coincidence.

Barcode plots

To check the subsets

In [66]:
ds.dropna(dim='time').z_nd
Out[66]:
<xarray.DataArray 'z_nd' (time: 7445, level: 14)>
array([[ 10.113317,  31.322921,  54.583827, ..., 393.657536, 450.572258,
        512.493907],
       [ 10.118554,  31.337744,  54.604865, ..., 393.687634, 450.626582,
        512.577863],
       [ 10.123907,  31.354996,  54.63568 , ..., 393.776451, 450.75751 ,
        512.758969],
       ...,
       [  9.712642,  30.065643,  52.353994, ..., 373.681504, 427.37361 ,
        485.765113],
       [  9.715239,  30.072963,  52.366688, ..., 373.835184, 427.555404,
        485.978728],
       [  9.809314,  30.370794,  52.885312, ..., 376.488652, 430.500229,
        489.217366]])
Coordinates:
  * time       (time) datetime64[ns] 2015-06-11T19:00:00 2015-06-11T20:00:00 ...
    latitude   float32 51.6
  * level      (level) int32 137 136 135 134 133 132 131 130 129 128 127 126 ...
    longitude  float32 3.0
In [71]:
fig, axs = plt.subplots(11,1, figsize=(12, 16), sharex=True, sharey=True)
axs = axs.flatten()
for i, (name, ds) in enumerate(era_intp_subs.items()):
    wspd = (ds.u**2+ds.v**2)**.5
    mesh = axs[i].pcolormesh(ds.time.values, ds.height.values, wspd.T, vmin=0, vmax=30)
    axs[i].set_title(name)
    
plt.tight_layout()

axs[-2].xaxis.set_tick_params(labelbottom=True)
cax = fig.add_axes(axs[-1].get_position())
axs[-1].set_axis_off()
plt.colorbar(mesh, cax=cax, orientation='horizontal', label='wind speed (m/s)')
plt.show()

Wind roses

A quick visualization of the mean wind climate at each site.

In [72]:
# Note: Patheffects
# xkcd() plots a thick white stroke around everything it draws.
# However, this is not always desirable. For the wind rose plot
# it makes the bars very difficult to read and it hides the grid lines.
# Therefore, either restore the default rc settings with plt.rcdefaults()
# Or, adjust the stroke width using this trick:
# https://github.com/matplotlib/matplotlib/issues/3843#issuecomment-64876925
from matplotlib import patheffects
plt.xkcd()
plt.rcParams['path.effects'] = [patheffects.withStroke(linewidth=1)]
fig, axs = windrose.subplots(3, 4, figsize=(16, 12))
axs = axs.flatten()

# The last two axes will not be populated
axs[-1].set_axis_off()
axs[-2].set_axis_off()

# Plot the wind roses
for i, (name, ds) in enumerate(era.items()):
    u = ds.u.sel(level=133).values
    v = ds.v.sel(level=133).values
    counts, speeds, bars, cmapper = windrose.plot(u, v, ax=axs[i], title=name)

# Adjust subplot spacing
plt.tight_layout()

# Improvised title
axs[-2].text(0, 0, 'Wind roses at ~100m for 2008-2017 period')

# To plot a colorbar, the mapper must be associated with an array
cmapper.set_array([]) 

# Draw colorbar
plt.colorbar(cmapper, ax=axs[-2:], orientation='horizontal', label='bin median wind speed (m/s)')

# Save
# fig.savefig('wind_roses_full', dpi=300, bbox_inches='tight')

# Restore the xkcd patheffects for following plots
plt.xkcd()
Out[72]:
<matplotlib.rc_context at 0x2b69c9ef4e80>

Note how the wind roses are impacted by the time/range/wake limitations of the observations:

In [80]:
plt.rcParams['path.effects'] = [patheffects.withStroke(linewidth=1)]
fig, axs = windrose.subplots(3, 4, figsize=(16, 12))
axs = axs.flatten()

# The last two axes will not be populated
axs[-1].set_axis_off()
axs[-2].set_axis_off()

# Plot the wind roses
for i, (name, ds) in enumerate(era_intp_subs.items()):
    u = ds.u.sel(height=100, method='nearest').values
    v = ds.v.sel(height=100, method='nearest').values
    counts, speeds, bars, cmapper = windrose.plot(u, v, ax=axs[i], title=name)

# Adjust subplot spacing
plt.tight_layout()

# Improvised title
axs[-2].text(0, 0, 'Wind roses at ~100m for ERA5 subsets aligned with measurements')

# To plot a colorbar, the mapper must be associated with an array
cmapper.set_array([]) 

# Draw colorbar
plt.colorbar(cmapper, ax=axs[-2:], orientation='horizontal', label='bin median wind speed (m/s)')

# Save
# fig.savefig('wind_roses_full', dpi=300, bbox_inches='tight')

# Restore the xkcd patheffects for following plots
plt.xkcd()
/home/peter919/miniconda3/envs/thesis/lib/python3.7/site-packages/matplotlib/colors.py:496: RuntimeWarning: invalid value encountered in less
  xa[xa < 0] = -1
Out[80]:
<matplotlib.rc_context at 0x2b69ca584f98>

Notes

Especially Borssele is influenced by nearby windfarms, and the data from SW is filtered out. This influences the climatology at these sites and should be kept in mind during the analysis.