An intercomparison of mesoscale models at simple sites for wind energy applications

Understanding uncertainties in wind resource assessment associated with the use of the output from numerical weather prediction (NWP) models is important for wind energy applications. A better understanding of the sources of error reduces risk and lowers costs. Here, an intercomparison of the output from 25 NWP models is presented for three sites in northern Europe characterized by simple terrain. The models are evaluated using a number of statistical properties relevant to wind energy and verified with observations. On average the models have small wind speed biases offshore and aloft (< 4 %) and larger biases closer to the surface over land (> 7 %). A similar pattern is detected for the inter-model spread. Strongly stable and strongly unstable atmospheric stability conditions are associated with larger wind speed errors. Strong indications are found that using a grid spacing larger than 3 km decreases the accuracy of the models, but we found no evidence that using a grid spacing smaller than 3 km is necessary for these simple sites. Applying the models to a simple wind energy offshore wind farm highlights the importance of capturing the correct distributions of wind speed and direction.


Introduction
Numerical weather prediction (NWP) models are increasingly being used in wind energy applications, e.g., wind power resource mapping and site assessment, for planning and developing wind farms, power forecasting, electricity scheduling, maintenance of wind farms, and energy trading on electricity markets. In site assessment, NWP models are commonly part of the model chain used to estimate the annual energy production (AEP) and are responsible for a large part of the uncertainty of this estimate.
The extensive use of NWP models, and the vast customization space of each model, means that a strong demand exists for quantification of (a) the overall model uncertainties and (b) the sensitivity of the uncertainties to the choice of subcomponents and parameters. Understanding the sensitivities and uncertainties of the NWP model output can reduce their associated risks and improve decision making. Model users aware of the sensitivity of individual model components will be able to optimize the model setup for specific applications.
In the following, the NWP models will be referred to as mesoscale models, signifying that they partly resolve atmo-spheric phenomena in the mesoscale range, defined as the range of horizontal length scales from about one to several hundreds of kilometers (Orlanski, 1975).
A common way to assess NWP model uncertainties is to use an ensemble approach, where a number of parallel model runs, referred to as ensemble members, are run with slightly perturbed initial conditions (Warner, 2010). The magnitude of the perturbations is typically limited by the uncertainty associated with the particular perturbed variable in the expectation that the ensemble of solutions will cover the solution space arising from the uncertainties of the input parameters. Ensemble-based techniques are used for many meteorological applications, including precipitation forecasting (Gebhardt et al., 2011;Bowler et al., 2006) and wind power production forecasting (Constantinescu et al., 2011). However, one would not expect that the ensembles of any particular modeling system fully represent the uncertainties of another modeling system. This was also demonstrated in the DEMETER project (Development of a European multimodel Ensemble for seasonal to inTERannual climate prediction) (Palmer et al., 2004), where a multi-model ensemble approach, consisting of a number of different modeling systems, each split into a number of ensembles, provided a better representation of the overall uncertainties than any single model ensemble.
Mesoscale model uncertainties in wind speed near the ground are particularly sensitive to some model components, e.g., the choice of planetary boundary layer (PBL) scheme, the spin up and simulation time, and the grid spacing. In the last couple of decades these sensitivities have been studied in great detail. Vincent and Hahmann (2015), Draxl et al. (2014), and Hahmann et al. (2015b) studied the sensitivities of the Weather Research and Forecasting (WRF) model (Skamarock et al., 2008) in offshore and coastal areas in northern Europe. Vincent and Hahmann (2015) studied the effect of grid nudging, spin-up time, and simulation time on near-surface and upper-PBL wind speed variance. They showed that (1) spatial smoothing is observed when nudging is used, but the impact is small in the lower part of the atmosphere, and (2) longer nudged simulation times (11 days) only have slightly lower variance than short simulations (36 h), which makes longer simulations appropriate for climatological wind energy studies. Draxl et al. (2014) studied the ability of the WRF model to represent the wind speed and wind shear profiles at a coastal site in Denmark using seven different PBL schemes. They showed that the Yonsei University (YSU) (Hong et al., 2006) scheme represents the profiles best for unstable atmospheric stability conditions, while the Asymmetric Convective Model version 2 (ACM2) (Pleim, 2007b), and the Mellor-Yamada-Janjić (MYJ) (Janjić, 1994) PBL schemes had more realistic profiles for neutral and stable conditions respectively. Using the WRF model for wind resource assessment, Hahmann et al. (2015b) showed that the choice of PBL scheme and spin-up time has the greatest impact on the simulated mean wind speed for a number of offshore sites, while the number of vertical levels and the source of initial conditions had a smaller impact.
Several studies have investigated the WRF model sensitivities in regions of complex terrain. Carvalho et al. (2012) studied the sensitivities related to the choice of initialization frequency, grid nudging, suite of surface layer (SL) scheme, PBL scheme, and land surface model (LSM). They observe that using grid nudging and frequent starts (every second day) gives the best agreement for wind speed, with several masts located in complex terrain in Portugal. Carvalho et al. (2012) and García-Díez et al. (2013) found a seasonal dependency of the optimal suite of SL-PBL-LSM for simulating PBL winds and temperature. Carvalho et al. (2014b) investigated the sensitivities related to the SL and PBL schemes in the WRF model at both land and offshore sites in and near Portugal. They showed that the Pleim-Xiu SL scheme (Pleim, 2006) combined with the ACM2 PBL scheme (Pleim, 2007b) gave the smallest errors for wind speed and wind energy production estimates across the sites, while the quasi-normal scale elimination (QNSE) SL and PBL schemes (Sukoriansky et al., 2005) gave smaller errors for offshore sites. In a similar study, Gómez-Navarro et al. (2015) analyzed the sensitivities of the WRF model to the choice of PBL scheme and grid spacing in complex terrain in Switzerland. They found that using a modified version of the YSU PBL scheme, which accounts for effects of unresolved topography (Jiménez and Dudhia, 2012), in combination with the smallest grid spacing (2 km) and analysis nudging gave the best agreements with measurements during a number of wind storms. Carvalho et al. (2014a) studied the sensitivities of simulating the local wind resource with the WRF model at several masts in Portugal to the choice of data set used for initial and boundary conditions. They show that using the ERA-Interim reanalysis data set (Simmons et al., 2007) gave the smallest errors compared to NCEP (National Centers for Environmental Prediction) R2 (Kanamitsu et al., 2002), CFSR (Saha et al., 2010), FNL, and GFS data sets, as well as the NASA MERRA data set (Rienecker et al., 2011).
Sensitivities to the choice of modeling system have also been studied for wind energy applications. Horvath et al. (2012) compared the MM5 (Grell et al., 1994) and WRF models for a site in west-central Nevada characterized by complex terrain. Both models were run in a grid nesting setup from 27 km to 333 m grid spacing, and the near surface wind was compared to wind observations from several 50 m tall towers. The study showed that the WRF-derived winds were in better agreement with mean wind speed observations, but thermally driven flows were overestimated in both intensity and frequency. Hahmann et al. (2015a) compared two downscaling methodologies: the KAMM-WAsP (Badger et al., 2014) and WRF Wind Atlas (Hahmann et al., 2015b) methods, both based on a model chain approach between a NWP model and a linearized flow microscale model, for a number of mast sites in South Africa. The study showed that the WRF-based method gave smaller biases than the KAMMbased approach, which underestimated the wind speeds. Community-driven model intercomparison projects provide an opportunity to study both model uncertainties and sensitivities to model components. In the last decade, several intercomparison projects have been successfully carried out based on model output submitted by modelers from the wind energy community. The Bolund experiment (Bechmann et al., 2011) was an intercomparison of flow models, from simple linearized flow models to computational fluid dynamics (CFD) models. The models were compared to measurements around the small island of Bolund in Denmark. The Comparison of Resource and Energy Yield Assessment Procedures (CREYAP; Mortensen et al., 2015) was an intercomparison of energy yield assessment procedures based on four case studies. The study revealed a large spread amongst the different procedures and highlighted the need for further studies into the uncertainties associated with the models themselves. A similar intercomparison of NWP models is attractive for a number of reasons. First, it offers an opportunity for model developers, model users, and stake holders to get a better understanding of the model uncertainties. Sec-Wind Energ. Sci., 2, 211-228, 2017 www.wind-energ-sci.net/2/211/2017/ ondly, a collaborative intercomparison project, which utilizes model data crowdsourced from the wind energy community, increases the scalability of the study compared to traditional sensitivity studies by distributing the workload and computational cost among participants. Finally, if sufficient metadata are collected, they offer a unique insight into the common practices in mesoscale modeling within the wind energy community.
In this paper, a blind intercomparison of the output from 25 different NWP simulations is presented for three locations in northern Europe. The study is based on model output submitted by the modeling community to an open call for model data for a benchmarking exercise co-organized by the European Wind Energy Association (EWEA, now WindEurope) and the European Energy Research Alliance, Joint Programme Wind Energy (EERA JP WIND). The three chosen sites represent some of the simplest terrains: offshore, inland near the coast, and inland in flat terrain, where the smoothing of the terrain representation is not an issue. The three sites have quality observations from tall meteorological masts with many heights. The main objectives of this study are (1) to highlight and quantify the uncertainties of the models and serve as motivation for future analysis of model uncertainties and (2) to identify model setup decisions that have an impact on the model performance. The models are evaluated using simple metrics relevant to wind energy applications.
The structure of the paper is as follows. In Sect. 2 we present a detailed description of the methodology used, including a description of the three study sites and the models used by the participants. Section 3 presents the intercomparison results, and finally Sect. 4 contains the summary and conclusions of the study.

Sites and observations
Three sites with quality measurements from tall meteorological masts with different terrain characteristics were chosen for this study: (1) FINO3, an offshore mast in the North Sea, (2) Høvsøre, a land mast near the Danish west coast, and (3) Cabauw, a land mast in the Netherlands. The mast locations are shown in Fig. 1, and the coordinates and characteristics of each site are provided in Table 1. Long-term measurements are available from each of the masts, but a single year (2011) was selected as the study period due to its excellent data availability.
FINO3 (Fabre et al., 2014) is a marine platform located in the North Sea 80 km off the coast of Denmark, with a meteorological mast reaching 120 m above mean sea level (a.m.s.l.). We used measurements at 40, 60, and 90 m a.m.s.l. in this study. The Høvsøre  mast is located about 2 km east of the coastline in western Jutland, Denmark. Apart from the sharp surface roughness change at the coast- line and the presence of a small coastal escarpment, the surrounding terrain is homogeneous and flat. We used measurements at 10, 40, 60, 80, 100 m at this site. The Cabauw mast (Ulden and Wieringa, 1995) is located 40 km inland near the small towns of Cabauw and Lopik in the Netherlands. The surroundings are flat and characterized by fairly homogeneous agricultural fields, but with patches of forest and buildings. Here we used measurements at 10, 20, 40, 80, 140, and 200 m. Figure 2 shows availability of wind speed observations for 2011 at the three meteorological masts. At Cabauw, the data were gap-filled by simple interpolation as the missing values were few (less than 2 % missing data per month) and the gaps short. The time series from the two other sites were not gapfilled.
At FINO3, the wind speed measurements at three of the heights, 50, 70, 90 m, are a combination of the measurements from three anemometers at three separate booms 120 • apart. This procedure minimizes the effects of the mast flow distortion. At the other two heights, 40 and 60 m, only one anemometer is available, and the wind measurements are therefore susceptible to flow distortion. Thus, instead of using the single-anemometer data from 40 and 60 m, the measurements from 50 and 70 m were vertically interpolated in log height to 40 and 60 m. This assumes that the errors due to interpolation and extrapolation are much smaller than those caused by mast flow distortion.

Submission procedure and models
EWEA issued an open call for data and the submission procedure consisted of a template spreadsheet and a question-  naire available for download from the EWEA website. The participants filled the spreadsheet with the time series of the required variables at each location and height. The questionnaire contained details about the setup of the modeling system used. The participants returned the spreadsheet to EWEA, who passed it on to the authors in an anonymous version. The requested model variables were hourly wind speed and direction, air temperature, and atmospheric stability. The questionnaire asked about the modeling setup, i.e., the model code and version, the SL and PBL schemes, the LSM, the grid nest size(s) and spacing(s), the vertical levels, the land use data, the length of the simulation, the spin-up time, and the source of the initial and boundary conditions. The participants were also asked to comment on any additional modifi-cations made to the model, including assimilation, ensemble, or other methods used. Table 3 lists the various groups participating in the exercise. It includes representatives from private companies, universities, research centers, and meteorological institutes. Table 4 summarizes the models and the different model setup options used. The WRF model is by far the most commonly used model in the study, with 18 out of 25 models (Table 4). The Noah LSM was the most common LSM used, and the ERA-Interim reanalysis was the most common source of boundary and initial conditions. The PBL scheme used and the source of land cover data were more varied amongst the participants. Most models used a maximum simulation length of less than 100 h, including the spin-up time (most typically 12 h spin-up and 36 h of total simulation). The simulation and spin-up length ranged from 1 h spin-up and 7 h simulation to 24 h spin-up and continuously running for the full year.
For reference, wind time series from the ERA-Interim reanalysis (Dee et al., 2011) were included in the comparisons whenever possible. The ERA-Interim reanalysis data set is a global data set based on extensive assimilation of surface and upper-air observations. The data are available on a grid spacing of about 80 km in the horizontal with 60 vertical levels, with values at approximately 10, 34, 69, 118, 187, and 275 m above the model surface. We used bilinear interpolation to interpolate to the sites coordinates and linear interpolation in the vertical. The data set is available in 6 h intervals; thus, linear interpolation in time was used to obtain hourly samples.

Statistical methods
This study is based on direct comparison between the observations and model output at collocated positions, as well as intercomparison of the modeled output. The sampling frequency for the study was chosen to be 1 h. For the observation data this means hourly mean values; for the mesoscale models the inter-hourly variation is small; thus, instantaneous values were used. To ensure temporal consistency between observations and modeled output, instances of missing data from the observations were removed from the modeled output. Furthermore, to get consistent vertical profiles, only instances where all heights for a particular mast with available data were used. The model output submitted was Wind Energ. Sci., 2, 211-228, 2017 www.wind-energ-sci.net/2/211/2017/ assumed to be quality checked by the submitter, but it was also checked by the authors for obvious nonphysical or inconsistent behavior and not used in that case. The number of models excluded was between two and four at each of the sites, but no model was excluded from all three sites.

Inter-model mean and inter-model variations
The emphasis of this study is on the wind speed, u, and wind direction, as they are the most important variables for wind energy applications. In the following, a subscript "m" signifies the temporal mean of a variable, i.e., u m is the temporal mean wind speed. This is not to be confused with the mean value of the model ensemble, also referred to as the intermodel mean, which is denoted with a tilde. For example, the mean of the model ensemble for the temporal mean wind speed is denoted as u m and calculated as Here i is the model index and N is the total number of models. Likewise, it is useful to define its standard deviation: which is the standard deviation of the inter-model variation between the temporal model means. Since u m and σ u m are both sensitive to outliers, we used the following procedure: 1. Calculate u m and σ u m .
3. Recalculate u m and σ u m with the new subset of models.
The value of 3.5 σ u m was chosen somewhat arbitrarily to ensure that only extreme outliers were removed. The procedure included only models with output available at all the heights to ensure a vertically consistent profile of the mean and its variation. Typically, only one or two models were removed by this criteria.

Coefficient of variation
Variations in wind speed often scale with the mean wind speed. Thus, to allow for intercomparison of wind speed variation intensity across vertical levels, we define the coefficient of variation, C v,u . It is defined as the ratio of the standard deviation and the mean, σ u /u m , and is a unit-less measure of the relative variation at the sampling timescale. At timescales of seconds it is known as the turbulence intensity, but in this case, with a sampling frequency of 1 h, it represents the intensity of variations of synoptic-and mesoscale weather phenomena.

Wind speed shear exponent
To diagnose the wind shear in the boundary layer, we use the wind shear exponent, α, which uses the wind speed u 1 and u 2 at two heights z 1 and z 2 , given by the expression In the surface layer, α is strongly influenced by the surface roughness and the atmospheric stability. By comparing the modeled to the measured α, it is thus possible to gain insights into how the model captures these effects.

Error metrics
The RMSE and the normalized RMSE (NRMSE) were used as error metrics to obtain single-value measures of the error across heights at a site. The RMSE and NRMSE are defined as for a set of n modeled values x M j and observed values x O j . The RMSE was used for variables that do not scale with height in the surface layer, e.g., wind speed shear exponent; the NRMSE was used for variables that do scale with height, e.g., wind speed.

Wind energy application
To investigate the errors associated with the use of each model in wind energy applications, we performed a simple wind resource assessment exercise, using both measurements and modeled time series at FINO3.
A typical approach to resource assessment is to run a mesoscale model for a number of years, followed by a downscaling process where the wind climate statistics obtained from the mesoscale model are used as input to a microscale model (Badger et al., 2014;Hahmann et al., 2015a). In simple terrain, the microscale model usually consists of a flow model like the one used by the Wind Applications and Analysis Program (WAsP). WAsP uses a linearized flow model based on Jackson and Hunt (1975). The procedure in WAsP consists first of an upscaling, where local effects from variations in orography, surface roughness, and obstacles are removed from the wind climate statistics. This is referred to as generalization of the wind climate, which makes it representative for a larger area than the site-specific wind climate. The size of this area depends on the complexity of the surface roughness and orographic variations in that area. To obtain a site-specific wind climate at a new site in this area, the generalized wind climate is downscaled by reversing the generalization process, i.e., by introducing the site-specific effects of orography, surface roughness, and obstacles of the new site. Given the wind climate and the turbine power curve, the expected power output can be calculated for any site. Since the participants in this intercomparison were not requested to submit the model-specific orography and roughness maps near each site, it is not possible to go through the generalization procedure and subsequent downscaling process at the inland sites. However, for the offshore site FINO3 there are no effects of orography, and the differences in roughness between the models can be assumed to be negligible. Therefore, we can use the raw model output at this site to estimate the wind resources estimated by each of the models, without the generalization procedure.
We performed the wind resource exercise at 90 m at FINO3, assuming first a single Vestas V80 turbine at the site, and then repeated for the exercise for the wind farm of Horns Rev, which is an 80-turbine wind farm located near FINO3. The resource estimations for the wind farm include the simple wake parametrization present in the WAsP model, which was used to estimate the power losses.

Mean quantities and distributions
The following subsection is dedicated to the general performance of the models and their ability to capture the mean and the distributions of a number of wind-related quantities. As previously stated, the goal is to highlight the weaknesses of the models to encourage further analysis of model sensitivities. Figure 3 shows the vertical profiles of mean wind speed (u m ) at the three sites. At FINO3 (Fig. 3a), most mesoscale mod-els (MMs) underpredict u m at all heights. However, the bias on average is less than 0.27 m s −1 (∼ 2.8 %). This is a small bias compared to that of the ERA-Interim data, which show a larger bias than all the mesoscale models. The inter-model variance σ u m at FINO3 is 2.7-3.1 % of the inter-model mean, and decreases with height. That is the lowest combined intermodel variance of any of the three sites.

Annual mean wind speed
At Høvsøre (Fig. 3b), the MMs generally have small wind speed biases above 10 m. The error of the inter-model mean of the models is smaller than ±0.16 m s −1 (∼ 1.9 %), and the inter-model variance is 3.0-5.2 %, decreasing with height, which is low compared to the biases at the other site on land (Fig. 3c). At 10 m, most MMs overpredict the mean wind speed. The inter-model mean has a positive bias of 0.54 m s −1 (∼ 8.4 %). The largest inter-model variance is also seen at 10 m (7.8 %). The ERA-Interim also overpredicts the mean wind speed at 10 m, with a larger bias than u m . Above 10 m, ERA-Interim has smaller errors, but the shape of the profile is not well captured. Signs of a "kink" in both the observed and modeled profiles are present, which could indicate the transition from the low surface roughness of the sea to the higher surface roughness inland.
At Cabauw (Fig. 3c), most of the MMs overpredict u m . Only one of the models and the ERA-Interim reanalysis show a significant underprediction, and in the case of the reanalysis, this underestimation increases with height. The overprediction by the rest of the MMs varies in magnitude, but the average of the models, excluding the outliers, is in the range of 4-9 % across the different heights. The largest relative errors are at the lowest levels. The inter-model variance ( σ u m ) at Cabauw varies between 3.3 and 8.1 % across the different heights and is largest at the lowest levels. The decrease in wind speed bias with height was also observed by Jiménez et al. (2016), who associated this with excessive turbulent mixing, which may be caused by a misrepresentation of the surface roughness length.   Figure 5. Wind direction distributions at the three sites (FINO3 at 90 m, Høvsøre at 80 m, and Cabauw at 80 m), based on 24 sectors, for the observations (black), the ERA-Interim data set (green), the mesoscale models MM i (red), and the inter-model mean MM (blue line) and its standard deviation ± σ (blue shade). Figure 4 shows that, on average, the MMs capture the wind speed distributions well compared to the observations. The only exception is a slight shift towards higher wind speeds at Cabauw, corresponding to the positive bias in mean wind speed observed in Fig. 3. The ERA-Interim data set captures the distribution well at Høvsøre, but it has distributions that are shifted towards lower wind speeds at FINO3 and Cabauw, corresponding to the bias in Fig. 3. Figure 3 shows that the MMs generally capture the mean wind speed well. This is also true for the wind direction distributions, commonly called "wind roses". The distributions are split into 15 • sectors at heights of either 80 or 90 m. Figure 5 also shows that the models are in good agreement. At all three sites the MMs capture the distribution better than the reanalysis data. At all sites, but most markedly at Cabauw, the ERA-Interim distribution is rotated clockwise relative to the distribution from the observations and MMs. This rotation might result in a different wind farm layout if its power is optimized according to the wind roses from MMs or the ERA-Interim. Figure 6a shows the monthly distribution of the mean wind speed for the MMs and the measurements. Apart from a few models outside the 3 × quartile range, most models capture the diurnal cycle well. Interestingly, the figure also reveals that both the overestimation by the models at Cabauw and the underestimation at FINO3, seen in Fig. 3, are evenly distributed throughout the year. At Høvsøre, a mix of under-and overestimations are observed. Figure 6b shows the monthly distribution of the mean absolute error (MAE) for wind speed for the MMs. Summer and spring are generally associated with larger deviations between the modeled and observed wind speeds. It is well established that fall and winter weather in northern Europe is governed by large-scale planetary and synoptic weather phenomena, which is well captured by mesoscale models. During spring and summer, meso-and thermally induced phenomena (e.g., sea breezes and convection) have a larger impact on the flow, which is more difficult for the models to correctly capture. The lowest MAE is observed at FINO3 in

Effect of atmospheric stability
It is generally acknowledged that non-neutral atmospheric stability conditions pose one of the greatest challenges for MMs (Fernando and Weil, 2010). To study the performance of the models in different stability regimes, the stability parameters supplied for each model (inverse Obukhov length or bulk Richardson number) were used to group the hourly samples into five stability classes based on Gryning et al. (2007) and Mohan and Siddiqui (1998), shown in Table 2. Because the models represent atmospheric stability in different ways, the number of samples in each stability group varies for the different models. However, the number of samples in each group was never below 150 h (out of 8760 h), and it was more than 400 in most cases. The MAE for wind speed was calculated for each of groups and for all models. The results are shown in Fig. 7.
At all three sites, the smallest deviations between modeled and measured wind speeds are found when the models perceive the surface layer stability from unstable (U) to stable (S). The MAE in these cases typically range from 10 to 35 %, with just a few models outside of the 3 × quartile range. The largest deviations are found when the models estimated very stable conditions (VS) or very unstable conditions (VU) (typical values in the range 15-45 % MAE). The site where the largest errors are found is Cabauw, and the smallest is FINO3. This is in agreement with the results in Sect. 3.1.
Wind Energ. Sci., 2, 211-228, 2017 www.wind-energ-sci.net/2/211/2017/ Table 2. Ranges of inverse Obukhov length (1/L) and bulk Richardson number (Ri b ) used in the stability classification. The 1/L classes were used in Gryning et al. (2007) and the Ri b classes in Mohan and Siddiqui (1998 Figure 8 shows the mean coefficient of variation (C v,u ) for wind speed at the three sites. At FINO3, the average of the MMs C v,u is similar to the observations, with a bias of less than 1 % at all three heights. Ignoring one outlier, the intermodel variance ranges between 3.0 and 3.5 % at the three heights. The outlier, which shows much lower values, is a consequence of the low variance for that model compared to the other models. It was removed by the filtering method described in Sect. 2.3 when calculating the mean of the models ( C v,u ) and the inter-model variance ( σ C v,u ). The ERA-Interim data set also captures the magnitude of C v,u well. At Høvsøre, C v,u decreases with height for both the observations and most of the MMs. The inter-model mean of the models ( C v,u ) agrees well with the observations, but underestimates it by about 2 %. The ERA-Interim data set does not capture this behavior, and instead shows an increase with height. At the highest levels, however, it reaches the average of the models and the observed values. The spread of the MMs (σ C v,u ) is slightly higher than at FINO3 (3.6-4.4 %) and is highest at the lowest levels.

Coefficient of variation of wind speed
At Cabauw, C v,u at 10 m is the largest value found across all sites. Above 10 m a sharp drop-off is found up to 80 m, where is starts to slowly increase up to 200 m. Most of the MMs capture this behavior, which is reflected in the mean of the models ( C v,u ). However, the models underestimate the magnitude and the drop-off of C v,u at the lowest levels, with Company Spain a bias of up to 12 % at 10 and 20 m. Above 80 m the models agree with the observations. The ERA-Interim data set is nearly constant with height, underestimates C v,u below 40 m, and overestimates it above. The inter-model variance ( σ C v,u ) of the MMs is largest at the lowest levels, 8.0 % at 10 m, and gradually decreases to less than 4 % at 200 m.

Effect of upstream conditions on the variation of wind speed
The coastal site Høvsøre and the offshore site FINO3 is used to investigate whether there is a dependency of the coefficient of variation for wind speed (shown in Fig. 8) on upstream surface conditions. With a nearby coastline aligned northsouth, Høvsøre represents the case with anisotropic surface roughness conditions: westerly winds come from the sea (onshore flow) and easterly winds from land (offshore flow). In contrast, the offshore site FINO3 has isotropic upstream surface roughness. To study the differences, the coefficients of variation were binned according to four wind direction sectors, each spanning 90 • : north, east, south, and west. The values for the east and west sectors were then extracted and analyzed. Figure 9 shows the profiles of C v,u for the two wind directions at FINO3 and Høvsøre. At FINO3, the coefficient of variance is almost constant with height and slightly lower for easterly winds than for westerly flow. This is true for both models and observations. The sample size for easterly winds is smaller, about half, than for westerly flow. However, both sample sizes are large (N > 1000); thus, the influence from sample sizes is expected to be small. The average of the MMs captures the observed behavior well for both westerly and easterly winds, and the inter-model variance is similar for the two sectors. The ERA-Interim agrees better with the observations during easterly flow at FINO3. At Høvsøre, the coefficient of variation is larger for westerly than for easterly winds. Easterly winds show larger coefficients of variation at 10 m than higher up. The reduction of C v,u with height up to 40 m for easterly flow is underestimated by most of the mesoscale models and completely missed by the ERA-Interim data set. For westerly winds, the mean of the models and the observations agree but is underestimated by ERA-Interim.
The dependence on height of C v,u is only present at Høvsøre for easterly winds and points to the influence of upstream surface conditions on the variation. The observed pattern is captured by the MMs, but the models show a more smoother vertical transition than the observations do. The ERA-Interim reanalysis does not capture the pattern. Figure 10 shows the distributions of wind speed shear exponent (α) for each of the three sites calculated between 40 and 80 or 40 and 90 m. Under neutral atmospheric stability conditions and isotropic surface roughness, a sharp distribution centered around a single value is expected. This means that for offshore sites such as FINO3, the spread in shear exponent comes primarily from variations in atmospheric stability. With this in mind, the distributions show that most MMs capture the stability well at the site. The ERA-Interim data set does not capture the strongest shear situations well. This can be easily explained by the low data frequency (6 h At Høvsøre and Cabauw, the distributions of α reflect the combined effect of both the nonhomogenous upstream surface roughness and the variations in atmospheric stability. At the coastal site, the wind speed profile changes depending on whether the fetch is from land or from the sea, which is also reflected in the distribution of α (Hahmann et al., 2015b). Figure 10 also shows that while the shear distributions are generally also well captured at Høvsøre and Cabauw, a slight shift towards lower values is observed at both sites. This points to an underestimation of the surface roughness, a misrepresentation of the atmospheric stability, or a combination of the two. Just like at FINO3, the ERA-Interim data set does not capture the weak and strong shear cases at Høvsøre and Cabauw.

Relating performance to model setup
To identify what model setup choices lead to better model performance, the statistics of each model across all heights are reduced to just two values at each site: NRMSE for wind speed (NRMSE u ) and RMSE for wind speed shear exponent (RMSE α ). The shear exponent was calculated between pairs of nearby levels, e.g., at FINO3 two values were calculated, one between 40 and 70 m, and one between 70 and 90 m. The RMSE α was then calculated as described in Sect. 4 between modeled and observed values of the shear exponent across all height pairs. Figure 11 shows NRMSE u and RMSE α for all MMs at all three sites. It shows, similar to Sect. 3.1, that the models generally have smaller mean wind speed and mean shear exponent errors at the offshore site FINO3. However, as previously shown, errors are larger near the surface, and the three levels used at FINO3 are at 40 m and above, unlike Høvsøre and Cabauw where levels below 40 m are included.
The models were then grouped according to specific model components. Given the range of setup choices that influence the model performance, large groups were needed to obtain useful statistics. With this in mind, three setup options were chosen for analysis: PBL scheme, grid spacing, and simulation lead time; statistics of NRMSE u and RMSE α Figure 11. RMSE for wind speed shear exponent (RMSE α ) versus normalized RMSE for wind speed (NRMSE u ) at the three sites.
were also computed for each group. The choice of groupings was based mainly on two criteria: (1) it was possible to form groups with at least six members in each group and (2) each of the options was highlighted in the literature as being important for model performance (Hahmann et al., 2015b;Gómez-Navarro et al., 2015;Carvalho et al., 2012;Draxl et al., 2014). Several other setup options were considered: MM, LSM, land cover, spin-up time, and data set used for initial and boundary conditions, but either it was not possible to group them in a meaningful way, or they were deemed of too little importance based on previous studies. Models missing information about particular setup options, or missing output at some heights, were excluded from this analysis.

PBL scheme
The PBL scheme in a MM ensures an accurate representation of thermodynamic and kinematic structures of the lower troposphere (Cohen et al., 2015). Two important characteristics of the PBL schemes are their order of closure and whether mixing happens through a local or a nonlocal process. Equa- Table 4. Setup description of the 25 model setups ranked by horizontal grid spacing of the finest grid. The columns are the model name and version (model), the PBL scheme (PBL), the land surface model (LSM), whether nesting was used (Nest.), the horizontal grid spacing ( ), the land cover source, simulation and spin-up time (Sim. time), and initial and boundary condition data (B.C.  Noilhan and Mahfouf (1996). l Champeaux et al. (2005). m Pleim (2007a). n NCEP final analysis. o Saha et al. (2010). p Nakanishi and Niino (2006). q Rienecker et al. (2011). r Friedl et al. (2010). s Loveland and Belward (1997). t Lean et al. (2008). u Lock et al. (2000. v Cox et al. (1999). w Kallos et al. (1997). x Pan and Mahrt (1987). y Global Forecast System. z Kallberg (1989). aa Cuxart et al. (2000). ab Integrated Forecasting System. ac Pielke et al. (1992). ad Walko and Tremback (2005). ae Grell et al. (1994).
tions describing turbulent motion of order n contain terms of order n + 1. The order of closure describes the highest order of equations included; higher orders are parametrized. In local schemes, variables are only affected by adjacent cells, while nonlocal schemes relate changes to gradients in the whole PBL column (Cohen et al., 2015).
To study the influence of the PBL schemes used, the MMs were split into three groups: YSU, MYJ, and Other. The statistics of NRMSE u and RMSE α for these groups are shown in Table 5. The YSU group consists of six models that used the YSU PBL scheme (Hong et al., 2006), which is a first-order nonlocal scheme. The models in this group span a range of grid spacings and lead times, but models with larger-than-average grid spacing and longer-thanaverage lead times dominate the group. The MYJ group contains six models that used the MYJ PBL scheme (Janjić, 1994), which is a 1.5 order local scheme. Most of the models use a short lead-time limit and a grid spacing that is close to the average for the MMs in this study. The last group, labeled Other, contains nine models that use a mix of different PBL schemes (see Table 4), with different orders of closure and a mix of local and nonlocal formulations. These models have a wide representation of different grid spacings and lead times.
At FINO3, the group consisting of models not using either the YSU or MYJ PBL schemes generally has smaller wind speed errors; even though the group also contains the model with the largest NRMSE u . The models using the MYJ PBL scheme have smaller wind shear exponent errors, and also on average smaller wind speed errors than YSU. However, the median model in the YSU and MYJ groups has similar wind speed errors.
At Høvsøre, the three groups have very similar mean wind speed error statistics, with YSU showing only slightly smaller errors. However, for wind shear exponent, the models in the YSU group have the smallest errors, both on average and for the median model. Draxl et al. (2014) (Hahmann et al., 2015b). At Cabauw, the YSU group has smaller errors than the other groups for both wind speed and wind shear exponent, but the errors for the median model in the YSU and MYJ groups are quite similar. The single most accurate model is found in the Other group, but that group as a whole has larger errors.

Grid spacing
A mesoscale model should be able to explicitly resolve smaller and smaller phenomena as the grid spacing is decreased. Skamarock (2004) illustrated that the effective resolution of the WRF model is approximately 7 times the grid spacing used. However, mesoscale models, as the name suggests, have been developed to simulate the mesoscale, and they are often not capable of simulating weather at scales that lie between the micro-and mesoscales, i.e., between approximately 100 and 2000 m. To study the importance of the grid spacing, the models were ranked by grid spacing, similar to Table 4. The models were then split into three groups: fine, moderate, and coarse. The fine group consists of seven models that all have a grid spacing below 3 km. The moderate group consists of eight models at exactly 3 km, and the coarse group consists of six models above 3 km. The fine group contains models that are well distributed in terms of PBL schemes and simulation lead time. The moderate models also have a good representation of different PBL schemes and lead-time limits, but the MYJ PBL scheme and short lead times are most common. The coarse group contains no models using the MYJ PBL scheme, and half of the models use a short lead time. Table 6 shows the statistics for NRMSE u and RMSE α . At FINO3, the fine group has the smallest wind speed errors. For the wind shear exponent, the smallest error is found in the coarse group, but on average the fine and moderate groups have smaller errors. At Høvsøre, the fine and moderate groups have similar errors for both wind speed and shear exponent. However, the model with the smallest shear exponent error is found in the coarse group. At Cabauw, the moderate group shows the smallest errors for both metrics, followed by the fine group. However, just as for Høvsøre, the model with the smallest RMSE α is found in the coarse group.

Simulation time
As the solution in mesoscale models is integrated forward in time, the uncertainties associated with the errors in the initial conditions increase (Yoden, 2007). This can cause the model solution to drift away from the true solution. Furthermore, amplification errors can reduce the variance, which reduces the accuracy of the model in a statistical sense. To study the influence of the simulation time on the model performance, the models were ranked and split into three groups: short, medium, and long. The short group consists of nine models with a lead time below 48 h. Four models in the group use the MYJ scheme, and one uses the YSU scheme. The short group has a good representation of models with different grid spacings. The medium group includes eight models with a lead time between 48 and 335 h. The group has a good representation of different PBL schemes and grid spacing. The long group consists of seven models with a lead-time limit above 335 h. Five of the models use the YSU PBL scheme, and most of the models use a larger-than-average grid spacing. Error (%) Figure 12. Distribution of errors from the model's output at 90 m at FINO3 for the following errors: (1) the mean wind speed u m (blue), (2) the power density P m (green), (3) the power density with an implied power curve P m,pc (red), and (4) the averaged power density of a wind farm, including the same implied power curve as (3) and the wake effects (purple). Outliers are not shown; the most extreme ones are −25 % for u m , −60 % for P m , −37 % for P m,wf , and −35 % for P m,pc Table 7 shows the errors statistics for the three simulationtime groups. At FINO3, the median model from the short group has the lowest NRMSE u and RMSE α , but because one model has large errors, the lowest mean errors are found in the medium group. The medium group has smaller errors across all metrics compared to the long group.
At Høvsøre, the short and long groups have similar error statistics for wind speed, and both measures are lower than those for the medium group. For RMSE α the median model from the short group has the smallest error, while on average the errors are smallest in the medium group.
At Cabauw, the smallest errors for both wind speed and shear exponent are on average found in the long group, while the median model with the smallest errors is in the short group. It is worth noting that five of the seven models in the long group use the YSU PBL scheme, and in Sect. 3.2.1 the models using the YSU PBL scheme were shown to have smaller errors at Cabauw; thus, it cannot be ruled out that the small errors in the long group at Cabauw are related to the overrepresentation of the YSU scheme and not the simulation length.

Wind energy application
As described in Sect. 2.4, the output from the mesoscale models was applied to a simple wind energy exercise. The 90 m wind resource of a Horns Rev wind farm was estimated using the output from the various MMs at FINO3. Figure 12 shows the errors for four metrics: (1) error in mean wind speed u m , (2) error in mean power density P m , (3) error in mean power density using a single power curve P m,pc , and (4) error in the mean power density of a wind farm of 80 turbines P m,wf , including wake effects. Figure 12 shows that the majority of the models have less than ±5 % error in mean wind speed. The errors are mostly underestimations, and in a few cases severe underestimation of more than 10 % (outside the scale of the figure). For the mean power density, the spread of the models is, as expected, much larger due to the third-power dependence on the wind speed. However, when the power density is calculated using a turbine power curve, where the highest wind speeds (> 14 m s −1 ) are less important, the inter-model variance is comparable to that for mean wind speed. For the wind farm case, where the power density depends on the wind direction distribution, because of the wake losses, the variance is comparable in size to that of the mean wind speed and P m,pc , and most models have errors smaller than ±2 %. The improvement seen for P m,wf is caused by the underestimation of the wake effects by most models, leading to a relative increase in mean power density, offsetting the underprediction from the modeled wind speed distribution. However, the relative effect of over-or underpredicting the wake effects may just as well enhance the total power density errors, given slightly different wind direction distributions.

Summary and conclusions
The mesoscale models in this study are able to reproduce the observed mean wind speed profiles and the distributions of wind speed well. At FINO3 and above 10 m at Høvsøre, the average of the models has a bias of 3 % or less. The largest mean wind speed biases (7-9 %) are found at the lowest levels at Høvsøre and Cabauw. Similarly, the MMs were able to reproduce the relative variations of wind speed well in most cases (Fig. 8), but they underestimated the relative variations at the lowest levels at Cabauw. A simple analysis of the impact of upstream surface roughness conditions on the relative wind speed variations suggested that the models may be misrepresenting the surface characteristics ( Fig. 9), which could be a misrepresentation of either the land use classification, the conversion of land use classes into surface roughness lengths, or the PBL scheme. This problem highlights the need for (1) further analysis of the representativeness of the surface characteristics in mesoscale models and (2) downscaling of the mesoscale results using a coupled microscale model to capture subgrid-scale influence from variations in orography and surface roughness. The modeled distributions of the wind direction showed only minor differences compared to the observed ones.
For future benchmarking exercises, our study shows that the focus should be on the model representation of surface characteristics, such as orography and land use, and their associated surface roughness. An attempt was made here to include these details, but because only a subset of the participants supplied this information, it was not feasible. Further studies could also benefit from including more land masts with low to moderate complexity, where capturing the surface characteristics is important, but still manageable by mesoscale models.
The impact of choosing specific model subcomponents was studied in some detail. To allow this, the output from the models was reduced to two metrics at each site, one related to the wind speed bias (NRMSE for wind speed) and one related to the shape of the wind speed profile (RMSE for wind speed shear exponent). The models were then separated into large groups according to their model setup for three setup choices: PBL scheme, grid spacing, and simulation lead time. At FINO3, the grouping revealed that the models using the MYJ PBL scheme had smaller wind speed and shear exponent errors than those that use the YSU scheme. At Høvsøre and Cabauw, the opposite was true. However, the differences between the two groups were not significant and the median model from the two groups had similar errors. Grouping the models according to grid spacing showed that the models with 3 km grid spacing or smaller had lower errors than the group with the largest grid spacings. For these sites, no conclusive evidence was found that reducing the grid spacing below 3 km results in smaller errors. For simulation lead time, the median model from the group with short lead times had the smallest errors at all sites, with the exception of the shear exponent error at Høvsøre. However, no significant difference between the mean of the groups was found, which suggests that the PBL scheme and grid spacing may be of greater importance for the performance at these sites. Future studies should include many more runs to provide more robust statistics, which can provide a basis for best-practice guidelines for wind energy applications using NWP models.
Last, we used the observed and modeled time series for a classical wind energy application, the estimation of power production at a hypothetical wind farm at FINO3. The power production, including wake losses, was estimated for both a single turbine and for a wind farm, using a standard power curve. The exercise showed that while a large spread exists between the modeled power density, it is reduced when the power is calculated using a power curve. It also showed the importance of accurately estimating the wind direction distribution since a small deviation in the distributions might induce large changes in the power production because of its sensitivity to the wind farm layout. Data availability. The output data from the mesoscale models have been submitted to the European Wind Energy Association (EAWE) for the mesoscale benchmarking study under an agreement that ensures that individual participants are anonymous in the reported results, and that the model output was not publicly shared. The measurements from the meteorological masts FINO3, Høvsøre, and Cabauw are provided by the data owners under an agreement of not sharing the data with any third party.