Check and process

(based on Rule 6 of the article doi:10.1111/2041-210X.14033)

Rigorously check your data quality, integrity and compatibility during each step of data processing. Trait-based analyses, particularly when data are consolidated from different sources, can harbour inherent incompatibilities that may cause biases and severe scientific misinterpretations. For trait compilations, data usually need to be harmonised, subset, transformed, derived and/or aggregated into comparable formats to fit the research question. Wherever possible, these steps should be scripted and directly reproducible; where this is not possible, manual steps should be well documented.

Harmonise trait data:

If trait data originate from multiple sources, each source may identify the same entities or concepts differently (Kunz et al., 2022). Harmonisation is crucial to reconcile equivalent entities and to connect related entities explicitly by “similar” or subclass relationships. Ideally, these entities or concepts should be identified by standard identifiers (see Rule 5). Manual harmonisation may be necessary to detect and reconcile spelling variations before text strings are mapped to identifiers. For common classes of data, however, a variety of services allow automated and reproducible harmonisation, e.g. for taxonomic names (Boyle et al., 2013; Chamberlain & Szöcs, 2013; Global Names Architecture; reviewed by Grenié et al., 2022), units (Gama, 2014) or geographic names (Boyle et al., 2022). Other covariates and categorical trait values may be semantically reconciled where appropriate ontologies exist (Kunz et al., 2022; Violle et al., 2015).
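As a simple illustration, the following minimal Python sketch reconciles spelling variants of taxonomic names against an accepted reference list before mapping them to identifiers; the names, identifiers and similarity cutoff are hypothetical placeholders, and in practice a dedicated name resolution service (such as those cited above) would be preferable:

    import difflib

    # Hypothetical reference list of accepted names with placeholder identifiers
    accepted = {
        "Quercus robur": "taxonID:0001",
        "Fagus sylvatica": "taxonID:0002",
    }

    def harmonise_name(raw_name, cutoff=0.85):
        # Map a possibly misspelled name to an accepted name, or flag it for manual review
        match = difflib.get_close_matches(raw_name, accepted.keys(), n=1, cutoff=cutoff)
        if match:
            return match[0], accepted[match[0]]
        return raw_name, None  # unresolved: requires manual harmonisation

    print(harmonise_name("Quercus robor"))   # ('Quercus robur', 'taxonID:0001')
    print(harmonise_name("Unknown taxon"))   # ('Unknown taxon', None)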

Filter where needed and double-check data contexts:

Not all trait data are equally suitable for all purposes. Erroneous or duplicate data points need to be identified and removed before analysis, e.g. by validating data origins and metadata to confirm that identical values indeed stem from independent measurements. As with other kinds of data, outlier detection and data visualisation are valuable methods for detecting such errors (de Bello et al., 2013). For trait data compiled from different sources, other reasons may also render data points inappropriate. For example, if the metadata suggest that an observation comes from a cultivated occurrence such as a botanical garden, greenhouse, zoo or farm, its values might not be representative of wild specimens (Gering et al., 2019). Observations stemming from introduced or experimental populations may violate assumptions as well. Observations can be collected from different subsets of a population (e.g., adult vs. juvenile, healthy vs. diseased), at different times of year (e.g., breeding season vs. overwintering), in different contexts (e.g., experimental temperature treatments) and using different protocols. It is essential to exclude unsuitable observations, usually by making use of the associated metadata.
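A minimal sketch in Python/pandas of such metadata-based filtering (column names, values and plausibility thresholds are hypothetical) that removes duplicates, subsets on context metadata and flags implausible values for manual inspection:

    import pandas as pd

    # Hypothetical compiled trait table with context metadata
    obs = pd.DataFrame({
        "species":     ["A", "A", "A", "B", "B"],
        "body_mass_g": [12.1, 12.1, 250.0, 8.4, 8.6],
        "basis":       ["wild", "wild", "wild", "captive", "wild"],
        "life_stage":  ["adult", "adult", "adult", "juvenile", "adult"],
        "source_id":   ["r1", "r1", "r2", "r3", "r4"],
    })

    # Remove exact duplicates that stem from the same source record
    obs = obs.drop_duplicates(subset=["species", "body_mass_g", "source_id"])

    # Keep only observations whose context matches the research question
    obs = obs[(obs["basis"] == "wild") & (obs["life_stage"] == "adult")]

    # Flag implausible values (here: above an assumed plausible maximum per species)
    # for visual inspection or formal outlier detection
    plausible_max_g = {"A": 20.0, "B": 15.0}
    obs["suspect"] = obs["body_mass_g"] > obs["species"].map(plausible_max_g)
    print(obs)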

Derive traits from raw data:

Research questions may concern composite or derived traits, such as the ‘hand-wing index’ (a measure of a wing’s aspect ratio in birds). It is advisable to calculate derived traits directly from the raw data where possible, to avoid bias and to allow for new calculations. This may not always be possible because of data gaps; in such cases, the calculation can be done at a higher level of aggregation (e.g. at the taxonomic level of interest).
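For example, a minimal sketch with hypothetical measurements, expressing the hand-wing index in one common form as 100 × Kipp's distance / wing length, and deriving it at the level of the raw, individual measurements:

    import pandas as pd

    # Hypothetical raw wing measurements per individual (in mm)
    raw = pd.DataFrame({
        "species":        ["A", "A", "B"],
        "wing_length_mm": [70.0, 72.0, 55.0],
        "kipps_dist_mm":  [21.0, 22.5, 11.0],
    })

    # Derive the composite trait from the raw data of each individual
    raw["hand_wing_index"] = 100 * raw["kipps_dist_mm"] / raw["wing_length_mm"]

    # Only where individual-level data are missing would the index be computed
    # from aggregated values, e.g. from species-level means of the raw measurements
    print(raw)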

Aggregate trait data:

Trait data may come at different levels of resolution: a dataset may include multiple measurements per individual, per population, per species or even at higher taxonomic levels. Such structures may require aggregation (e.g., calculating average trait values) within individuals, then populations, then species within a particular data source, and finally across data sources if the species is represented in several of them (Schneider et al., 2019). The way trait values were aggregated has to be described precisely, in particular when data transformation is involved; for example, when leaf area is to be expressed on a log scale, it makes a difference whether the log is taken before or after aggregating the data. Importantly, if multiple successive steps of aggregation are necessary, the uncertainty of the final trait values must be quantified properly and the effect of aggregation on the results and conclusions assessed, e.g., by sensitivity analyses with differently aggregated datasets (Kunz et al., 2022).
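The following sketch (hypothetical values, assuming measurements nested within individuals, populations and species) illustrates stepwise aggregation and why the order of log-transformation and aggregation matters:

    import numpy as np
    import pandas as pd

    # Hypothetical leaf area measurements nested in individuals, populations and species
    df = pd.DataFrame({
        "species":    ["A"] * 6,
        "population": ["p1", "p1", "p1", "p2", "p2", "p2"],
        "individual": ["i1", "i1", "i2", "i3", "i3", "i4"],
        "leaf_area":  [4.0, 6.0, 5.0, 40.0, 60.0, 50.0],
    })

    # Aggregate stepwise: measurements -> individuals -> populations -> species
    ind = df.groupby(["species", "population", "individual"], as_index=False)["leaf_area"].mean()
    pop = ind.groupby(["species", "population"], as_index=False)["leaf_area"].mean()
    sp  = pop.groupby("species", as_index=False)["leaf_area"].mean()

    # Taking the log after aggregation ...
    log_after = np.log(sp["leaf_area"].iloc[0])

    # ... versus before aggregation (same stepwise scheme applied to log-transformed values)
    df["log_leaf_area"] = np.log(df["leaf_area"])
    log_before = (
        df.groupby(["species", "population", "individual"])["log_leaf_area"].mean()
          .groupby(level=["species", "population"]).mean()
          .groupby(level="species").mean()
          .iloc[0]
    )
    print(log_after, log_before)  # the two values differ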

Transform and standardise where applicable:

As with other types of data, transformations such as the natural logarithm or square root may be essential to meet the requirements of analytical models. Beyond these, data challenges include how to combine binary, categorical and continuous traits in the same analysis (de Bello et al., 2021). It is thus very useful to explore the transformation and standardisation options applied in the current trait literature. For example, to compare the effects of several explanatory traits on a specific response in a linear model, values can be standardised for each trait to range between 0 and 1, or by scaling their mean to 0 and their standard deviation to 1 or 0.5 (the latter to make continuous traits comparable with categorical traits; Gelman, 2008).
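A minimal sketch with hypothetical trait values, showing range standardisation to [0, 1], z-standardisation, and scaling to a standard deviation of 0.5 (i.e. dividing by two standard deviations, as suggested by Gelman, 2008):

    import pandas as pd

    # Hypothetical values of two continuous explanatory traits
    traits = pd.DataFrame({
        "leaf_area_cm2":  [5.0, 12.0, 30.0, 55.0],
        "plant_height_m": [0.2, 0.8, 1.5, 4.0],
    })

    # Range standardisation to [0, 1]
    range01 = (traits - traits.min()) / (traits.max() - traits.min())

    # z-standardisation: mean 0, standard deviation 1
    zscore = (traits - traits.mean()) / traits.std()

    # Scaling by two standard deviations gives a standard deviation of 0.5,
    # which makes continuous predictors comparable with binary ones (Gelman, 2008)
    gelman = (traits - traits.mean()) / (2 * traits.std())

    print(range01, zscore, gelman, sep="\n\n")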

Work with relative errors:

Units are essential when dealing with approximations, uncertainties and errors (Langtangen & Pedersen, 2016). Consider, for example, a length measurement where the approximation 12.5 m deviates from the exact value of 12.52 m by an error of 0.02 m. Switching the unit to mm turns this into an error of 20 mm: a study working in mm would report 20 as the error, while a study working in m would report 0.02, although both describe the same measurement. As a result, knowing the original measurement units is essential, and the downstream use of the unitless relative error (here 0.02/12.52 ≈ 0.0016 in both unit systems) is recommended (Langtangen & Pedersen, 2016).
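The point can be made explicit with a small calculation (using the illustrative numbers from the example above): the absolute error changes with the unit, while the relative error does not:

    # Approximation of 12.5 m to the exact value of 12.52 m
    exact_m, approx_m = 12.52, 12.5

    abs_error_m  = abs(exact_m - approx_m)          # 0.02 (in m)
    abs_error_mm = abs_error_m * 1000               # 20.0 (in mm)

    rel_error_m  = abs_error_m / exact_m            # ~0.0016, unitless
    rel_error_mm = abs_error_mm / (exact_m * 1000)  # ~0.0016, identical

    print(abs_error_m, abs_error_mm, rel_error_m, rel_error_mm)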