Ecological and evolutionary meta-analysis more often than not need much more sophisticated multilevel meta-analytic models (Hadfield, 2010; Nakagawa and Santos, 2012; Noble et al., 2017) compared with random effect models to analyse data. Biological data from field and lab-based experiments are highly structured – many studies could be done on the same species or they could be done by the same lab group or student sharing similar methodology (Nakagawa and Santos, 2012; Noble et al., 2017). In addition, we might have many effect size estimates from a single study because many traits are measured, or many treatments are applied (Noble et al., 2017). This adds complexity to the data set and also results in special types of non-independence that are somewhat unique to meta-analysis – especially when using contrast-based effect sizes like Hedges’ g, log response ratios, log odds ratios etc (Noble et al., 2017).

In the past, complex non-independence was often dealt with by: **1)** conducting separate meta-analyses on subsets of non-independent data sets, usually using random effect models; **2)** taking one or only a subset of effect size estimates from a single study to reduce non-independence; and/or **3)** averaging effect size estimates from a single study and modelling the average (Nakagawa and Santos, 2012). **These approaches can be problematic because disposing of useful effect size data will result in a loss of statistical power or even inflated Type I error rates if average effect size estimates were not calculated accounting for non-independence (Nakagawa et al., 2022)**. Furthermore, removing effect size estimates from an analysis prevents estimation of biological and methodologically relevant information that can help shape the field (Nakagawa and Santos, 2012; Nakagawa et al., 2022). While the above approaches could still be useful at times, where possible, we recommend you model all available data (Nakagawa and Santos, 2012; Nakagawa et al., 2022; Noble et al., 2017).

In the next series of tutorials we’ll overview multilevel meta-analytic models and discuss common forms of non-independence and how to deal with these challenges. Here, we’ll first focus on simple multilevel meta-analysis models and how to interpret these. We’ll then work our way up to including more complex forms of non-independence, such as phylogenetic meta-analysis and accounting for shared control. Technically, all these sources could be built into a single model but will require a substantial amount of data. Sometimes it will not be possible to deal with all sources of non-independence, in which case it’s important to conduct sensitivity analyses to understand how your conclusions and inferences are affected by the analytic decisions made (Noble et al., 2017). **Strong justification for your choices is always essential**.

Recall that a random effects model is defined as follows:

\[
y_{i} = \mu + s_{i} + m_{i} \\
m_{i} \sim N(0, v_{i}) \\
s_{i} \sim N(0, \tau^2)
\]
In this model, \(y_{i}\) is the *i*th effect size estimate and \(m_{i}\) is the sampling error (deviation from \(\mu\)) for effect size *i*. Given that we have a single effect size estimate from each study we can include \(s_{i}\), which is the study specific deviation, and this gives us an ability to estimate a “residual” variance from this model. That’s because after we remove the known sampling variance there is likely to be “extra” variability among effect sizes at the effect/study level (in this case these are confounded because we have one effect / study). We can estimate that explicitly which is what we are doing here. **The challenge with having a single effect estimate per study is that the within and between study variances are confounded**. In other words, we cannot estimate both.

More commonly, we can extract many (sometimes a lot) effect size estimates from a single study. In such case, we no longer have a single effect size estimate from a given study but a whole bunch of effects from a single study. These could be effects from different experimental treatments, at different time points or on different traits. These effects are not independent of each other and there is likely a substantial amount of within study variation in addition to the across study variation (Nakagawa and Santos, 2012). We can now estimate these two sources of variability explicitly turning our model into what is the simplest form of a multilevel meta-analytic model:

\[ y_{i,j} = \mu + s_{j} + e_{i,j} + m_{i,j} \\ m_{i,j} \sim N(0, v_{i,j}) \\ s_{j} \sim N(0, \tau^2) \\ e_{i,j} \sim N(0, \sigma_{e}^2) \]

You’ll notice that the notation changes slightly – \(y_{i,j}\) is now the *i* effect size from study *j* and \(m_{i,j}\) is effect size *i*’s known sampling variance from study *j*. \(s_{j}\) is now the study specific effect, *j*, applied to the *i*th effect size, where *j* = 1,…,\(N_{studies}\). We have also now added a random effect (\(e_{i,j}\)) to the model. **\(e_{i,j}\) is the effect-size-specific (within study) effect. This is often referred to as a ‘residual’ effect (Nakagawa and Santos, 2012) and is used to estimate a residual variance**.

In reality, we may also have lots of studies on a specific set of species. For example, common or economically important species, like Salmon, Trout or lab model organisms like Zebrafish might be over-represented in the data set. This adds an additional level of structure to our data. We can account for species in this model by adding another random effect that captures the similarities between effect size estimates from the same species. As such, we can expand our multilevel model as follows:

\[ y_{i,j,k} = \mu + s_{j} + spp_{k} + e_{i,j,k} + m_{i,j,k} \\ m_{i,j,k} \sim N(0, v_{i,j}) \\ s_{j} \sim N(0, \tau^2) \\ spp_{k} \sim N(0, \sigma_{k}^2) \\ e_{i,j,k} \sim N(0, \sigma_{e}^2) \]

Here, \(y_{i,j, k}\) is now the *i*th effect size from study *j* and species *k*. \(m_{i,j, k}\) is effect size *i*’s known sampling variance from study *j* and species *k*. \(e_{i,j, k}\) is the effect-size-specific (within study) applied to each effect *i* from study *j* and species *k*. \(s_{j}\) is still the study specific effect, *j*, applied to the *i*th effect size, where *j* = 1,…,\(N_{studies}\), but now \(spp_{k}\) is the species specific effect, *k*, applied to the *i*th effect size, where *k* = 1,…,\(N_{species}\).

As you can imagine, the notation can get quite complicated the more levels you add. As such, a simplification of the above model notation that you might read (e.g., Nakagawa and Santos, 2012; Noble et al., 2017) is as follows:

\[ y_{i} = \mu + s_{j[i]} + spp_{k[i]} + e_{i} + m_{i} \\ m_{i} \sim N(0, v_{i}) \\ s_{j} \sim N(0, \tau^2) \\ spp_{k} \sim N(0, \sigma_{k}^2) \\ e_{i} \sim N(0, \sigma_{e}^2) \]

To demonstrate how to fit a multilevel meta-analytic model similar to the one discussed above we’ll turn to a familiar example by Pottier et al. (2021), who used the acclimation response ratio (ARR) to analyse thermal acclimation responses on \(CT_{max}\) and how this differs between the sexes. We’ll keep it simple for the moment and focus exclusively on a single sex.

We’ll download the raw data from the paper again. We’ll also need to load the `metafor`

package to fit our models and the tidyverse package for processing the data.

```
# install.packages('pacman') ; uncomment this line if you haven't already
# installed 'pacman'
pacman::p_load(metafor, tidyverse)
asr_dat <- read.csv("https://osf.io/qn2af/download")
```

We’ll also need our function for calculating ARR and its sampling variance because these don’t exist in any current packages.

```
#' @title arr
#' @description Calculates the acclimation response ratio (ARR).
#' @param t2_l Lowest of the two treatment temperatures
#' @param t1_h Highest of the two treatment temperatures
#' @param x1_h Mean trait value at high temperature
#' @param x2_l Mean trait value at low temperature
#' @param sd1_h Standard deviation of mean trait value at high temperature
#' @param sd2_l Standard deviation of mean trait value at low temperature
#' @param n1_h Sample size at high temperature
#' @param n2_l Sample size at low temperature
arr <- function(x1_h, x2_l, sd1_h, sd2_l, n1_h, n2_l, t1_h, t2_l) {
ARR <- (x1_h - x2_l)/(t1_h - t2_l)
V_ARR <- ((1/(t1_h - t2_l))^2 * (sd2_l^2/n2_l + sd1_h^2/n1_h))
return(data.frame(ARR, V_ARR))
}
```

Using the `arr`

function above we can now calculate the acclimation response ratio (ARR). Given Pottier et al. (2021) were interested in comparing the sexes, they calculated the ARR difference between males and females. To simplify here, we’ll just focus on estimating the ARR for a single sex, females. As such, we’ll calculate our ARR effect size and sampling variance and then sub-set out the data on females.

```
# Calculate the effect sizes
asr_dat <- asr_dat %>%
mutate(ARR = arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low,
sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[, 1],
V_ARR = arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low,
sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[,
2]) %>%
filter(sex == "female")
```

`metafor`

Our data is now ready for analysis! We will use the `rma.mv`

function in `metafor`

to fit our first multilevel meta-analysis model. The model takes the ARR as a response variable (`yi`

) and accounts for it’s sampling variance (`V`

= V_ARR). We are also going to estimate a random effect variance for species, between study and within-study grouping variables to control for non-independence and understand sources driving effect size variability. These are all denoted in the `random`

argument of `metafor`

which takes a list of all the random effects.

You will notice two additional arguments being specified:

The first,

`dfs`

, which calculates the degrees of freedom for the inferential tests in the model.`dfs = "contain"`

will calculate the degrees of freedom based on the lowest clustering level minus 1 (in this case study). Simulations have shown that this is a more robust measure to protect against type I errors (Nakagawa et al., 2022).The second argument,

`test`

specifies the test statistic used.`metafor`

defaults to z values based on the standard normal distribution, however, simulations have shown (Pappalardo et al., 2020) and it has been recommended (Rosenberg, 2013), that the t-distribution be used because samples sizes are often small and the true variance of the test is being estimated.

```
# Multi-level meta-analytic model
MLMA <- metafor::rma.mv(yi = ARR ~ 1, V = V_ARR, method = "REML", random = list(~1 |
species_ID, ~1 | study_ID, ~1 | es_ID), dfs = "contain", test = "t", data = asr_dat)
print(MLMA)
```

```
##
## Multivariate Meta-Analysis Model (k = 123; method: REML)
##
## Variance Components:
##
## estim sqrt nlvls fixed factor
## sigma^2.1 0.0008 0.0280 29 no species_ID
## sigma^2.2 0.0154 0.1241 21 no study_ID
## sigma^2.3 0.0097 0.0987 123 no es_ID
##
## Test for Heterogeneity:
## Q(df = 122) = 3941.0055, p-val < .0001
##
## Model Results:
##
## estimate se tval df pval ci.lb ci.ub
## 0.1668 0.0316 5.2857 20 <.0001 0.1010 0.2327 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

What is the overall mean estimate and variance of the model, and how much heterogeneity exist?

We can see from above that the model estimates an

overall meta-analytic mean as 0.167 with a 95% confidence interval of 0.101 to 0.233.In other words, for each degree variation in acclimation temperature female heat tolerance changes by an average of 0.167\(^{\circ}\)C.

We can also see that there’s a significant amount of heterogeneity among our effect size estimates with our

Q-test being significant. Some of this heterogeneity seems to be driven by variation among studies, within studies and among species as judged by the variance components of the model. We’ll probe how to dissect and discuss this heterogeneity in later tutorials.

**R version 4.0.3 (2020-10-10)**

**Platform:** x86_64-apple-darwin17.0 (64-bit)

**attached base packages:**
*stats*, *graphics*, *grDevices*, *utils*, *datasets*, *methods* and *base*

**other attached packages:**
*magick(v.2.7.2)*, *vembedr(v.0.1.5)*, *equatags(v.0.1.0)*, *mathjaxr(v.1.6-0)*, *pander(v.0.6.4)*, *orchaRd(v.2.0)*, *forcats(v.0.5.1)*, *stringr(v.1.4.0)*, *dplyr(v.1.0.9)*, *purrr(v.0.3.4)*, *readr(v.2.1.2)*, *tidyr(v.1.2.0)*, *tibble(v.3.1.7)*, *ggplot2(v.3.3.6)*, *tidyverse(v.1.3.1)*, *flextable(v.0.6.6)*, *metafor(v.3.4-0)*, *metadat(v.1.3-0)* and *Matrix(v.1.3-4)*

**loaded via a namespace (and not attached):**
*nlme(v.3.1-152)*, *fs(v.1.5.2)*, *lubridate(v.1.8.0)*, *httr(v.1.4.3)*, *tools(v.4.0.3)*, *backports(v.1.4.1)*, *bslib(v.0.3.1)*, *utf8(v.1.2.2)*, *R6(v.2.5.1)*, *DBI(v.1.1.3)*, *colorspace(v.2.0-3)*, *withr(v.2.5.0)*, *tidyselect(v.1.1.2)*, *curl(v.4.3.2)*, *compiler(v.4.0.3)*, *cli(v.3.3.0)*, *rvest(v.1.0.2)*, *formatR(v.1.11)*, *pacman(v.0.5.1)*, *xml2(v.1.3.3)*, *officer(v.0.3.18)*, *bookdown(v.0.22)*, *sass(v.0.4.1)*, *scales(v.1.2.0)*, *rappdirs(v.0.3.3)*, *systemfonts(v.1.0.2)*, *digest(v.0.6.29)*, *rmarkdown(v.2.14)*, *base64enc(v.0.1-3)*, *pkgconfig(v.2.0.3)*, *htmltools(v.0.5.2)*, *dbplyr(v.2.2.1)*, *fastmap(v.1.1.0)*, *rlang(v.1.0.3)*, *readxl(v.1.4.0)*, *rstudioapi(v.0.13)*, *jquerylib(v.0.1.4)*, *generics(v.0.1.2)*, *jsonlite(v.1.8.0)*, *zip(v.2.2.0)*, *magrittr(v.2.0.3)*, *Rcpp(v.1.0.8.3)*, *munsell(v.0.5.0)*, *fansi(v.1.0.3)*, *gdtools(v.0.2.3)*, *lifecycle(v.1.0.1)*, *stringi(v.1.7.6)*, *yaml(v.2.3.5)*, *grid(v.4.0.3)*, *crayon(v.1.5.1)*, *lattice(v.0.20-44)*, *haven(v.2.5.0)*, *hms(v.1.1.1)*, *knitr(v.1.39)*, *klippy(v.0.0.0.9500)*, *pillar(v.1.7.0)*, *uuid(v.1.1-0)*, *reprex(v.2.0.1)*, *xslt(v.1.4.3)*, *glue(v.1.6.2)*, *evaluate(v.0.15)*, *data.table(v.1.14.2)*, *remotes(v.2.4.2)*, *modelr(v.0.1.8)*, *vctrs(v.0.4.1)*, *tzdb(v.0.3.0)*, *cellranger(v.1.1.0)*, *gtable(v.0.3.0)*, *assertthat(v.0.2.1)*, *locatexec(v.0.1.1)*, *xfun(v.0.31)*, *broom(v.1.0.0)* and *ellipsis(v.0.3.2)*