In physiology, ecology and evolution, we deal with many different populations, species, experimental designs etc. As such, we’re not just interested in understanding ‘what the overall effect’ actually is (in fact, we may not even care in many cases), but we are mainly focused on attempting to understand what factors (e.g., biological, methodological) explain variation in effects (Gurevitch et al., 2018; Lajeunesse, 2010; Noble et al., 2022). Understanding and quantifying how much variability in effects exist, whether this is more than what we expect by chance, and what factors explain variation is a primary goal of meta-analysis.

Meta-analytic mean estimates need to be interpreted in the context of how much variation in effects exist within and across studies. As such, reporting upon measures of variability, or what is referred to as ‘heterogeneity’ in meta-analysis, is essential to meta-analysis and should never be ignored (even though it often is unfortunately) (Borenstein, 2019; Gurevitch et al., 2018; Nakagawa and Santos, 2012; Nakagawa et al., 2017a; O’Dea et al., 2021).

There are a number of important metrics of heterogeneity that are commonly used and reported upon in the meta-analytic literature (Borenstein, 2019; Nakagawa and Santos, 2012; Nakagawa et al., 2017a). We’ll overview in this tutorial how to calculate, and interpret, many of the common types and what they mean with respect to interpreting the meta-analytic mean. Understanding the consistency of results across studies tells one a great deal about how likely an effect is going to be picked up in future studies and whether we can make broad general conclusions.

The most commonly encountered measures of heterogeneity in comparative physiology, and indeed ecology and evolution more generally, are results of *Q* tests, \(I^2\) metrics (or raw \(\tau^2\) or variance estimates), and less commonly, prediction intervals. These measures of heterogeneity (even if just a few) should always be presented along side meta-analytic mean effect size estimates because they tell the reader a great deal about the ‘consistency’ of results (e.g., prediction intervals, total heterogeneity) or the relative contribution of different factors to effect size variation (i.e., different measures of \(I^2\)). We’ll overview these different metrics and show how to calculate an interpret them in meta-analyses.

For the purpose of this tutorial we’ll assume the following multilevel meta-analytic model is fit as described in the multilevel model tutorial. If you can’t quite remember all the notation we recommend going back to review that page.

\[ y_{i} = \mu + s_{j[i]} + spp_{k[i]} + e_{i} + m_{i} \\ m_{i} \sim N(0, v_{i}) \\ s_{j} \sim N(0, \tau^2) \\ spp_{k} \sim N(0, \sigma_{k}^2) \\ e_{i} \sim N(0, \sigma_{e}^2) \] We will return to the meta-analysis by Pottier et al. (2021) that we already detailed in the multilevel meta-analysis tutorial. We’ll walk through the different types of heterogeneity statistics that we can calculate for the model.

We can re-load that data again and get it ready for analysis using the code below:

```
# install.packages('pacman') ; uncomment this line if you haven't already
# installed 'pacman'
pacman::p_load(metafor, tidyverse)
asr_dat <- read.csv("https://osf.io/qn2af/download")
#' @title arr
#' @description Calculates the acclimation response ratio (ARR).
#' @param t2_l Lowest of the two treatment temperatures
#' @param t1_h Highest of the two treatment temperatures
#' @param x1_h Mean trait value at high temperature
#' @param x2_l Mean trait value at low temperature
#' @param sd1_h Standard deviation of mean trait value at high temperature
#' @param sd2_l Standard deviation of mean trait value at low temperature
#' @param n1_h Sample size at high temperature
#' @param n2_l Sample size at low temperature
arr <- function(x1_h, x2_l, sd1_h, sd2_l, n1_h, n2_l, t1_h, t2_l) {
ARR <- (x1_h - x2_l)/(t1_h - t2_l)
V_ARR <- ((1/(t1_h - t2_l))^2 * (sd2_l^2/n2_l + sd1_h^2/n1_h))
return(data.frame(ARR, V_ARR))
}
# Calculate the effect sizes
asr_dat <- asr_dat %>%
mutate(ARR = arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low,
sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[, 1],
V_ARR = arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low,
sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[,
2]) %>%
filter(sex == "female")
# Re-fit the multilevel meta-analytic model
MLMA <- metafor::rma.mv(yi = ARR ~ 1, V = V_ARR, method = "REML", random = list(~1 |
species_ID, ~1 | study_ID, ~1 | es_ID), dfs = "contain", test = "t", data = asr_dat)
```

We’ll now use this model to calculate various heterogeneity statistics in addition to the ones already calculated (which we will describe below).

\(I^2\) estimates are probably most commonly presented in the literature (Borenstein, 2019; Higgins and Thompson, 2002; Higgins et al., 2003; Nakagawa et al., 2017a; Senior et al., 2016). There are different forms of \(I^2\) that can be calculated, but the one that describes the proportion of effect size variation after accounting for total sampling variation is \(I_{total}^2\) (Nakagawa and Santos, 2012). Assuming we’re using our multilevel model described above, it’s calculated as follows:

\[ \begin{equation} I^2_{total} = \frac{\sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual}}{\sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual} +\sigma^2_{m}} \ \tag{1} \end{equation} \]

where \(\sigma^2_{total} = \sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual} +\sigma^2_{m}\) is the total effect size variance and \(\sigma^2_{m}\) is the ‘typical’ sampling error variance calculated as:

\[ \begin{equation} \sigma_{m}^2 = \sum w_{i}\left( k-1\right) / \left[ \left( \sum w_{i}\right)^2 + \sum w_{i}^2\right] \ \tag{2} \end{equation} \]

where *k* is the number of studies and the weights, \(w_{i} = \frac{1}{v_{i}}\), can be calculated using the inverse of the sampling variance (\(v_{i}\)) for each effect size, *i*. Below, we’ll make use of the `orchaRd`

R package [vers. 2.0; Nakagawa et al. (2021)] to calculate various \(I^2\) metrics. In fact, we can simply change what is in the numerator of eqn (1) to calculate a multitude of different \(I^2\) metrics that can be useful in understanding the relative importance of factors most important in explaining effect size variation. We’ll do this below:

```
# The orchaRd package has some convenient functions for calculating various I2
# estimates including total. We'll load and install that package
# install.packages('pacman')
pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)
# devtools::install_github('daniel1noble/orchaRd', force = TRUE,
# build_vignettes = TRUE)
library(orchaRd)
orchaRd::i2_ml(MLMA, data = asr_dat)
```

```
## I2_Total I2_species_ID I2_study_ID I2_es_ID
## 98.42 2.98 58.48 36.96
```

Interpret the meaning of \(I_{Total}^2\) from the multilevel meta-analytic model

Overall, we have highly heterogeneous effect size data because sampling variation only contributes to 1.581% of the total variation in effects.

Interpret the meaning of \(I_{study}^2\) from the multilevel meta-analytic model

From the multilevel meta-analytic model we find that 58.481% of the total variation in effect size estimates is the result of differences between studies.

There are also times that we may want to estimate, and present, uncertainty about these heterogeneity estimates. We can do that by bootstrapping 1000 times:

```
# The orchaRd package has some convenient functions for calculating various I2
# estimates including total. We'll load and install that package
# install.packages('pacman')
pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)
# devtools::install_github('daniel1noble/orchaRd', force = TRUE,
# build_vignettes = TRUE)
library(orchaRd)
orchaRd::i2_ml(MLMA, data = asr_dat, boot = 1000)
```

```
## Est. 2.5% 97.5%
## I2_Total 98.36 97.2 99.0
## I2_species_ID 1.68 0.0 28.1
## I2_study_ID 55.89 19.8 76.5
## I2_es_ID 37.72 19.1 64.0
```

Here, we can now see that \(I_{total}^2\) has a fairly narrow 95% confidence interval of ~ 0.97 to 0.99.

Prediction intervals (PI) are probably the best and most intuitive way to report heterogeneity of meta-analytic results (Borenstein, 2019; Nakagawa et al., 2021; Noble et al., 2022). Predictions intervals tell us how much we can expect a given effect size to vary across studies. More specifically, if we were to go out and conduct another study or experiment they tell us what the range of effect size estimates we are expected to observe 95% of the time from that new study (Borenstein, 2019; Nakagawa et al., 2021; Noble et al., 2022).

Prediction intervals can be calculated in a similar way to confidence intervals but instead of just using the standard error, like we do with confidence interval construction we add in the extra random effect variance estimates. For our model above, PI can be calculated as follows:

\[ \begin{equation} PI \sim \bar{u} \pm 1.96 \sqrt{SE^2 + \sigma^2_{study} + \sigma^2_{species} + \sigma^2_{residual}} \tag{3} \end{equation} \]

We can get these quite easily using the `predict`

function in `metafor`

:

```
# Calculate the prediction intervals
predict(MLMA)
```

```
##
## pred se ci.lb ci.ub pi.lb pi.ub
## 0.1668 0.0316 0.1010 0.2327 -0.1755 0.5091
```

If the meta-analytic mean ARR is 0.167 what would be the expected range of effect size we would expect in a future study? Are the studies consistent or inconsistent?

Our 95% prediction intervals are wide. Effect sizes (ARR) we are expected to get range from -0.175 to 0.509, suggesting a lot of inconsistency between studies.

So far we’ve talked about important heterogeneity statistics that describe, and quantify, the proportion of variability (\(I^2\)) or the expected range of plausible effect size values which incorporate all the variance estimates (PI). But, how do we know that the heterogeneity in the data is greater than what we would expect by chance? It seems obvious that, if the \(I_{total}^2\) is really high (i.e., >85%) there’s a lot of variability in effects within and across studies. While these statistics are probably sufficient, *Q* tests can also be useful to more formally be able to make an inferential statement about whether heterogeneity is significantly greater than chance.

We have already encountered how to calculate the *Q*-statistic when discussing fixed and random effects models. We can use the *Q* statistic now to compare this against a \(\chi^2\)-distribution to calculate a p-value. Wolfgang’s documentation (`?rma.mv`

) describes what the *Q*-test means:

Cochran’s 𝑄-test, which tests whether the variability in the observed effect sizes or outcomes is larger than one would expect based on sampling variability (and the given covariances among the sampling errors) alone. A significant test suggests that the true effects/outcomes are heterogeneous.

We don’t need to do anything special to get this test. In fact, this is automatically done by `metafor`

:

`print(MLMA)`

```
##
## Multivariate Meta-Analysis Model (k = 123; method: REML)
##
## Variance Components:
##
## estim sqrt nlvls fixed factor
## sigma^2.1 0.0008 0.0280 29 no species_ID
## sigma^2.2 0.0154 0.1241 21 no study_ID
## sigma^2.3 0.0097 0.0987 123 no es_ID
##
## Test for Heterogeneity:
## Q(df = 122) = 3941.0055, p-val < .0001
##
## Model Results:
##
## estimate se tval df pval ci.lb ci.ub
## 0.1668 0.0316 5.2857 20 <.0001 0.1010 0.2327 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

How would you describe in a paper theQ-test from the`metafor`

model above?

Of course, there are a multitude of ways that we could describe this test in words, but one way to describe it in your results section might be as follows:

The overall mean ARR across ectotherms was 0.167 and differed significantly from zero (95% CI: 0.101 to 0.233, df = 20, t = 5.286, p < 0.001). However, there was significant residual heterogeneity (

Q= 3941.005, df = 122, p < 0.001) with estimates expected to range from -0.175 to 0.509 (95% Prediction Interval) (\(I_{total}^2\) = 98.419).

You will probably have noticed that \(I^2\) looks awfully familiar. It looks a lot like the way we might calculate \(R^2\) described by Nakagawa and Schielzeth (2013). Well, you would be correct! Often we also want to quantify how much variation is explained by our moderators and random effects in our multilevel meta-regression model (Nakagawa and Schielzeth, 2013; Nakagawa et al., 2017b). This is done using \(R^2\). We can calculate different \(R^2\) values depending on whether we ignore or include random effects in the numerator. If only the variation explained by moderators is included in the numerator than we call this \(R_{marginal}^2\), which is defined as:

\[ \begin{equation} R^2_{marginal} = \frac{\sigma^2_{fixed}}{\sigma^2_{fixed} + \sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual}} \ \tag{4} \end{equation} \]

Note that this formula does not include \(\sigma^2_{m}\) as sampling error variance is assumed to be known in meta-analysis. We won’t go into the details on how to calculate and interpret these because we’ll cover this in a later tutorial on publication bias. Suffice it to say that this is a useful statistic for describing how much variation your moderators or model explain.

Hopefully it’s clear from this tutorial why explicitly estimating and reporting upon heterogeneity is so critically important in meta-analysis. Of course, some of these measures are only possible if we known the total sampling variance, which is not possible if one doesn’t use a weighted meta-analytic model. While all heterogeneity measures are useful, we agree with Borenstein (2019) and Nakagawa et al. (2021) that prediction intervals are probably the best to present given they are a very intuitive interpretation of heterogeneity.

**R version 4.0.3 (2020-10-10)**

**Platform:** x86_64-apple-darwin17.0 (64-bit)

**attached base packages:**
*stats*, *graphics*, *grDevices*, *utils*, *datasets*, *methods* and *base*

**other attached packages:**
*emmeans(v.1.7.5)*, *R.rsp(v.0.44.0)*, *patchwork(v.1.1.1)*, *devtools(v.2.4.3)*, *usethis(v.2.1.5)*, *magick(v.2.7.2)*, *vembedr(v.0.1.5)*, *equatags(v.0.1.0)*, *mathjaxr(v.1.6-0)*, *pander(v.0.6.4)*, *orchaRd(v.2.0)*, *forcats(v.0.5.1)*, *stringr(v.1.4.0)*, *dplyr(v.1.0.9)*, *purrr(v.0.3.4)*, *readr(v.2.1.2)*, *tidyr(v.1.2.0)*, *tibble(v.3.1.7)*, *ggplot2(v.3.3.6)*, *tidyverse(v.1.3.1)*, *flextable(v.0.6.6)*, *metafor(v.3.4-0)*, *metadat(v.1.3-0)* and *Matrix(v.1.3-4)*

**loaded via a namespace (and not attached):**
*TH.data(v.1.0-10)*, *colorspace(v.2.0-3)*, *ellipsis(v.0.3.2)*, *rprojroot(v.2.0.3)*, *estimability(v.1.3)*, *base64enc(v.0.1-3)*, *fs(v.1.5.2)*, *rstudioapi(v.0.13)*, *remotes(v.2.4.2)*, *mvtnorm(v.1.1-3)*, *fansi(v.1.0.3)*, *lubridate(v.1.8.0)*, *xml2(v.1.3.3)*, *codetools(v.0.2-18)*, *splines(v.4.0.3)*, *R.methodsS3(v.1.8.1)*, *cachem(v.1.0.6)*, *knitr(v.1.39)*, *pkgload(v.1.2.4)*, *jsonlite(v.1.8.0)*, *broom(v.1.0.0)*, *dbplyr(v.2.2.1)*, *R.oo(v.1.24.0)*, *compiler(v.4.0.3)*, *httr(v.1.4.3)*, *backports(v.1.4.1)*, *assertthat(v.0.2.1)*, *fastmap(v.1.1.0)*, *cli(v.3.3.0)*, *formatR(v.1.11)*, *htmltools(v.0.5.2)*, *prettyunits(v.1.1.1)*, *tools(v.4.0.3)*, *coda(v.0.19-4)*, *gtable(v.0.3.0)*, *glue(v.1.6.2)*, *rappdirs(v.0.3.3)*, *Rcpp(v.1.0.8.3)*, *cellranger(v.1.1.0)*, *jquerylib(v.0.1.4)*, *vctrs(v.0.4.1)*, *nlme(v.3.1-152)*, *xfun(v.0.31)*, *ps(v.1.7.1)*, *brio(v.1.1.3)*, *testthat(v.3.1.4)*, *rvest(v.1.0.2)*, *lifecycle(v.1.0.1)*, *pacman(v.0.5.1)*, *klippy(v.0.0.0.9500)*, *MASS(v.7.3-54)*, *zoo(v.1.8-10)*, *scales(v.1.2.0)*, *hms(v.1.1.1)*, *sandwich(v.3.0-1)*, *yaml(v.2.3.5)*, *curl(v.4.3.2)*, *memoise(v.2.0.1)*, *gdtools(v.0.2.3)*, *sass(v.0.4.1)*, *stringi(v.1.7.6)*, *desc(v.1.4.1)*, *pkgbuild(v.1.3.1)*, *zip(v.2.2.0)*, *rlang(v.1.0.3)*, *pkgconfig(v.2.0.3)*, *systemfonts(v.1.0.2)*, *evaluate(v.0.15)*, *lattice(v.0.20-44)*, *tidyselect(v.1.1.2)*, *processx(v.3.6.1)*, *magrittr(v.2.0.3)*, *bookdown(v.0.22)*, *R6(v.2.5.1)*, *generics(v.0.1.2)*, *multcomp(v.1.4-17)*, *DBI(v.1.1.3)*, *pillar(v.1.7.0)*, *haven(v.2.5.0)*, *withr(v.2.5.0)*, *survival(v.3.2-11)*, *modelr(v.0.1.8)*, *crayon(v.1.5.1)*, *locatexec(v.0.1.1)*, *uuid(v.1.1-0)*, *utf8(v.1.2.2)*, *tzdb(v.0.3.0)*, *rmarkdown(v.2.14)*, *officer(v.0.3.18)*, *progress(v.1.2.2)*, *grid(v.4.0.3)*, *readxl(v.1.4.0)*, *data.table(v.1.14.2)*, *callr(v.3.7.0)*, *xslt(v.1.4.3)*, *reprex(v.2.0.1)*, *digest(v.0.6.29)*, *xtable(v.1.8-4)*, *R.cache(v.0.15.0)*, *R.utils(v.2.11.0)*, *munsell(v.0.5.0)*, *bslib(v.0.3.1)* and *sessioninfo(v.1.2.2)*