Introduction to Heterogeneity in Meta-analysis

In physiology, ecology and evolution, we deal with many different populations, species, experimental designs etc. As such, we’re not just interested in understanding ‘what the overall effect’ actually is (in fact, we may not even care in many cases), but we are mainly focused on attempting to understand what factors (e.g., biological, methodological) explain variation in effects (Gurevitch et al., 2018; Lajeunesse, 2010; Noble et al., 2022). Understanding and quantifying how much variability in effects exist, whether this is more than what we expect by chance, and what factors explain variation is a primary goal of meta-analysis.

Meta-analytic mean estimates need to be interpreted in the context of how much variation in effects exist within and across studies. As such, reporting upon measures of variability, or what is referred to as ‘heterogeneity’ in meta-analysis, is essential to meta-analysis and should never be ignored (even though it often is unfortunately) (Borenstein, 2019; Gurevitch et al., 2018; Nakagawa and Santos, 2012; Nakagawa et al., 2017a; O’Dea et al., 2021).

There are a number of important metrics of heterogeneity that are commonly used and reported upon in the meta-analytic literature (Borenstein, 2019; Nakagawa and Santos, 2012; Nakagawa et al., 2017a). We’ll overview in this tutorial how to calculate, and interpret, many of the common types and what they mean with respect to interpreting the meta-analytic mean. Understanding the consistency of results across studies tells one a great deal about how likely an effect is going to be picked up in future studies and whether we can make broad general conclusions.

Measures of Heterogeneity

The most commonly encountered measures of heterogeneity in comparative physiology, and indeed ecology and evolution more generally, are results of Q tests, \(I^2\) metrics (or raw \(\tau^2\) or variance estimates), and less commonly, prediction intervals. These measures of heterogeneity (even if just a few) should always be presented along side meta-analytic mean effect size estimates because they tell the reader a great deal about the ‘consistency’ of results (e.g., prediction intervals, total heterogeneity) or the relative contribution of different factors to effect size variation (i.e., different measures of \(I^2\)). We’ll overview these different metrics and show how to calculate an interpret them in meta-analyses.

For the purpose of this tutorial we’ll assume the following multilevel meta-analytic model is fit as described in the multilevel model tutorial. If you can’t quite remember all the notation we recommend going back to review that page.

\[ y_{i} = \mu + s_{j[i]} + spp_{k[i]} + e_{i} + m_{i} \\ m_{i} \sim N(0, v_{i}) \\ s_{j} \sim N(0, \tau^2) \\ spp_{k} \sim N(0, \sigma_{k}^2) \\ e_{i} \sim N(0, \sigma_{e}^2) \] We will return to the meta-analysis by Pottier et al. (2021) that we already detailed in the multilevel meta-analysis tutorial. We’ll walk through the different types of heterogeneity statistics that we can calculate for the model.

We can re-load that data again and get it ready for analysis using the code below:

# install.packages('pacman') ; uncomment this line if you haven't already
# installed 'pacman'
pacman::p_load(metafor, tidyverse)

asr_dat <- read.csv("")

#' @title arr
#' @description Calculates the acclimation response ratio (ARR).  
#' @param t2_l  Lowest of the two treatment temperatures
#' @param t1_h  Highest of the two treatment temperatures
#' @param x1_h  Mean trait value at high temperature
#' @param x2_l  Mean trait value at low temperature
#' @param sd1_h Standard deviation of mean trait value at high temperature
#' @param sd2_l Standard deviation of mean trait value at low temperature
#' @param n1_h  Sample size at high temperature
#' @param n2_l  Sample size at low temperature

arr <- function(x1_h, x2_l, sd1_h, sd2_l, n1_h, n2_l, t1_h, t2_l) {
    ARR <- (x1_h - x2_l)/(t1_h - t2_l)
    V_ARR <- ((1/(t1_h - t2_l))^2 * (sd2_l^2/n2_l + sd1_h^2/n1_h))
    return(data.frame(ARR, V_ARR))

# Calculate the effect sizes
asr_dat <- asr_dat %>%
    mutate(ARR = arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low,
        sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[, 1],
        V_ARR = arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low,
            sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[,
            2]) %>%
    filter(sex == "female")

# Re-fit the multilevel meta-analytic model
MLMA <- = ARR ~ 1, V = V_ARR, method = "REML", random = list(~1 |
    species_ID, ~1 | study_ID, ~1 | es_ID), dfs = "contain", test = "t", data = asr_dat)

We’ll now use this model to calculate various heterogeneity statistics in addition to the ones already calculated (which we will describe below).

Proportion of Total Heterogeneity: \(I_{total}^2\)

\(I^2\) estimates are probably most commonly presented in the literature (Borenstein, 2019; Higgins and Thompson, 2002; Higgins et al., 2003; Nakagawa et al., 2017a; Senior et al., 2016). There are different forms of \(I^2\) that can be calculated, but the one that describes the proportion of effect size variation after accounting for total sampling variation is \(I_{total}^2\) (Nakagawa and Santos, 2012). Assuming we’re using our multilevel model described above, it’s calculated as follows:

\[ \begin{equation} I^2_{total} = \frac{\sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual}}{\sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual} +\sigma^2_{m}} \ \tag{1} \end{equation} \]

where \(\sigma^2_{total} = \sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual} +\sigma^2_{m}\) is the total effect size variance and \(\sigma^2_{m}\) is the ‘typical’ sampling error variance calculated as:

\[ \begin{equation} \sigma_{m}^2 = \sum w_{i}\left( k-1\right) / \left[ \left( \sum w_{i}\right)^2 + \sum w_{i}^2\right] \ \tag{2} \end{equation} \]

where k is the number of studies and the weights, \(w_{i} = \frac{1}{v_{i}}\), can be calculated using the inverse of the sampling variance (\(v_{i}\)) for each effect size, i. Below, we’ll make use of the orchaRd R package [vers. 2.0; Nakagawa et al. (2021)] to calculate various \(I^2\) metrics. In fact, we can simply change what is in the numerator of eqn (1) to calculate a multitude of different \(I^2\) metrics that can be useful in understanding the relative importance of factors most important in explaining effect size variation. We’ll do this below:

# The orchaRd package has some convenient functions for calculating various I2
# estimates including total. We'll load and install that package
# install.packages('pacman')
pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)

# devtools::install_github('daniel1noble/orchaRd', force = TRUE,
# build_vignettes = TRUE)

orchaRd::i2_ml(MLMA, data = asr_dat)
##      I2_Total I2_species_ID   I2_study_ID      I2_es_ID 
##         98.42          2.98         58.48         36.96

Interpreting \(I^2\) Estimates

First Task!

Interpret the meaning of \(I_{Total}^2\) from the multilevel meta-analytic model

First Answer!

Overall, we have highly heterogeneous effect size data because sampling variation only contributes to 1.581% of the total variation in effects.

Second Task!

Interpret the meaning of \(I_{study}^2\) from the multilevel meta-analytic model

Second Answer!

From the multilevel meta-analytic model we find that 58.481% of the total variation in effect size estimates is the result of differences between studies.

Bootstrapping \(I^2\) Estimates

There are also times that we may want to estimate, and present, uncertainty about these heterogeneity estimates. We can do that by bootstrapping 1000 times:

# The orchaRd package has some convenient functions for calculating various I2
# estimates including total. We'll load and install that package
# install.packages('pacman')
pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)

# devtools::install_github('daniel1noble/orchaRd', force = TRUE,
# build_vignettes = TRUE)

orchaRd::i2_ml(MLMA, data = asr_dat, boot = 1000)
##                Est. 2.5% 97.5%
## I2_Total      98.36 97.2  99.0
## I2_species_ID  1.68  0.0  28.1
## I2_study_ID   55.89 19.8  76.5
## I2_es_ID      37.72 19.1  64.0

Here, we can now see that \(I_{total}^2\) has a fairly narrow 95% confidence interval of ~ 0.97 to 0.99.

Prediction Intervals

Prediction intervals (PI) are probably the best and most intuitive way to report heterogeneity of meta-analytic results (Borenstein, 2019; Nakagawa et al., 2021; Noble et al., 2022). Predictions intervals tell us how much we can expect a given effect size to vary across studies. More specifically, if we were to go out and conduct another study or experiment they tell us what the range of effect size estimates we are expected to observe 95% of the time from that new study (Borenstein, 2019; Nakagawa et al., 2021; Noble et al., 2022).

Prediction intervals can be calculated in a similar way to confidence intervals but instead of just using the standard error, like we do with confidence interval construction we add in the extra random effect variance estimates. For our model above, PI can be calculated as follows:

\[ \begin{equation} PI \sim \bar{u} \pm 1.96 \sqrt{SE^2 + \sigma^2_{study} + \sigma^2_{species} + \sigma^2_{residual}} \tag{3} \end{equation} \]

We can get these quite easily using the predict function in metafor:

# Calculate the prediction intervals
##    pred     se  ci.ub  pi.ub 
##  0.1668 0.0316 0.1010 0.2327 -0.1755 0.5091

Interpreting Predcition Intervals


If the meta-analytic mean ARR is 0.167 what would be the expected range of effect size we would expect in a future study? Are the studies consistent or inconsistent?


Our 95% prediction intervals are wide. Effect sizes (ARR) we are expected to get range from -0.175 to 0.509, suggesting a lot of inconsistency between studies.

Q-tests: Inferential Tests to Determine Excess Heterogeneity

So far we’ve talked about important heterogeneity statistics that describe, and quantify, the proportion of variability (\(I^2\)) or the expected range of plausible effect size values which incorporate all the variance estimates (PI). But, how do we know that the heterogeneity in the data is greater than what we would expect by chance? It seems obvious that, if the \(I_{total}^2\) is really high (i.e., >85%) there’s a lot of variability in effects within and across studies. While these statistics are probably sufficient, Q tests can also be useful to more formally be able to make an inferential statement about whether heterogeneity is significantly greater than chance.

We have already encountered how to calculate the Q-statistic when discussing fixed and random effects models. We can use the Q statistic now to compare this against a \(\chi^2\)-distribution to calculate a p-value. Wolfgang’s documentation (? describes what the Q-test means:

Cochran’s 𝑄-test, which tests whether the variability in the observed effect sizes or outcomes is larger than one would expect based on sampling variability (and the given covariances among the sampling errors) alone. A significant test suggests that the true effects/outcomes are heterogeneous.

We don’t need to do anything special to get this test. In fact, this is automatically done by metafor:

## Multivariate Meta-Analysis Model (k = 123; method: REML)
## Variance Components:
##             estim    sqrt  nlvls  fixed      factor 
## sigma^2.1  0.0008  0.0280     29     no  species_ID 
## sigma^2.2  0.0154  0.1241     21     no    study_ID 
## sigma^2.3  0.0097  0.0987    123     no       es_ID 
## Test for Heterogeneity:
## Q(df = 122) = 3941.0055, p-val < .0001
## Model Results:
## estimate      se    tval  df    pval   ci.ub     ​ 
##   0.1668  0.0316  5.2857  20  <.0001  0.1010  0.2327  *** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Describing the Q-test


How would you describe in a paper the Q-test from the metafor model above?


Of course, there are a multitude of ways that we could describe this test in words, but one way to describe it in your results section might be as follows:

The overall mean ARR across ectotherms was 0.167 and differed significantly from zero (95% CI: 0.101 to 0.233, df = 20, t = 5.286, p < 0.001). However, there was significant residual heterogeneity (Q = 3941.005, df = 122, p < 0.001) with estimates expected to range from -0.175 to 0.509 (95% Prediction Interval) (\(I_{total}^2\) = 98.419).

Proportion of Variation Explained by the Model: \(R^2_{marginal}\) and \(R^2_{conditional}\)

You will probably have noticed that \(I^2\) looks awfully familiar. It looks a lot like the way we might calculate \(R^2\) described by Nakagawa and Schielzeth (2013). Well, you would be correct! Often we also want to quantify how much variation is explained by our moderators and random effects in our multilevel meta-regression model (Nakagawa and Schielzeth, 2013; Nakagawa et al., 2017b). This is done using \(R^2\). We can calculate different \(R^2\) values depending on whether we ignore or include random effects in the numerator. If only the variation explained by moderators is included in the numerator than we call this \(R_{marginal}^2\), which is defined as:

\[ \begin{equation} R^2_{marginal} = \frac{\sigma^2_{fixed}}{\sigma^2_{fixed} + \sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual}} \ \tag{4} \end{equation} \]

Note that this formula does not include \(\sigma^2_{m}\) as sampling error variance is assumed to be known in meta-analysis. We won’t go into the details on how to calculate and interpret these because we’ll cover this in a later tutorial on publication bias. Suffice it to say that this is a useful statistic for describing how much variation your moderators or model explain.


Hopefully it’s clear from this tutorial why explicitly estimating and reporting upon heterogeneity is so critically important in meta-analysis. Of course, some of these measures are only possible if we known the total sampling variance, which is not possible if one doesn’t use a weighted meta-analytic model. While all heterogeneity measures are useful, we agree with Borenstein (2019) and Nakagawa et al. (2021) that prediction intervals are probably the best to present given they are a very intuitive interpretation of heterogeneity.


Borenstein, M. (2019). Heterogenity in meta-analysis. In (ed. Cooper, H.), Hedges, L. V.), and Valentine, J. C.), pp. 454–466. New York: Russell Sage Foundation.
Gurevitch, J., Koricheva, J., Nakagawa, S. and Stewart, G. (2018). Meta-analysis and the science of research synthesis. Nature 555, 176–182.
Higgins, J. P. T. and Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine 21, 1539–1558.
Higgins, J. P. T., Thompson, S. G., Deeks, J. J. and Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal 327, 557–560.
Lajeunesse, M. J. (2010). Achieving synthesis with meta-analysis by combining and comparing all available studies. Ecology 91, 2561–2564.
Nakagawa, S. and Santos, E. S. (2012). Methodological issues and advances in biological meta-analysis. Evolutionary Ecology 26, 1253–1274.
Nakagawa, S. and Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed‐effects models. Methods in Ecology and Evolution 4, 133–142.
Nakagawa, S., Noble, D. W. A., Senior, A. M. and Lagisz, M. (2017a). Meta-evaluation of meta-analysis: Ten appraisal questions for biologists. BMC Biology 15–18, DOI 10.1186/s12915-017-0357-7.
Nakagawa, S., Johnson, P. C. and Schielzeth, H. (2017b). The coefficient of determination r 2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded. Journal of the Royal Society Interface 14, 20170213.
Nakagawa, S., Lagisz, M., O’Dea, R. E., Rutkowska, J., Yang, Y., Noble, D. W. A. and Senior, A. M. (2021). The orchard plot: Cultivating forest plots for use in ecology, evolution and beyond. Research Synthesis Methods 12, 4–12.
Noble, D. W. A., Pottier, P., Lagisz, M., Burke, S., Drobniak, S. M., O’Dea, R. E. and Nakagawa, S. (2022). Meta-analytic approaches and effect sizes to account for “nuisance heterogeneity” in comparative physiology. Journal of Experimental Biology 225, jeb243225.
O’Dea, R. E., Lagisz, M., Jennions, M. D., Koricheva, J., Noble, D. W. A., Parker, T. H., Gurevitch, J., Page, M. J., Stewart, G., Moher, D., et al. (2021). Preferred reporting items for systematic reviews and meta-analyses in ecology and evolutionary biology: A PRISMA extension. Biological Reviews doi: 10.1111/brv.12721,.
Pottier, P., Burke, S., Drobniak, S. M., Lagisz, M. and Nakagawa, S. (2021). Sexual (in) equality? A meta‐analysis of sex differences in thermal acclimation capacity across ectotherms. Functional Ecology 35, 2663–2678,–2435.13899.
Senior, A. M., Grueber, C. E., Kamiya, T., Lagisz, M., O’dwyer, K., Santos, E. S. A. and Nakagawa, S. (2016). Heterogeneity in ecological and evolutionary meta‐analyses: Its magnitude and implications. Ecology 97, 3293–3299.

Session Information

R version 4.0.3 (2020-10-10)

Platform: x86_64-apple-darwin17.0 (64-bit)

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: emmeans(v.1.7.5), R.rsp(v.0.44.0), patchwork(v.1.1.1), devtools(v.2.4.3), usethis(v.2.1.5), magick(v.2.7.2), vembedr(v.0.1.5), equatags(v.0.1.0), mathjaxr(v.1.6-0), pander(v.0.6.4), orchaRd(v.2.0), forcats(v.0.5.1), stringr(v.1.4.0), dplyr(v.1.0.9), purrr(v.0.3.4), readr(v.2.1.2), tidyr(v.1.2.0), tibble(v.3.1.7), ggplot2(v.3.3.6), tidyverse(v.1.3.1), flextable(v.0.6.6), metafor(v.3.4-0), metadat(v.1.3-0) and Matrix(v.1.3-4)

loaded via a namespace (and not attached):, colorspace(v.2.0-3), ellipsis(v.0.3.2), rprojroot(v.2.0.3), estimability(v.1.3), base64enc(v.0.1-3), fs(v.1.5.2), rstudioapi(v.0.13), remotes(v.2.4.2), mvtnorm(v.1.1-3), fansi(v.1.0.3), lubridate(v.1.8.0), xml2(v.1.3.3), codetools(v.0.2-18), splines(v.4.0.3), R.methodsS3(v.1.8.1), cachem(v.1.0.6), knitr(v.1.39), pkgload(v.1.2.4), jsonlite(v.1.8.0), broom(v.1.0.0), dbplyr(v.2.2.1), R.oo(v.1.24.0), compiler(v.4.0.3), httr(v.1.4.3), backports(v.1.4.1), assertthat(v.0.2.1), fastmap(v.1.1.0), cli(v.3.3.0), formatR(v.1.11), htmltools(v.0.5.2), prettyunits(v.1.1.1), tools(v.4.0.3), coda(v.0.19-4), gtable(v.0.3.0), glue(v.1.6.2), rappdirs(v.0.3.3), Rcpp(v., cellranger(v.1.1.0), jquerylib(v.0.1.4), vctrs(v.0.4.1), nlme(v.3.1-152), xfun(v.0.31), ps(v.1.7.1), brio(v.1.1.3), testthat(v.3.1.4), rvest(v.1.0.2), lifecycle(v.1.0.1), pacman(v.0.5.1), klippy(v., MASS(v.7.3-54), zoo(v.1.8-10), scales(v.1.2.0), hms(v.1.1.1), sandwich(v.3.0-1), yaml(v.2.3.5), curl(v.4.3.2), memoise(v.2.0.1), gdtools(v.0.2.3), sass(v.0.4.1), stringi(v.1.7.6), desc(v.1.4.1), pkgbuild(v.1.3.1), zip(v.2.2.0), rlang(v.1.0.3), pkgconfig(v.2.0.3), systemfonts(v.1.0.2), evaluate(v.0.15), lattice(v.0.20-44), tidyselect(v.1.1.2), processx(v.3.6.1), magrittr(v.2.0.3), bookdown(v.0.22), R6(v.2.5.1), generics(v.0.1.2), multcomp(v.1.4-17), DBI(v.1.1.3), pillar(v.1.7.0), haven(v.2.5.0), withr(v.2.5.0), survival(v.3.2-11), modelr(v.0.1.8), crayon(v.1.5.1), locatexec(v.0.1.1), uuid(v.1.1-0), utf8(v.1.2.2), tzdb(v.0.3.0), rmarkdown(v.2.14), officer(v.0.3.18), progress(v.1.2.2), grid(v.4.0.3), readxl(v.1.4.0), data.table(v.1.14.2), callr(v.3.7.0), xslt(v.1.4.3), reprex(v.2.0.1), digest(v.0.6.29), xtable(v.1.8-4), R.cache(v.0.15.0), R.utils(v.2.11.0), munsell(v.0.5.0), bslib(v.0.3.1) and sessioninfo(v.1.2.2)