In physiology, ecology and evolution, we deal with many different populations, species, experimental designs etc. As such, we’re not just interested in understanding ‘what the overall effect’ actually is (in fact, we may not even care in many cases), but we are mainly focused on attempting to understand what factors (e.g., biological, methodological) explain variation in effects (Gurevitch et al., 2018; Lajeunesse, 2010; Noble et al., 2022). Understanding and quantifying how much variability in effects exists, whether this is more than what we expect by chance, and what factors explain variation is a primary goal of meta-analysis.
Meta-analytic mean estimates need to be interpreted in the context of how much variation in effects exists within and across studies. As such, reporting upon measures of variability, or what is referred to as ‘heterogeneity’ in meta-analysis, is essential to meta-analysis and should never be ignored (even though it often is unfortunately) (Borenstein, 2019; Gurevitch et al., 2018; Nakagawa and Santos, 2012; Nakagawa et al., 2017a; O’Dea et al., 2021).
There are a number of important metrics of heterogeneity that are commonly used and reported upon in the meta-analytic literature (Borenstein, 2019; Nakagawa and Santos, 2012; Nakagawa et al., 2017a). We’ll overview in this tutorial how to calculate, and interpret, many of the common types and what they mean with respect to interpreting the meta-analytic mean. Understanding the consistency of results across studies tells one a great deal about how likely an effect is going to be picked up in future studies and whether we can make broad general conclusions.
Measures of Heterogeneity
The most commonly encountered measures of heterogeneity in comparative physiology, and indeed ecology and evolution more generally, are results of Q tests, \(I^2\) metrics (or raw \(\tau^2\) or variance estimates), and less commonly, prediction intervals. These measures of heterogeneity (even if just a few) should always be presented alongside meta-analytic mean effect size estimates because they tell the reader a great deal about the ‘consistency’ of results (e.g., prediction intervals, total heterogeneity) or the relative contribution of different factors to effect size variation (i.e., different measures of \(I^2\)). We’ll overview these different metrics and show how to calculate and interpret them in meta-analyses.
For the purpose of this tutorial we’ll assume the following multilevel meta-analytic model is fit as described in the multilevel model tutorial. If you can’t quite remember all the notation we recommend going back to review that page.
\[
\begin{aligned}
y_{i} &= \mu + s_{j[i]} + spp_{k[i]} + e_{i} + m_{i} \\
m_{i} &\sim N(0, v_{i}) \\
s_{j} &\sim N(0, \tau^2) \\
spp_{k} &\sim N(0, \sigma_{k}^2) \\
e_{i} &\sim N(0, \sigma_{e}^2)
\end{aligned}
\] We will return to the meta-analysis by Pottier et al. (2021) that we already detailed in the multilevel meta-analysis tutorial. We’ll walk through the different types of heterogeneity statistics that we can calculate for the model.
We can re-load that data again and get it ready for analysis using the code below:
Code
# install.packages("pacman") ; uncomment this line if you haven't already installed 'pacman'pacman::p_load(metafor, tidyverse, orchaRd, devtools, patchwork, R.rsp, emmeans, flextable)asr_dat <-read.csv("https://osf.io/qn2af/download")#' @title arr#' @description Calculates the acclimation response ratio (ARR). #' @param t2_l Lowest of the two treatment temperatures#' @param t1_h Highest of the two treatment temperatures#' @param x1_h Mean trait value at high temperature#' @param x2_l Mean trait value at low temperature#' @param sd1_h Standard deviation of mean trait value at high temperature#' @param sd2_l Standard deviation of mean trait value at low temperature#' @param n1_h Sample size at high temperature#' @param n2_l Sample size at low temperaturearr <-function(x1_h, x2_l, sd1_h, sd2_l, n1_h, n2_l, t1_h, t2_l){ ARR <- (x1_h - x2_l)/(t1_h - t2_l) V_ARR <- ((1/(t1_h - t2_l))^2*(sd2_l^2/n2_l + sd1_h^2/n1_h))return(data.frame(ARR, V_ARR))}# Calculate the effect sizesasr_dat<-asr_dat %>%mutate(ARR=arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low, sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[,1], V_ARR =arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low, sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[,2]) %>%filter(sex =="female")# Re-fit the multilevel meta-analytic modelMLMA <- metafor::rma.mv(yi= ARR~1, V = V_ARR, method="REML",random=list(~1|species_ID,~1|study_ID,~1|es_ID), dfs ="contain",test="t",data=asr_dat)
We’ll now use this model to calculate various heterogeneity statistics in addition to the ones already calculated (which we will describe below). Before doing so, it helps to visualise the data and the model output together.
Visualising Heterogeneity with an Orchard Plot
The orchaRd package provides the orchard_plot() function, which is a great tool for communicating meta-analytic results and heterogeneity at the same time (Nakagawa et al., 2021; Nakagawa et al., 2023). The plot shows:
the meta-analytic mean as a point estimate (the filled circle in Figure 7.1);
a thick bar spanning the 95% confidence interval (uncertainty around the mean);
a thin bar spanning the 95% prediction interval (the expected range of a future effect size — the key to visualising heterogeneity);
individual effect sizes are plotted as points scaled by their precision (\(1/\sqrt{v_i}\)), so precise estimates appear larger.
Code
orchaRd::orchard_plot(MLMA, group ="species_ID", xlab ="Acclimation Response Ratio (ARR, °C / °C)")
Figure 7.1: Orchard plot of acclimation response ratio (ARR) estimates across ectotherm species. The filled circle is the meta-analytic mean, the thick bar is the 95% confidence interval, and the thin bar is the 95% prediction interval. Individual effect sizes are scaled by precision.
The wide prediction interval in Figure 7.1 immediately makes clear that there is substantial heterogeneity in ARR across species — the spread of effects is far greater than sampling error alone would produce. The numerical heterogeneity metrics below quantify this formally.
Proportion of Total Heterogeneity: \(I_{total}^2\)
where \(\sigma^2_{total} = \sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual} +\sigma^2_{m}\) is the total effect size variance and \(\sigma^2_{m}\) is the ‘typical’ sampling error variance calculated as:
Note
Equation 7.1 is the general form and may include a phylogenetic variance component (\(\sigma^2_{phylogeny}\)) when a phylogenetic correlation structure is added to the model. In the model fitted here, no phylogenetic term is included, so \(\sigma^2_{phylogeny} = 0\) and the numerator reduces to \(\sigma^2_{study} + \sigma^2_{species} + \sigma^2_{residual}\).
where k is the number of studies and the weights, \(w_{i} = \frac{1}{v_{i}}\), can be calculated using the inverse of the sampling variance (\(v_{i}\)) for each effect size, i. Below, we’ll make use of the orchaRd R package (Nakagawa et al., 2021; Nakagawa et al., 2023) to calculate various \(I^2\) metrics. In fact, we can simply change what is in the numerator of Equation 7.1 to calculate a multitude of different \(I^2\) metrics that can be useful in understanding the relative importance of factors most important in explaining effect size variation. We’ll do this below:
Code
# The orchaRd package has some convenient functions for calculating various I2 estimates including total. We'll load and install that package#install.packages("pacman")pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)#devtools::install_github("daniel1noble/orchaRd", force = TRUE, build_vignettes = TRUE)library(orchaRd)orchaRd::i2_ml(MLMA)
Interpret the meaning of \(I_{Total}^2\) from the multilevel meta-analytic model
Overall, we have highly heterogeneous effect size data because sampling variation only contributes to 1.581% of the total variation in effects.
Interpret the meaning of \(I_{study}^2\) from the multilevel meta-analytic model
From the multilevel meta-analytic model we find that 58.481% of the total variation in effect size estimates is the result of differences between studies.
Bootstrapping \(I^2\) Estimates
There are also times that we may want to estimate, and present, uncertainty about these heterogeneity estimates. We can do that by bootstrapping 1000 times:
Code
# The orchaRd package has some convenient functions for calculating various I2 estimates including total. We'll load and install that package#install.packages("pacman")pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)#devtools::install_github("daniel1noble/orchaRd", force = TRUE, build_vignettes = TRUE)library(orchaRd)orchaRd::i2_ml(MLMA, boot =1000)
Here, we can now see that \(I_{total}^2\) has a fairly narrow 95% confidence interval of ~ 0.97 to 0.99.
Beyond \(I^2\): Mean-Standardised Heterogeneity (\(CVH_2\) and \(M_2\))
\(I^2\) is a variance-partitioning metric: it tells you what fraction of total variance is genuine heterogeneity, not sampling error. That is useful, but it says nothing about whether the heterogeneity is large or small relative to the mean effect. Consider two studies that both have \(I^2_{total} = 0.90\). If one has a mean ARR of 0.001 °C/°C the absolute spread of effects is biologically negligible; if the other has a mean of 1 °C/°C the same 90% translates to enormous between-study variation. Yang et al. (2025) propose two mean-standardised alternatives that capture this information:
where \(\hat{\mu}\) is the meta-analytic mean and \(\sum \sigma^2\) is the sum of all random-effect variance components (excluding sampling variance \(\sigma^2_m\)). \(CVH_2\) (Equation 7.3) is analogous to a squared coefficient of variation: values \(> 1\) indicate that heterogeneity variance exceeds the mean effect squared. \(M_2\) (Equation 7.4) is bounded \([0, 1]\) and can be read as the proportion of total ‘signal’ (mean\(^2\) + variance) that is heterogeneity. Yang et al. (2025) recommend reporting both \(CVH_2\) and \(M_2\) alongside \(I^2\) for a more complete picture. These metrics are implemented in orchaRd:
Given the values of \(CVH_2\) and \(M_2\) above, what do they tell you about ARR heterogeneity that \(I^2_{total}\) does not?
\(I^2_{total}\) tells us that almost all of the variance in ARR estimates is real heterogeneity rather than sampling noise. But it does not tell us whether that heterogeneity is large or small on a biologically meaningful scale. \(CVH_2\) answers that: a value much greater than 1 means the variance across studies is many times larger than the mean effect squared — i.e., the spread of true effects dwarfs the average effect. \(M_2\) near 1 means variation dominates the signal; near 0 means the mean is large and stable relative to the scatter. Together these metrics clarify whether a high \(I^2\) reflects a lot of noise around a strong signal or a weak signal swamped by variation.
Prediction Intervals
Prediction intervals (PI) are probably the best and most intuitive way to report heterogeneity of meta-analytic results (Borenstein, 2019; Nakagawa et al., 2021; Nakagawa et al., 2023; Noble et al., 2022). Prediction intervals tell us how much we can expect a given effect size to vary across studies. More specifically, if we were to go out and conduct another study or experiment they tell us the range of effect size estimates we are expected to observe 95% of the time from that new study (Borenstein, 2019; Nakagawa et al., 2021; Noble et al., 2022).
Prediction intervals can be calculated in a similar way to confidence intervals but instead of just using the standard error, like we do with confidence interval construction, we add in the extra random-effect variance estimates. For our model above, the total PI — spanning all sources of variance — is:
However, Yang et al. (2024) highlight that the total PI conflates two distinct questions: (1) how variable are individual effect size estimates across all studies (which includes within-study residual variance, \(\sigma^2_{residual}\)), and (2) how variable are study-level means across different biological contexts (which excludes within-study residual variance). The latter — called the study-level PI — is often the more relevant question for biological generality, because it captures how much we expect the average result to shift if we replicated the study in a new species or population context:
Yang et al. (2024) found that across 512 ecological and evolutionary meta-analyses, most of the wide total PI came from within-study (residual) variance, and that study-level generality was far more common than the total PI implied.
We can compute both types easily:
Code
# Total PI — all variance components (predict() uses all sigma2 by default)predict(MLMA)
pred se ci.lb ci.ub pi.lb pi.ub
0.1668 0.0316 0.1010 0.2327 -0.1755 0.5091
Code
# Study-level PI — between-study variance only (species_ID + study_ID, excluding es_ID)# sigma2 order matches the random= list: [species_ID, study_ID, es_ID]vc <- MLMA$sigma2pi_study_se <-sqrt(sum(vc[1:2]) + MLMA$se^2)cat("Study-level 95% PI: [",round(MLMA$b -1.96* pi_study_se, 3), ",",round(MLMA$b +1.96* pi_study_se, 3), "]\n")
The meta-analytic mean ARR is 0.167. Compare the total PI to the study-level PI. What does each tell you, and why might one be more useful for assessing biological generality?
The total 95% PI is wide — from -0.175 to 0.509 — suggesting a lot of inconsistency across individual effect size estimates. But much of that width comes from within-study residual variance (variation among effect sizes within a single study). The study-level PI, which retains only between-study variance (species and study components), is narrower and tells us how much the average result is expected to shift if we repeated the study in a new biological context. For biological generality, the study-level PI is the more relevant question: if most replications in new species or populations land near the mean, the phenomenon generalises well even if individual effect sizes are noisy.
Q-tests: Inferential Tests to Determine Excess Heterogeneity
So far we’ve talked about important heterogeneity statistics that describe, and quantify, the proportion of variability (\(I^2\)) or the expected range of plausible effect size values which incorporate all the variance estimates (PI). But, how do we know that the heterogeneity in the data is greater than what we would expect by chance? It seems obvious that, if the \(I_{total}^2\) is really high (i.e., >85%) there’s a lot of variability in effects within and across studies. While these statistics are probably sufficient, Q tests can also be useful to more formally make an inferential statement about whether heterogeneity is significantly greater than chance.
We have already encountered how to calculate the Q-statistic when discussing fixed and random effects models. We can use the Q statistic now to compare this against a \(\chi^2\)-distribution to calculate a p-value. Wolfgang’s documentation (?rma.mv) describes what the Q-test means:
Cochran’s 𝑄-test, which tests whether the variability in the observed effect sizes or outcomes is larger than one would expect based on sampling variability (and the given covariances among the sampling errors) alone. A significant test suggests that the true effects/outcomes are heterogeneous.
We don’t need to do anything special to get this test. In fact, this is automatically done by metafor:
Code
print(MLMA)
Multivariate Meta-Analysis Model (k = 123; method: REML)
Variance Components:
estim sqrt nlvls fixed factor
sigma^2.1 0.0008 0.0280 29 no species_ID
sigma^2.2 0.0154 0.1241 21 no study_ID
sigma^2.3 0.0097 0.0987 123 no es_ID
Test for Heterogeneity:
Q(df = 122) = 3941.0055, p-val < .0001
Model Results:
estimate se tval df pval ci.lb ci.ub
0.1668 0.0316 5.2857 20 <.0001 0.1010 0.2327 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How would you describe in a paper the Q-test from the metafor model above?
Of course, there are a multitude of ways that we could describe this test in words, but one way to describe it in your results section might be as follows:
The overall mean ARR across ectotherms was 0.167 and differed significantly from zero (95% CI: 0.101 to 0.233, df = 20, t = 5.286, p < 0.001). However, there was significant residual heterogeneity (Q = 3941.005, df = 122, p < 0.001) with estimates expected to range from -0.175 to 0.509 (95% Prediction Interval) (\(I_{total}^2\) = 98.419).
Proportion of Variation Explained by the Model: \(R^2_{marginal}\) and \(R^2_{conditional}\)
You will probably have noticed that \(I^2\) looks awfully familiar. It looks a lot like the way we might calculate \(R^2\) described by Nakagawa and Schielzeth (2013). Well, you would be correct! Often we also want to quantify how much variation is explained by our moderators and random effects in our multilevel meta-regression model (Nakagawa and Schielzeth, 2013; Nakagawa et al., 2017b). This is done using \(R^2\). We can calculate different \(R^2\) values depending on whether we ignore or include random effects in the numerator. If only the variation explained by moderators is included in the numerator than we call this \(R_{marginal}^2\), which is defined as:
Note that this formula does not include \(\sigma^2_{m}\) as sampling error variance is assumed to be known in meta-analysis. We won’t go into the details on how to calculate and interpret these because we’ll cover this in a later tutorial on publication bias. Suffice it to say that this is a useful statistic for describing how much variation your moderators or model explain.
Conclusion
Hopefully it’s clear from this tutorial why explicitly estimating and reporting upon heterogeneity is so critically important in meta-analysis. Of course, some of these measures are only possible if we know the total sampling variance, which is not possible if one doesn’t use a weighted meta-analytic model.
No single metric tells the whole story. Yang et al. (2025) recommend a pluralistic approach to reporting heterogeneity:
\(I^2_{total}\) — tells you what fraction of variance is real signal (not sampling noise);
\(CVH_2\) and \(M_2\) — tell you how large that heterogeneity is relative to the mean effect, answering the question “does the average matter if effects are so variable?”;
Prediction intervals — give the most intuitive summary of heterogeneity as the expected range of a new result. Reporting both the total PI and the study-level PI (Yang et al., 2024) helps distinguish noise within studies from genuine variation across biological contexts;
Together these metrics give a far richer picture of heterogeneity than \(I^2\) alone.
References
Borenstein, M. (2019). Heterogenity in meta-analysis. In (ed. Cooper, H.), Hedges, L. V.), and Valentine, J. C.), pp. 454–466. New York: Russell Sage Foundation.
# Heterogeneity in Meta-analysis```{r}#| label: setup-packages-het#| echo: false#| message: false#| warning: false# Load packagespacman::p_load(metafor, flextable, tidyverse, orchaRd, pander, mathjaxr, equatags, magick)options(digits=3)```## **Introduction to Heterogeneity in Meta-analysis**In physiology, ecology and evolution, we deal with many different populations, species, experimental designs etc. As such, we're not just interested in understanding 'what the overall effect' actually is (in fact, we may not even care in many cases), but we are mainly focused on attempting to understand what factors (e.g., biological, methodological) explain variation in effects [@Noble2022; @Lag2010; @Gurevitch2018]. Understanding and quantifying how much variability in effects exists, whether this is more than what we expect by chance, and what factors explain variation is a primary goal of meta-analysis. Meta-analytic mean estimates need to be interpreted in the context of how much variation in effects exists within and across studies. As such, reporting upon measures of variability, or what is referred to as 'heterogeneity' in meta-analysis, is essential to meta-analysis and should never be ignored (even though it often is unfortunately) [@ODea2021; @NakagawaSantos2012; @Borenstein2019; @Nakagawa2017; @Gurevitch2018]. There are a number of important metrics of heterogeneity that are commonly used and reported upon in the meta-analytic literature [@NakagawaSantos2012; @Borenstein2019; @Nakagawa2017]. We'll overview in this tutorial how to calculate, and interpret, many of the common types and what they mean with respect to interpreting the meta-analytic mean. Understanding the consistency of results across studies tells one a great deal about how likely an effect is going to be picked up in future studies and whether we can make broad general conclusions. ## **Measures of Heterogeneity**The most commonly encountered measures of heterogeneity in comparative physiology, and indeed ecology and evolution more generally, are results of *Q* tests, $I^2$ metrics (or raw $\tau^2$ or variance estimates), and less commonly, prediction intervals. These measures of heterogeneity (even if just a few) should always be presented alongside meta-analytic mean effect size estimates because they tell the reader a great deal about the 'consistency' of results (e.g., prediction intervals, total heterogeneity) or the relative contribution of different factors to effect size variation (i.e., different measures of $I^2$). We'll overview these different metrics and show how to calculate and interpret them in meta-analyses. For the purpose of this tutorial we'll assume the following multilevel meta-analytic model is fit as described in the [multilevel model tutorial](multi-level.qmd). If you can't quite remember all the notation we recommend going back to review that page. $$\begin{aligned}y_{i} &= \mu + s_{j[i]} + spp_{k[i]} + e_{i} + m_{i} \\m_{i} &\sim N(0, v_{i}) \\s_{j} &\sim N(0, \tau^2) \\spp_{k} &\sim N(0, \sigma_{k}^2) \\e_{i} &\sim N(0, \sigma_{e}^2)\end{aligned}$$We will return to the meta-analysis by @Pottier2021 that we already detailed in the [multilevel meta-analysis tutorial](multi-level.qmd). We'll walk through the different types of heterogeneity statistics that we can calculate for the model. We can re-load that data again and get it ready for analysis using the code below:```{r}#| label: load-data#| message: false#| warning: false# install.packages("pacman") ; uncomment this line if you haven't already installed 'pacman'pacman::p_load(metafor, tidyverse, orchaRd, devtools, patchwork, R.rsp, emmeans, flextable)asr_dat <-read.csv("https://osf.io/qn2af/download")#' @title arr#' @description Calculates the acclimation response ratio (ARR). #' @param t2_l Lowest of the two treatment temperatures#' @param t1_h Highest of the two treatment temperatures#' @param x1_h Mean trait value at high temperature#' @param x2_l Mean trait value at low temperature#' @param sd1_h Standard deviation of mean trait value at high temperature#' @param sd2_l Standard deviation of mean trait value at low temperature#' @param n1_h Sample size at high temperature#' @param n2_l Sample size at low temperaturearr <-function(x1_h, x2_l, sd1_h, sd2_l, n1_h, n2_l, t1_h, t2_l){ ARR <- (x1_h - x2_l)/(t1_h - t2_l) V_ARR <- ((1/(t1_h - t2_l))^2*(sd2_l^2/n2_l + sd1_h^2/n1_h))return(data.frame(ARR, V_ARR))}# Calculate the effect sizesasr_dat<-asr_dat %>%mutate(ARR=arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low, sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[,1], V_ARR =arr(x1_h = mean_high, x2_l = mean_low, t1_h = acc_temp_high, t2_l = acc_temp_low, sd1_h = sd_high, sd2_l = sd_low, n1_h = n_high_adj, n2_l = n_low_adj)[,2]) %>%filter(sex =="female")# Re-fit the multilevel meta-analytic modelMLMA <- metafor::rma.mv(yi= ARR~1, V = V_ARR, method="REML",random=list(~1|species_ID,~1|study_ID,~1|es_ID), dfs ="contain",test="t",data=asr_dat)```We'll now use this model to calculate various heterogeneity statistics in addition to the ones already calculated (which we will describe below). Before doing so, it helps to visualise the data and the model output together.### Visualising Heterogeneity with an Orchard PlotThe `orchaRd` package provides the `orchard_plot()` function, which is a great tool for communicating meta-analytic results and heterogeneity at the same time [@Nakagawa2021c; @Nakagawa2023]. The plot shows:- the **meta-analytic mean** as a point estimate (the filled circle in @fig-orchard);- a **thick bar** spanning the 95% confidence interval (uncertainty around the mean);- a **thin bar** spanning the 95% **prediction interval** (the expected range of a future effect size — the key to visualising heterogeneity);- individual effect sizes are plotted as points scaled by their precision ($1/\sqrt{v_i}$), so precise estimates appear larger.```{r}#| label: fig-orchard#| echo: true#| fig-align: center#| fig-cap: "Orchard plot of acclimation response ratio (ARR) estimates across ectotherm species. The filled circle is the meta-analytic mean, the thick bar is the 95% confidence interval, and the thin bar is the 95% prediction interval. Individual effect sizes are scaled by precision."orchaRd::orchard_plot(MLMA, group ="species_ID", xlab ="Acclimation Response Ratio (ARR, °C / °C)")```The wide prediction interval in @fig-orchard immediately makes clear that there is substantial heterogeneity in ARR across species — the spread of effects is far greater than sampling error alone would produce. The numerical heterogeneity metrics below quantify this formally.### Proportion of Total Heterogeneity: $I_{total}^2$$I^2$ estimates are probably most commonly presented in the literature [@Higgins2003; @Higgins2002; @Nakagawa2017; @Senior2016; @Borenstein2019]. There are different forms of $I^2$ that can be calculated, but the one that describes the proportion of effect size variation after accounting for total sampling variation is $I_{total}^2$ [@NakagawaSantos2012]. Assuming we're using our multilevel model described above, it's calculated as follows:$$I^2_{total} = \frac{\sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual}}{\sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual} +\sigma^2_{m}}$$ {#eq-itot}where $\sigma^2_{total} = \sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual} +\sigma^2_{m}$ is the total effect size variance and $\sigma^2_{m}$ is the 'typical' sampling error variance calculated as:::: {.callout-note}@eq-itot is the general form and may include a phylogenetic variance component ($\sigma^2_{phylogeny}$) when a phylogenetic correlation structure is added to the model. In the model fitted here, no phylogenetic term is included, so $\sigma^2_{phylogeny} = 0$ and the numerator reduces to $\sigma^2_{study} + \sigma^2_{species} + \sigma^2_{residual}$.:::$$\sigma_{m}^2 = \sum w_{i}\left( k-1\right) / \left[ \left( \sum w_{i}\right)^2 + \sum w_{i}^2\right]$$ {#eq-w}where *k* is the number of studies and the weights, $w_{i} = \frac{1}{v_{i}}$, can be calculated using the inverse of the sampling variance ($v_{i}$) for each effect size, *i*. Below, we'll make use of the `orchaRd` R package [@Nakagawa2021c; @Nakagawa2023] to calculate various $I^2$ metrics. In fact, we can simply change what is in the numerator of @eq-itot to calculate a multitude of different $I^2$ metrics that can be useful in understanding the relative importance of factors most important in explaining effect size variation. We'll do this below:```{r}#| label: i2#| echo: true# The orchaRd package has some convenient functions for calculating various I2 estimates including total. We'll load and install that package#install.packages("pacman")pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)#devtools::install_github("daniel1noble/orchaRd", force = TRUE, build_vignettes = TRUE)library(orchaRd)orchaRd::i2_ml(MLMA)```#### **Interpreting $I^2$ Estimates**::: {.panel-tabset}## First Task!>**Interpret the meaning of $I_{Total}^2$ from the multilevel meta-analytic model**## First Answer!>Overall, we have highly heterogeneous effect size data because sampling variation only contributes to `r 100-orchaRd::i2_ml(MLMA)[1]`% of the total variation in effects.## Second Task!>**Interpret the meaning of $I_{study}^2$ from the multilevel meta-analytic model**## Second Answer!>From the multilevel meta-analytic model we find that `r orchaRd::i2_ml(MLMA)[3]`% of the total variation in effect size estimates is the result of differences between studies.:::<br>#### Bootstrapping $I^2$ EstimatesThere are also times that we may want to estimate, and present, uncertainty about these heterogeneity estimates. We can do that by bootstrapping 1000 times:```{r}#| label: i2boot#| echo: true#| cache: true# The orchaRd package has some convenient functions for calculating various I2 estimates including total. We'll load and install that package#install.packages("pacman")pacman::p_load(devtools, tidyverse, metafor, patchwork, R.rsp, emmeans, flextable)#devtools::install_github("daniel1noble/orchaRd", force = TRUE, build_vignettes = TRUE)library(orchaRd)orchaRd::i2_ml(MLMA, boot =1000)```Here, we can now see that $I_{total}^2$ has a fairly narrow 95% confidence interval of ~ 0.97 to 0.99.### Beyond $I^2$: Mean-Standardised Heterogeneity ($CVH_2$ and $M_2$)$I^2$ is a *variance-partitioning* metric: it tells you what fraction of total variance is genuine heterogeneity, not sampling error. That is useful, but it says nothing about whether the heterogeneity is large or small *relative to the mean effect*. Consider two studies that both have $I^2_{total} = 0.90$. If one has a mean ARR of 0.001 °C/°C the absolute spread of effects is biologically negligible; if the other has a mean of 1 °C/°C the same 90% translates to enormous between-study variation. @Yang2025het propose two mean-standardised alternatives that capture this information:$$CVH_2 = \frac{\sum \sigma^2}{\hat{\mu}^2}$$ {#eq-cvh2}$$M_2 = \frac{\sum \sigma^2}{\hat{\mu}^2 + \sum \sigma^2}$$ {#eq-m2}where $\hat{\mu}$ is the meta-analytic mean and $\sum \sigma^2$ is the sum of all random-effect variance components (excluding sampling variance $\sigma^2_m$). $CVH_2$ (@eq-cvh2) is analogous to a squared coefficient of variation: values $> 1$ indicate that heterogeneity variance exceeds the mean effect squared. $M_2$ (@eq-m2) is bounded $[0, 1]$ and can be read as the proportion of total 'signal' (mean$^2$ + variance) that is heterogeneity. @Yang2025het recommend reporting both $CVH_2$ and $M_2$ alongside $I^2$ for a more complete picture. These metrics are implemented in `orchaRd`:```{r}#| label: cvhm#| echo: trueorchaRd::cvh2_ml(MLMA) # Mean-standardised heterogeneity (variance scale)orchaRd::m2_ml(MLMA) # Heterogeneity as fraction of total signal```#### **Interpreting $CVH_2$ and $M_2$**::: {.panel-tabset}## Task!>**Given the values of $CVH_2$ and $M_2$ above, what do they tell you about ARR heterogeneity that $I^2_{total}$ does not?**## Answer!>$I^2_{total}$ tells us that almost all of the variance in ARR estimates is real heterogeneity rather than sampling noise. But it does not tell us whether that heterogeneity is large or small on a biologically meaningful scale. $CVH_2$ answers that: a value much greater than 1 means the variance across studies is many times larger than the mean effect squared — i.e., the spread of true effects dwarfs the average effect. $M_2$ near 1 means variation dominates the signal; near 0 means the mean is large and stable relative to the scatter. Together these metrics clarify whether a high $I^2$ reflects *a lot of noise around a strong signal* or *a weak signal swamped by variation*.:::<br>### Prediction IntervalsPrediction intervals (PI) are probably the best and most intuitive way to report heterogeneity of meta-analytic results [@Borenstein2019; @Noble2022; @Nakagawa2021c; @Nakagawa2023]. Prediction intervals tell us how much we can expect a given effect size to vary across studies. More specifically, if we were to go out and conduct another study or experiment they tell us the range of effect size estimates we are expected to observe 95% of the time from that new study [@Borenstein2019; @Noble2022; @Nakagawa2021c].Prediction intervals can be calculated in a similar way to confidence intervals but instead of just using the standard error, like we do with confidence interval construction, we add in the extra random-effect variance estimates. For our model above, the **total** PI — spanning all sources of variance — is:$$PI_{total} \sim \bar{\mu} \pm 1.96 \sqrt{SE^2 + \sigma^2_{study} + \sigma^2_{species} + \sigma^2_{residual}}$$ {#eq-pi}However, @Yang2024pi highlight that the total PI conflates two distinct questions: (1) how variable are individual *effect size estimates* across all studies (which includes within-study residual variance, $\sigma^2_{residual}$), and (2) how variable are *study-level means* across different biological contexts (which excludes within-study residual variance). The latter — called the **study-level PI** — is often the more relevant question for biological generality, because it captures how much we expect the average result to shift if we replicated the study in a new species or population context:$$PI_{study} \sim \bar{\mu} \pm 1.96 \sqrt{SE^2 + \sigma^2_{study} + \sigma^2_{species}}$$ {#eq-pi-study}@Yang2024pi found that across 512 ecological and evolutionary meta-analyses, most of the wide total PI came from within-study (residual) variance, and that study-level generality was far more common than the total PI implied.We can compute both types easily:```{r}#| label: pis#| echo: true# Total PI — all variance components (predict() uses all sigma2 by default)predict(MLMA)# Study-level PI — between-study variance only (species_ID + study_ID, excluding es_ID)# sigma2 order matches the random= list: [species_ID, study_ID, es_ID]vc <- MLMA$sigma2pi_study_se <-sqrt(sum(vc[1:2]) + MLMA$se^2)cat("Study-level 95% PI: [",round(MLMA$b -1.96* pi_study_se, 3), ",",round(MLMA$b +1.96* pi_study_se, 3), "]\n")```#### **Interpreting Prediction Intervals**::: {.panel-tabset}## Task!>**The meta-analytic mean ARR is `r round(predict(MLMA)[[1]], 3)`. Compare the total PI to the study-level PI. What does each tell you, and why might one be more useful for assessing biological generality?**## Answer!>The total 95% PI is wide — from `r round(predict(MLMA)[[5]], 3)` to `r round(predict(MLMA)[[6]], 3)` — suggesting a lot of inconsistency across individual effect size estimates. But much of that width comes from within-study residual variance (variation among effect sizes *within* a single study). The study-level PI, which retains only between-study variance (species and study components), is narrower and tells us how much the *average result* is expected to shift if we repeated the study in a new biological context. For biological generality, the study-level PI is the more relevant question: if most replications in new species or populations land near the mean, the phenomenon generalises well even if individual effect sizes are noisy.:::<br>### *Q*-tests: Inferential Tests to Determine Excess Heterogeneity So far we've talked about important heterogeneity statistics that describe, and quantify, the proportion of variability ($I^2$) or the expected range of plausible effect size values which incorporate all the variance estimates (PI). But, how do we know that the heterogeneity in the data is greater than what we would expect by chance? It seems obvious that, if the $I_{total}^2$ is really high (i.e., >85%) there's a lot of variability in effects within and across studies. While these statistics are probably sufficient, *Q* tests can also be useful to more formally make an inferential statement about whether heterogeneity is significantly greater than chance. We have already encountered how to calculate the *Q*-statistic when discussing [fixed and random effects](fixed-vs-random.qmd) models. We can use the *Q* statistic now to compare this against a $\chi^2$-distribution to calculate a p-value. [Wolfgang's documentation](https://wviechtb.github.io/metafor/reference/rma.mv.html) (`?rma.mv`) describes what the *Q*-test means:> Cochran's 𝑄-test, which tests whether the variability in the observed effect sizes or outcomes is larger than one would expect based on sampling variability (and the given covariances among the sampling errors) alone. A significant test suggests that the true effects/outcomes are heterogeneous. <br>We don't need to do anything special to get this test. In fact, this is automatically done by `metafor`:```{r}#| label: q#| echo: trueprint(MLMA)```#### **Describing the *Q*-test**::: {.panel-tabset}## Task!>**How would you describe in a paper the *Q*-test from the `metafor` model above?**## Answer!>Of course, there are a multitude of ways that we could describe this test in words, but one way to describe it in your results section might be as follows:<br>>The overall mean ARR across ectotherms was `r MLMA$b` and differed significantly from zero (95% CI: `r MLMA$ci.lb` to `r MLMA$ci.ub`, df = `r MLMA$ddf`, t = `r MLMA$b / MLMA$se`, p < 0.001). However, there was significant residual heterogeneity (*Q* = `r MLMA$QE`, df = `r dim(MLMA$data)[1]-1`, p < 0.001) with estimates expected to range from `r predict(MLMA)[[5]]` to `r predict(MLMA)[[6]]` (95% Prediction Interval) ($I_{total}^2$ = `r orchaRd::i2_ml(MLMA)[1]`).:::<br>### Proportion of Variation Explained by the Model: $R^2_{marginal}$ and $R^2_{conditional}$You will probably have noticed that $I^2$ looks awfully familiar. It looks a lot like the way we might calculate $R^2$ described by @NakagawaSchielzeth2013. Well, you would be correct! Often we also want to quantify how much variation is explained by our moderators and random effects in our multilevel meta-regression model [@Nakagawa2017ab; @NakagawaSchielzeth2013]. This is done using $R^2$. We can calculate different $R^2$ values depending on whether we ignore or include random effects in the numerator. If only the variation explained by moderators is included in the numerator than we call this $R_{marginal}^2$, which is defined as:$$R^2_{marginal} = \frac{\sigma^2_{fixed}}{\sigma^2_{fixed} + \sigma^2_{study} + \sigma^2_{phylogeny} + \sigma^2_{species} + \sigma^2_{residual}}$$ {#eq-r2}Note that this formula does not include $\sigma^2_{m}$ as sampling error variance is assumed to be known in meta-analysis. We won't go into the details on how to calculate and interpret these because we'll cover this in a later tutorial on [publication bias](pub-bias1.qmd). Suffice it to say that this is a useful statistic for describing how much variation your moderators or model explain.## **Conclusion**Hopefully it's clear from this tutorial why explicitly estimating and reporting upon heterogeneity is so critically important in meta-analysis. Of course, some of these measures are only possible if we know the total sampling variance, which is not possible if one doesn't use a weighted meta-analytic model.No single metric tells the whole story. @Yang2025het recommend a **pluralistic approach** to reporting heterogeneity:- **$I^2_{total}$** — tells you what fraction of variance is real signal (not sampling noise);- **$CVH_2$ and $M_2$** — tell you how large that heterogeneity is relative to the mean effect, answering the question "does the average matter if effects are so variable?";- **Prediction intervals** — give the most intuitive summary of heterogeneity as the expected range of a new result. Reporting both the *total* PI and the *study-level* PI [@Yang2024pi] helps distinguish noise within studies from genuine variation across biological contexts;- **An orchard plot** — visualises all of the above at once and should accompany any numerical heterogeneity summary [@Nakagawa2021c; @Nakagawa2023].Together these metrics give a far richer picture of heterogeneity than $I^2$ alone.## **References**<div id="refs"></div><br>## **Session Information**```{r}#| label: sessioninfo-het#| echo: falsepander(sessionInfo(), locale =FALSE)```