Phylogenetic Meta-analysis

Introduction to Phylogenetic Meta-analysis

Meta-analyses in comparative physiology almost always synthesise effects across a multitude of species (Chamberlain et al., 2012; Hadfield, 2010; Nakagawa and Santos, 2012). Species share an evolutionary history which is described by a phylogeny (Chamberlain et al., 2012; Gurevitch and Hedges, 1999; Lajeunesse, 2009). As a result, the samples (and the effect sizes obtained from these samples) are not independent which violates the independence assumption underlying conventional meta-analytic models. For example, the standard fixed- and random-effects models often used for ecological meta-analyses (Lajeunesse, 2009; Nakagawa and Santos, 2012), assume independence among the effect sizes and therefore do not account for phylogeny (Chamberlain et al., 2012; Noble et al., 2017). Including a ‘species-level’ random effect will not deal with this type of non-independence because the ‘effects’ are themselves correlated based on the amount of shared evolutionary history between two species (See the Complex Non-independence Tutorial).

Recent simulations by Cinar et al. (2021) emphasise the importance of modelling phylogeny in meta-analytic models. Even when there is little phylogenetic signal in the data it will not compromise the analysis to have a phylogeny included, but it will protect you against Type I errors. In this tutorial, we expand on our multilevel meta-analytic models to account for phylogenetic relatedness. Part of the challenge in controlling for phylogeny in a meta-analysis is that the taxa included are often highly diverse and there is little resolution on the exact phylogenetic relationship among the taxa in question. Having said that, there are tools that will probably do a fairly good job delineating the, showing how we can build a rough phylogeny that can be included.

Formal Definition of Phylogenetic Meta-analytic Model

Recall our multilevel meta-analytic model we previously discussed:

\[ y_{i} = \mu + s_{j[i]} + spp_{k[i]} + e_{i} + m_{i} \\ m_{i} \sim N(0, v_{i}\textbf{I}) \\ s_{j} \sim N(0, \tau^2\textbf{I}) \\ s_{k} \sim N(0, \sigma_{k}^2\textbf{I}) \\ e_{i} \sim N(0, \sigma_{e}^2\textbf{I}) \]

Again, \(y_{i}\) is the ith effect size estimate and \(m_{i}\) is the sampling error (deviation from \(\mu\)) for effect size i. \(e_{i}\) is the effect-size-specific effect (within study effect) (that’s a mouthful!) applied to each effect i. \(s_{j[i]}\) is the study specific effect, j, applied to the ith effect size, where j = 1,…,\(N_{studies}\), \(spp_{k[i]}\) is the species specific effect, k, applied to the ith effect size, where k = 1,…,\(N_{species}\). As we discussed in the previous tutorial on non-independence, \(\textbf{I}\) denotes the fact that effect size estimates are independently and identically distributed.

This model accounts for the fact that effect size values from the same study and species are clustered (i.e. non-independent) because they all share the same ‘effect’ sampled from the same distribution. However, in reality, this doesn’t account for the fact that some species effects are more similar to each other than others because of the shared evolutionary history between species (Hadfield, 2010; Lajeunesse, 2009; Nakagawa and Santos, 2012).

To account for shared evolutionary history between species we need to modify the \(\textbf{I}\) matrix to model the correlation between ‘species effects’. This can be done by estimating a new random phylogenetic effect (\(a_{k}\)), and replacing the \(\textbf{I}\) matrix with an \(\textbf{A}\) matrix as follows:

\[ y_{i} = \mu + s_{j[i]} + spp_{k[i]} + a_{sp[i]}+ e_{i} + m_{i} \\ m_{i} \sim N(0, v_{i}\textbf{I}) \\ s_{j} \sim N(0, \tau^2\textbf{I}) \\ spp_{k} \sim N(0, \sigma_{k}^2\textbf{I}) \\ a_{k} \sim N(0, \sigma_{a}^2\textbf{A}) \\ e_{i} \sim N(0, \sigma_{e}^2\textbf{I}) \]

Here, \(a_{k[i]}\) is a phylogenetic effect for the k species applied to effect size i. These effects are now sampled from a normal distribution (denoted by N) with a mean of 0 and \(\sigma_{a}^2\) which is the phylogenetic variance. However, now \(\textbf{A}\) is a \(N_{species}\) by \(N_{species}\) correlation matrix of distances between species extracted from a phylogenetic tree, which means that species effects will be correlated (off-diagonals will be non-zero) based on the distance from the root of the tree to the most recent common ancestor (Hadfield, 2010; Nakagawa and Santos, 2012; Noble et al., 2017).

You will notice that we have two species-level random effects in our model. We are ‘trying’ (and we may not succeed) to distinguish between species-specific effects driven by shared evolutionary history (i.e., \(a_{k}\)) and non-phylogenetic species-specific effects (i.e., \(spp_{k}\)), driven by, say, shared ecology. It may not always be possible to estimate both effects in a single model. It is worth keeping this in mind as these are data-hungry models and estimating these parameters will require a large sample size. Often, it is necessary to resort to some form of model selection or to use your a priori judgement about what species-specific effect is most important.

Example of Phylogenetic Multi-level Meta-analysis

Format raw data for analysis

We will use the dispersal data set (Z-transformed correlation coefficient or Zr) from Wu and Seebacher (2022) (see Effect Size tutorial on background and effect size calculation) to demonstrate the following analysis. First, load the the raw data and add the following additional column effect size ID (es_ID). Then, calculate effect size as Zr and sampling variance as V_Zr.

## Load packages
# install.packages("pacman")
pacman::p_load(tidyverse, metafor, ape, phytools)

zr_data <- read.csv("https://raw.githubusercontent.com/daniel1noble/meta-workshop/gh-pages/data/ind_disp_raw_data.csv") %>%
  dplyr::select(-dispersal_def) %>% # remove irrelevant columns
  tibble::rowid_to_column("es_ID") %>% # add effect size ID
  dplyr::mutate(species_OTL = stringr::str_replace_all(species_OTL, "_", " ")) # remove underscore for some species with missing underscore. Some inconsistency when curating the database.

# Calculate Fisher's r-to-z transformed correlation coefficient (ZCOR) as yi = effect size and vi = sampling variances, where ri = raw correlation coefficients, and ni = sample size.
zr_data <- metafor::escalc(measure = "ZCOR", ri = corr_coeff, ni = sample_size, data = zr_data, var.names = c("Zr", "V_Zr"))

Phylogenetic reconstruction

We will now reconstruct the phylogeny for species extracted in the raw data set. There are different ways to obtain phylogenetic trees. You can extract trees as a .nex or .tre file from existing phylogeny database or studies with such data available (see https://vertlife.org/data/ for example taxa specific trees), load it to R using the ape package, and prune (keep only the species you have) the tree to match the data set. See the drop_tip function by Liam Revell to prune tree tips to match the data set species name.

If you are comparing effect sizes from a broad phylogenetic range (e.g. including Prokaryotes or Plantae), you can extract species names from the Open Tree of Life (OTL) data set using the rotl R package. Tutorials for the rotl R package can be found here: https://cran.r-project.org/web/packages/rotl/vignettes/rotl.html

For the purpose of this tutorial, we will follow the second method. Notice that we have a separate column in the raw data set called species_OTL. This is because to reconstruct the phylogeny, we need the names in the data set to match the OTL data set. The species names described in the paper (especially older papers) may not match the OTL names. Check that the names match before running the following codes or else it will not run successfully. This also applies to any phylogenetic trees extracted.

IMPORTANT NOTE

Errors in matching names

What can we do if errors occur when matching taxa names in the phylogeny with those in the data set?

Suggestions

Name matching errors come in different forms and flavours with the OTL. The easiest way to avoid most errors is to check that the names in the raw data match the OTL database each time you add a new species. OTL will also check for taxa synoymn’s – changes to taxa names. You can check for these and update your dataset so the taxa match. This will address many errors matching names down the track. We have included code to check which names are inconsistent in case you missed some. The remaining errors come down to formatting which you can see below. As an example with this dataset, we have to re-add the underscore between the Genus and species name because some names (I noticed 3 species in this dataset) such as “Gambusia affinis” (without the underscore) matches up with “Bambusia affinis” in the OTL database. Clearly this is a different taxa. We can get around it by first removing the underscore before matching the names (see code for raw before), then adding it back when matching the names to the tree (which has underscores).

Build the tree and create correlation matrix

# Load additional packages for phylogenetic reconstruction
pacman::p_load(rotl, ape)

# Create species list based on OTL names and match with OTL database
species <- sort(unique(as.character(zr_data$species_OTL)))  # generate list of species (as character format)
taxa <- rotl::tnrs_match_names(names = species)  # match taxonomic names to the OTL

# Check if species list do not match OTL identifier
taxa[taxa$approximate_match == TRUE, ]  # none so far

## [1] search_string     unique_name       approximate_match ott_id           
## [5] is_synonym        flags             number_matches   
## <0 rows> (or 0-length row.names)

length(unique(zr_data$species_OTL))  # 74 species

## [1] 74

nrow(taxa)  # 74 species

## [1] 74

# Retrieving phylogenetic relationships among taxa in the form of a trimmed
# sub-tree
tree <- rotl::tol_induced_subtree(ott_ids = ott_id(taxa), label_format = "name")

## 
Progress [---------------------------------] 0/984 (  0) ?s
Progress [==============================] 984/984 (100)  0s

tree_tip_label <- tree$tip.label  # extract tree tip names
species_list <- unique(stringr::str_replace_all(zr_data$species_OTL, " ", "_"))  # extract species name and add underscore to match tree_trip_label

setdiff(tree_tip_label, species_list)  # check what names from the raw data is not found in tree.

## character(0)

# If names do not match the tree, then replace name in raw data to match tree

# You can also visually inspect the tree via plot function to check if the
# species extracted are correct. plot(tree, cex = 0.6, label.offset = 0.1,
# no.margin = TRUE) # plot tree

set.seed(1)
tree <- ape::compute.brlen(tree, method = "Grafen", power = 1)  # compute branch lengths
tree <- ape::multi2di(tree, random = TRUE)  # use a randomization approach to deal with polytomies

# Note, an internal node of a phylogenetic tree is described as a polytomy if
# (i) it is in a rooted tree and is linked to three or more subtrees or (ii) it
# is in an unrooted tree and is attached to four or more branches which is
# commonly observed with data extracted from OTL. This is the result of
# insufficient phylogenetic information such such as divergence time. To
# resolve polytomies, we use the multi2di function from the ape package.

# Check tree is ultrametric ape::is.ultrametric(tree) # TRUE

# Create correlation matrix for analysis
phylo_cor <- vcv(tree, cor = T)

We can also have a look at the final tree:

plot(tree, type = "phylogram", cex = 0.3)

Figure 1: Phylogenetic tree generated for species in the Wu & Seebacher (2022) dataset.

Run phylogenetically corrected, multi-level meta-analysis on overall effect

Lets say we are interested in the overall relationship of physiology and movement (Zr). The following code runs a phylogeneticlly corrected, multi-level meta-analysis via the rma.mv function. Building upon the multi-level meta-analytic model from the Multi-level Meta-analysis tutorial, we also included random effects (random argument) to account for non-independence such as es_ID (effect size identity), study_ID (unique study identity), species (unique species identity), species_OTL (species relatedness). The R argument allows the correlation matrix (here, phylo_cor) to be incorporated to the model which corresponds to the components specified in the random argument ~1 | species_OTL.

Note: remember to add underscore to the species OTL because the tree tip names have underscores.

overall_model <- metafor::rma.mv(yi = Zr, V = V_Zr, mod = ~1, random = list(~1 |
    study_ID, ~1 | es_ID, ~1 | species, ~1 | species_OTL), R = list(species_OTL = phylo_cor),
    method = "REML", test = "t", dfs = "contain", data = zr_data %>%
        dplyr::mutate(species_OTL = stringr::str_replace_all(species_OTL, " ", "_")))
summary(overall_model)

## 
## Multivariate Meta-Analysis Model (k = 272; method: REML)
## 
##    logLik   Deviance        AIC        BIC       AICc   
## -160.5426   321.0852   331.0852   349.0958   331.3116   
## 
## Variance Components:
## 
##             estim    sqrt  nlvls  fixed       factor    R 
## sigma^2.1  0.0016  0.0398     88     no     study_ID   no 
## sigma^2.2  0.1155  0.3399    272     no        es_ID   no 
## sigma^2.3  0.0260  0.1614     79     no      species   no 
## sigma^2.4  0.0000  0.0000     74     no  species_OTL  yes 
## 
## Test for Heterogeneity:
## Q(df = 271) = 2508.8317, p-val < .0001
## 
## Model Results:
## 
## estimate      se    tval  df    pval   ci.lb   ci.ub     
##   0.0979  0.0347  2.8202  73  0.0062  0.0287  0.1671  ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What does the model output tell us?

Task

Is there an overall positive or negative relationship between physiology and movement?

Answer

We can see from above that the model estimates a positive relationship between physiology and movement with an overall mean of 0.098 with a 95% confidence interval (95% CI) of 0.029 to 0.167. In this analysis, we also observe high heterogeneity among our effect size estimates with our Q-test being significant, a very common observation in ecological and evolutionary meta-analysis studies. Some of this heterogeneity may be driven by variation among studies, within studies, among species, and phylogeny as seen by the variance components of the model.

References

Chamberlain, S. A., Hovick, S. M., Dibble, C. J., Rasmussen, N. L., Van Allen, B. G. and Maitner, B. S. (2012). Does phylogeny matter? Assessing the impact of phylogenetic information in ecological meta-analysis. Ecology Letters 15, 627–636.

Cinar, O., Nakagawa, S. and Viechtbauer, W. (2021). Phylogenetic multilevel meta-analysis: A simulation study on the importance1of modeling the phylogeny. EcoEvoRxiv, https://ecoevorxiv.org/su4zv/.

Gurevitch, J. and Hedges, L. V. (1999). Statistical issues in ecological meta-analyses. Ecology 80, 1142–1149.

Hadfield, J. (2010). MCMC methods for multi-response generalized linear mixed models: The MCMCglmm R package. Journal of Statistical Software 33, 1–22.

Lajeunesse, M. J. (2009). Meta-analysis and the comparative phylogenetic method. The American Naturalist 174, 369–381.

Nakagawa, S. and Santos, E. S. (2012). Methodological issues and advances in biological meta-analysis. Evolutionary Ecology 26, 1253–1274.

Noble, D. W. A., Lagisz, M., O’Dea, R. E. and Nakagawa, S. (2017). Non‐independence and sensitivity analyses in ecological and evolutionary meta‐analyses. Molecular Ecology 26, 2410–2425.

Wu, N. C. and Seebacher, F. (2022). Physiology can predict animal activity, exploration, and dispersal. Communications Biology 5, 1–11.

Session Information

R version 4.0.3 (2020-10-10)

Platform: x86_64-apple-darwin17.0 (64-bit)

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: rotl(v.3.0.11), phytools(v.0.7-80), maps(v.3.4.0), ape(v.5.6-2), magick(v.2.7.2), vembedr(v.0.1.5), equatags(v.0.1.0), mathjaxr(v.1.6-0), pander(v.0.6.4), orchaRd(v.2.0), forcats(v.0.5.1), stringr(v.1.4.0), dplyr(v.1.0.9), purrr(v.0.3.4), readr(v.2.1.2), tidyr(v.1.2.0), tibble(v.3.1.6), ggplot2(v.3.3.5), tidyverse(v.1.3.1), flextable(v.0.6.6), metafor(v.3.4-0), metadat(v.0.2-0) and Matrix(v.1.3-4)

loaded via a namespace (and not attached): colorspace(v.2.0-3), ellipsis(v.0.3.2), base64enc(v.0.1-3), fs(v.1.5.2), rstudioapi(v.0.13), remotes(v.2.4.2), fansi(v.1.0.3), lubridate(v.1.8.0), xml2(v.1.3.3), codetools(v.0.2-18), mnormt(v.2.0.2), knitr(v.1.39), jsonlite(v.1.8.0), broom(v.0.7.12), dbplyr(v.2.1.1), rentrez(v.1.2.3), compiler(v.4.0.3), httr(v.1.4.2), backports(v.1.4.1), assertthat(v.0.2.1), fastmap(v.1.1.0), cli(v.3.3.0), formatR(v.1.11), prettyunits(v.1.1.1), htmltools(v.0.5.2), tools(v.4.0.3), igraph(v.1.2.6), coda(v.0.19-4), gtable(v.0.3.0), glue(v.1.6.2), clusterGeneration(v.1.3.7), rappdirs(v.0.3.3), fastmatch(v.1.1-0), Rcpp(v.1.0.8.3), cellranger(v.1.1.0), jquerylib(v.0.1.4), vctrs(v.0.4.1), nlme(v.3.1-152), xfun(v.0.30), rvest(v.1.0.2), lifecycle(v.1.0.1), pacman(v.0.5.1), phangorn(v.2.7.0), XML(v.3.99-0.9), klippy(v.0.0.0.9500), MASS(v.7.3-54), scales(v.1.2.0), hms(v.1.1.1), parallel(v.4.0.3), expm(v.0.999-6), yaml(v.2.3.5), curl(v.4.3.2), gdtools(v.0.2.3), sass(v.0.4.1), stringi(v.1.7.6), highr(v.0.9), plotrix(v.3.8-1), zip(v.2.2.0), rlang(v.1.0.2), pkgconfig(v.2.0.3), systemfonts(v.1.0.2), rncl(v.0.8.4), evaluate(v.0.15), lattice(v.0.20-44), tidyselect(v.1.1.2), magrittr(v.2.0.3), bookdown(v.0.22), R6(v.2.5.1), generics(v.0.1.2), combinat(v.0.0-8), DBI(v.1.1.2), pillar(v.1.7.0), haven(v.2.4.3), withr(v.2.5.0), scatterplot3d(v.0.3-41), modelr(v.0.1.8), crayon(v.1.5.1), locatexec(v.0.1.1), uuid(v.1.0-4), utf8(v.1.2.2), tmvnsim(v.1.0-2), tzdb(v.0.3.0), rmarkdown(v.2.13), officer(v.0.3.18), progress(v.1.2.2), grid(v.4.0.3), readxl(v.1.4.0), data.table(v.1.14.2), xslt(v.1.4.3), reprex(v.2.0.1), digest(v.0.6.29), numDeriv(v.2016.8-1.1), munsell(v.0.5.0), bslib(v.0.3.1) and quadprog(v.1.5-8)