Meta-analyses in comparative physiology almost always synthesise effects across a multitude of species (Chamberlain et al., 2012; Hadfield, 2010; Nakagawa and Santos, 2012). Species share an evolutionary history which is described by a phylogeny (Chamberlain et al., 2012; Gurevitch and Hedges, 1999; Lajeunesse, 2009). As a result, the samples (and the effect sizes obtained from these samples) are not independent which violates the independence assumption underlying conventional meta-analytic models. For example, the standard fixed- and random-effects models often used for ecological meta-analyses (Lajeunesse, 2009; Nakagawa and Santos, 2012), assume independence among the effect sizes and therefore do not account for phylogeny (Chamberlain et al., 2012; Noble et al., 2017). Including a ‘species-level’ random effect will not deal with this type of non-independence because the ‘effects’ are themselves correlated based on the amount of shared evolutionary history between two species (See the Complex Non-independence Tutorial).
Recent simulations by Cinar et al. (2021) emphasise the importance of modelling phylogeny in meta-analytic models. Even when there is little phylogenetic signal in the data it will not compromise the analysis to have a phylogeny included, but it will protect you against Type I errors. In this tutorial, we expand on our multilevel meta-analytic models to account for phylogenetic relatedness. Part of the challenge in controlling for phylogeny in a meta-analysis is that the taxa included are often highly diverse and there is little resolution on the exact phylogenetic relationship among the taxa in question. Having said that, there are tools that will probably do a fairly good job delineating the, showing how we can build a rough phylogeny that can be included.
Recall our multilevel meta-analytic model we previously discussed:
\[ y_{i} = \mu + s_{j[i]} + spp_{k[i]} + e_{i} + m_{i} \\ m_{i} \sim N(0, v_{i}\textbf{I}) \\ s_{j} \sim N(0, \tau^2\textbf{I}) \\ s_{k} \sim N(0, \sigma_{k}^2\textbf{I}) \\ e_{i} \sim N(0, \sigma_{e}^2\textbf{I}) \]
Again, \(y_{i}\) is the ith effect size estimate and \(m_{i}\) is the sampling error (deviation from \(\mu\)) for effect size i. \(e_{i}\) is the effect-size-specific effect (within study effect) (that’s a mouthful!) applied to each effect i. \(s_{j[i]}\) is the study specific effect, j, applied to the ith effect size, where j = 1,…,\(N_{studies}\), \(spp_{k[i]}\) is the species specific effect, k, applied to the ith effect size, where k = 1,…,\(N_{species}\). As we discussed in the previous tutorial on non-independence, \(\textbf{I}\) denotes the fact that effect size estimates are independently and identically distributed.
This model accounts for the fact that effect size values from the same study and species are clustered (i.e. non-independent) because they all share the same ‘effect’ sampled from the same distribution. However, in reality, this doesn’t account for the fact that some species effects are more similar to each other than others because of the shared evolutionary history between species (Hadfield, 2010; Lajeunesse, 2009; Nakagawa and Santos, 2012).
To account for shared evolutionary history between species we need to modify the \(\textbf{I}\) matrix to model the correlation between ‘species effects’. This can be done by estimating a new random phylogenetic effect (\(a_{k}\)), and replacing the \(\textbf{I}\) matrix with an \(\textbf{A}\) matrix as follows:
\[ y_{i} = \mu + s_{j[i]} + spp_{k[i]} + a_{sp[i]}+ e_{i} + m_{i} \\ m_{i} \sim N(0, v_{i}\textbf{I}) \\ s_{j} \sim N(0, \tau^2\textbf{I}) \\ spp_{k} \sim N(0, \sigma_{k}^2\textbf{I}) \\ a_{k} \sim N(0, \sigma_{a}^2\textbf{A}) \\ e_{i} \sim N(0, \sigma_{e}^2\textbf{I}) \]
Here, \(a_{k[i]}\) is a phylogenetic effect for the k species applied to effect size i. These effects are now sampled from a normal distribution (denoted by N) with a mean of 0 and \(\sigma_{a}^2\) which is the phylogenetic variance. However, now \(\textbf{A}\) is a \(N_{species}\) by \(N_{species}\) correlation matrix of distances between species extracted from a phylogenetic tree, which means that species effects will be correlated (off-diagonals will be non-zero) based on the distance from the root of the tree to the most recent common ancestor (Hadfield, 2010; Nakagawa and Santos, 2012; Noble et al., 2017).
You will notice that we have two species-level random effects in our model. We are ‘trying’ (and we may not succeed) to distinguish between species-specific effects driven by shared evolutionary history (i.e., \(a_{k}\)) and non-phylogenetic species-specific effects (i.e., \(spp_{k}\)), driven by, say, shared ecology. It may not always be possible to estimate both effects in a single model. It is worth keeping this in mind as these are data-hungry models and estimating these parameters will require a large sample size. Often, it is necessary to resort to some form of model selection or to use your a priori judgement about what species-specific effect is most important.
We will use the dispersal data set (Z-transformed correlation coefficient or Zr) from Wu and Seebacher (2022) (see Effect Size tutorial on background and effect size calculation) to demonstrate the following analysis. First, load the the raw data and add the following additional column effect size ID (es_ID
). Then, calculate effect size as Zr
and sampling variance as V_Zr
.
## Load packages
# install.packages("pacman")
pacman::p_load(tidyverse, metafor, ape, phytools)
zr_data <- read.csv("https://raw.githubusercontent.com/daniel1noble/meta-workshop/gh-pages/data/ind_disp_raw_data.csv") %>%
dplyr::select(-dispersal_def) %>% # remove irrelevant columns
tibble::rowid_to_column("es_ID") %>% # add effect size ID
dplyr::mutate(species_OTL = stringr::str_replace_all(species_OTL, "_", " ")) # remove underscore for some species with missing underscore. Some inconsistency when curating the database.
# Calculate Fisher's r-to-z transformed correlation coefficient (ZCOR) as yi = effect size and vi = sampling variances, where ri = raw correlation coefficients, and ni = sample size.
zr_data <- metafor::escalc(measure = "ZCOR", ri = corr_coeff, ni = sample_size, data = zr_data, var.names = c("Zr", "V_Zr"))
We will now reconstruct the phylogeny for species extracted in the raw data set. There are different ways to obtain phylogenetic trees. You can extract trees as a .nex
or .tre
file from existing phylogeny database or studies with such data available (see https://vertlife.org/data/ for example taxa specific trees), load it to R using the ape
package, and prune (keep only the species you have) the tree to match the data set. See the drop_tip
function by Liam Revell to prune tree tips to match the data set species name.
If you are comparing effect sizes from a broad phylogenetic range (e.g. including Prokaryotes or Plantae), you can extract species names from the Open Tree of Life (OTL) data set using the rotl
R package. Tutorials for the rotl
R package can be found here: https://cran.r-project.org/web/packages/rotl/vignettes/rotl.html
For the purpose of this tutorial, we will follow the second method. Notice that we have a separate column in the raw data set called species_OTL
. This is because to reconstruct the phylogeny, we need the names in the data set to match the OTL data set. The species names described in the paper (especially older papers) may not match the OTL names. Check that the names match before running the following codes or else it will not run successfully. This also applies to any phylogenetic trees extracted.
What can we do if errors occur when matching taxa names in the phylogeny with those in the data set?
Name matching errors come in different forms and flavours with the OTL. The easiest way to avoid most errors is to check that the names in the raw data match the OTL database each time you add a new species. OTL will also check for taxa synoymn’s – changes to taxa names. You can check for these and update your dataset so the taxa match. This will address many errors matching names down the track. We have included code to check which names are inconsistent in case you missed some. The remaining errors come down to formatting which you can see below. As an example with this dataset, we have to re-add the underscore between the Genus and species name because some names (I noticed 3 species in this dataset) such as “Gambusia affinis” (without the underscore) matches up with “Bambusia affinis” in the OTL database. Clearly this is a different taxa. We can get around it by first removing the underscore before matching the names (see code for raw before), then adding it back when matching the names to the tree (which has underscores).
# Load additional packages for phylogenetic reconstruction
pacman::p_load(rotl, ape)
# Create species list based on OTL names and match with OTL database
species <- sort(unique(as.character(zr_data$species_OTL))) # generate list of species (as character format)
taxa <- rotl::tnrs_match_names(names = species) # match taxonomic names to the OTL
# Check if species list do not match OTL identifier
taxa[taxa$approximate_match == TRUE, ] # none so far
## [1] search_string unique_name approximate_match ott_id
## [5] is_synonym flags number_matches
## <0 rows> (or 0-length row.names)
length(unique(zr_data$species_OTL)) # 74 species
## [1] 74
nrow(taxa) # 74 species
## [1] 74
# Retrieving phylogenetic relationships among taxa in the form of a trimmed
# sub-tree
tree <- rotl::tol_induced_subtree(ott_ids = ott_id(taxa), label_format = "name")
##
Progress [---------------------------------] 0/984 ( 0) ?s
Progress [==============================] 984/984 (100) 0s
tree_tip_label <- tree$tip.label # extract tree tip names
species_list <- unique(stringr::str_replace_all(zr_data$species_OTL, " ", "_")) # extract species name and add underscore to match tree_trip_label
setdiff(tree_tip_label, species_list) # check what names from the raw data is not found in tree.
## character(0)
# If names do not match the tree, then replace name in raw data to match tree
# You can also visually inspect the tree via plot function to check if the
# species extracted are correct. plot(tree, cex = 0.6, label.offset = 0.1,
# no.margin = TRUE) # plot tree
set.seed(1)
tree <- ape::compute.brlen(tree, method = "Grafen", power = 1) # compute branch lengths
tree <- ape::multi2di(tree, random = TRUE) # use a randomization approach to deal with polytomies
# Note, an internal node of a phylogenetic tree is described as a polytomy if
# (i) it is in a rooted tree and is linked to three or more subtrees or (ii) it
# is in an unrooted tree and is attached to four or more branches which is
# commonly observed with data extracted from OTL. This is the result of
# insufficient phylogenetic information such such as divergence time. To
# resolve polytomies, we use the multi2di function from the ape package.
# Check tree is ultrametric ape::is.ultrametric(tree) # TRUE
# Create correlation matrix for analysis
phylo_cor <- vcv(tree, cor = T)
We can also have a look at the final tree:
plot(tree, type = "phylogram", cex = 0.3)
Lets say we are interested in the overall relationship of physiology and movement (Zr). The following code runs a phylogeneticlly corrected, multi-level meta-analysis via the rma.mv
function. Building upon the multi-level meta-analytic model from the Multi-level Meta-analysis tutorial, we also included random effects (random
argument) to account for non-independence such as es_ID
(effect size identity), study_ID
(unique study identity), species
(unique species identity), species_OTL
(species relatedness). The R
argument allows the correlation matrix (here, phylo_cor
) to be incorporated to the model which corresponds to the components specified in the random
argument ~1 | species_OTL
.
Note: remember to add underscore to the species OTL because the tree tip names have underscores.
overall_model <- metafor::rma.mv(yi = Zr, V = V_Zr, mod = ~1, random = list(~1 |
study_ID, ~1 | es_ID, ~1 | species, ~1 | species_OTL), R = list(species_OTL = phylo_cor),
method = "REML", test = "t", dfs = "contain", data = zr_data %>%
dplyr::mutate(species_OTL = stringr::str_replace_all(species_OTL, " ", "_")))
summary(overall_model)
##
## Multivariate Meta-Analysis Model (k = 272; method: REML)
##
## logLik Deviance AIC BIC AICc
## -160.5426 321.0852 331.0852 349.0958 331.3116
##
## Variance Components:
##
## estim sqrt nlvls fixed factor R
## sigma^2.1 0.0016 0.0398 88 no study_ID no
## sigma^2.2 0.1155 0.3399 272 no es_ID no
## sigma^2.3 0.0260 0.1614 79 no species no
## sigma^2.4 0.0000 0.0000 74 no species_OTL yes
##
## Test for Heterogeneity:
## Q(df = 271) = 2508.8317, p-val < .0001
##
## Model Results:
##
## estimate se tval df pval ci.lb ci.ub
## 0.0979 0.0347 2.8202 73 0.0062 0.0287 0.1671 **
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is there an overall positive or negative relationship between physiology and movement?
We can see from above that the model estimates a positive relationship between physiology and movement with an overall mean of 0.098 with a 95% confidence interval (95% CI) of 0.029 to 0.167. In this analysis, we also observe high heterogeneity among our effect size estimates with our Q-test being significant, a very common observation in ecological and evolutionary meta-analysis studies. Some of this heterogeneity may be driven by variation among studies, within studies, among species, and phylogeny as seen by the variance components of the model.
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: rotl(v.3.0.11), phytools(v.0.7-80), maps(v.3.4.0), ape(v.5.6-2), magick(v.2.7.2), vembedr(v.0.1.5), equatags(v.0.1.0), mathjaxr(v.1.6-0), pander(v.0.6.4), orchaRd(v.2.0), forcats(v.0.5.1), stringr(v.1.4.0), dplyr(v.1.0.9), purrr(v.0.3.4), readr(v.2.1.2), tidyr(v.1.2.0), tibble(v.3.1.6), ggplot2(v.3.3.5), tidyverse(v.1.3.1), flextable(v.0.6.6), metafor(v.3.4-0), metadat(v.0.2-0) and Matrix(v.1.3-4)
loaded via a namespace (and not attached): colorspace(v.2.0-3), ellipsis(v.0.3.2), base64enc(v.0.1-3), fs(v.1.5.2), rstudioapi(v.0.13), remotes(v.2.4.2), fansi(v.1.0.3), lubridate(v.1.8.0), xml2(v.1.3.3), codetools(v.0.2-18), mnormt(v.2.0.2), knitr(v.1.39), jsonlite(v.1.8.0), broom(v.0.7.12), dbplyr(v.2.1.1), rentrez(v.1.2.3), compiler(v.4.0.3), httr(v.1.4.2), backports(v.1.4.1), assertthat(v.0.2.1), fastmap(v.1.1.0), cli(v.3.3.0), formatR(v.1.11), prettyunits(v.1.1.1), htmltools(v.0.5.2), tools(v.4.0.3), igraph(v.1.2.6), coda(v.0.19-4), gtable(v.0.3.0), glue(v.1.6.2), clusterGeneration(v.1.3.7), rappdirs(v.0.3.3), fastmatch(v.1.1-0), Rcpp(v.1.0.8.3), cellranger(v.1.1.0), jquerylib(v.0.1.4), vctrs(v.0.4.1), nlme(v.3.1-152), xfun(v.0.30), rvest(v.1.0.2), lifecycle(v.1.0.1), pacman(v.0.5.1), phangorn(v.2.7.0), XML(v.3.99-0.9), klippy(v.0.0.0.9500), MASS(v.7.3-54), scales(v.1.2.0), hms(v.1.1.1), parallel(v.4.0.3), expm(v.0.999-6), yaml(v.2.3.5), curl(v.4.3.2), gdtools(v.0.2.3), sass(v.0.4.1), stringi(v.1.7.6), highr(v.0.9), plotrix(v.3.8-1), zip(v.2.2.0), rlang(v.1.0.2), pkgconfig(v.2.0.3), systemfonts(v.1.0.2), rncl(v.0.8.4), evaluate(v.0.15), lattice(v.0.20-44), tidyselect(v.1.1.2), magrittr(v.2.0.3), bookdown(v.0.22), R6(v.2.5.1), generics(v.0.1.2), combinat(v.0.0-8), DBI(v.1.1.2), pillar(v.1.7.0), haven(v.2.4.3), withr(v.2.5.0), scatterplot3d(v.0.3-41), modelr(v.0.1.8), crayon(v.1.5.1), locatexec(v.0.1.1), uuid(v.1.0-4), utf8(v.1.2.2), tmvnsim(v.1.0-2), tzdb(v.0.3.0), rmarkdown(v.2.13), officer(v.0.3.18), progress(v.1.2.2), grid(v.4.0.3), readxl(v.1.4.0), data.table(v.1.14.2), xslt(v.1.4.3), reprex(v.2.0.1), digest(v.0.6.29), numDeriv(v.2016.8-1.1), munsell(v.0.5.0), bslib(v.0.3.1) and quadprog(v.1.5-8)