Code
# Install from github and load
#install.packages("pacman")
pacman::p_load(devtools)
devtools::install_github("daniel1noble/BACE")
pacman::p_load(BACE, ape, phytools, MCMCglmm, dplyr, magrittr)We provide a brief introduction to the BACE package for Bayesian phylogenetic and correlated trait imputation. The BACE package allows users to impute missing data in comparative datasets using Bayesian methods that account for phylogenetic relationships among species.
You can install the BACE package from GitHub using the following command:
# Install from github and load
#install.packages("pacman")
pacman::p_load(devtools)
devtools::install_github("daniel1noble/BACE")
pacman::p_load(BACE, ape, phytools, MCMCglmm, dplyr, magrittr)If you encounter any issues during installation, please ensure that you have the required dependencies installed.
The main function in the BACE package is bace_imp(), which performs Bayesian imputation using chained equations. You can access the documentation for this function using the following command:
?bace_impAs you will see, the function allows you to specify fixed effects, random effects, phylogenetic tree, and other parameters for the imputation process. At the bare minimum, you need to provide a formula for the fixed effects, a formula for the random effects (including phylogeny), a phylogenetic tree, and the dataset with missing values.
We’ll first simulate a dataset with missing values and a phylogenetic tree. We can do this using the sim_bace() function.
# Set seed for reproducibility
set.seed(123)
# Simulate data. Note that threshold4 indicates an ordinal variable with 4 levels, and multinomial3 indicates a categorical variable with 3 levels. You can replace the numbers
sim_data <- sim_bace(response_type = "poisson",
predictor_types = c("gaussian", "binary", "threshold4", "multinomial3"),
phylo_signal = c(0.55, 0.75, 0.8, 0.7, 0.7),
missingness = c(0.2, 0.3, 0.4, 0.2, 0.5),
rr = TRUE,
rr_form = list(species = c("x1")))Generated default beta_matrix with pre-set sparsity
Generated default beta_resp: -0.043, 0.457, -0.047, 0.178
We can then view the simulated data and phylogenetic tree because the sim_bace() function returns a list containing both objects. It also has a lot of other useful information about the simulation users can explore.
# Object contains both the simulated data and phylogeny
data <- sim_data$data
tree <- sim_data$tree
# View first few rows of the data
head(data) species y x1 x2 x3 x4
1 jdT90 8 0.02663503 <NA> 4 <NA>
2 hucQX 3 -3.14266778 0 4 A
3 utGJz 2 NA 0 3 A
4 m6Qzt 0 -1.71657132 0 3 <NA>
5 Trvsf NA -0.97949036 0 <NA> <NA>
6 Ec8xV 4 0.04636864 0 2 A
Here, we can see that the dataset contains lots of missing values across different variable types. Of course, with normal modelling, we would need to exclude these cases to use a complete dataset. However, with BACE we simply feed in the entire dataset, specify the models we want to use for imputation (or the real model of interest), and impute the missing data.
Now that we have the data and the phylogenetic tree we can use the bace_imp() function to impute the missing data.
It’s very important to ensure that all variables are correctly classified in the dataset (e.g., factors for categorical variables, numeric for continuous variables, etc.) before running the imputation. BACE will fit different models for different variable types, and so, if they have not been correctly classified you will be modeling the variable incorrectly.
If you want to see what BACE has classified each variable as, you can use the extract the types using the following command:
bace_types <- bace_impute$typesImportantly, users also need to think carefully about the distributions of the variables themselves and transform them if necessary prior to running bace_imp (e.g., log-transforming skewed continuous variables).
There are two ways you can use bace_imp. The first is to specify a list of all the formulas for each variable you would like to fit. It is important here that, if you provide a list, that any variable with missing data is found as a response variable. This must be the case because BACE will not automatically build models for other variables if you provide a list of formulas (but see below). For example, we can run BACE as follows. Note that we specify a formula for each variable in the dataset that has missing data.
# First check that all variables are correctly classified. If using sim_bace() this is done automatically, but always worth checking.
str(data)'data.frame': 200 obs. of 6 variables:
$ species: chr "jdT90" "hucQX" "utGJz" "m6Qzt" ...
$ y : int 8 3 2 0 NA 4 0 NA 8474 126 ...
$ x1 : num 0.0266 -3.1427 NA -1.7166 -0.9795 ...
$ x2 : Ord.factor w/ 2 levels "0"<"1": NA 1 1 1 1 1 NA 1 NA NA ...
$ x3 : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 4 4 3 3 NA 2 3 2 4 NA ...
$ x4 : Factor w/ 3 levels "A","B","C": NA 1 1 NA NA 1 NA 1 3 NA ...
# Perform BACE imputation. Here we specify models for each variable in the dataset.
bace_impute <- bace_imp(fixformula = list("y ~ x1 + x2",
"x1 ~ x2",
"x2 ~ x4",
"x3 ~ x1 + x2",
"x4 ~ x1 + x2"),
ran_phylo_form = "~ 1 |species",
phylo = tree,
data = data,
runs = 5, verbose = FALSE)What BACE will do here is fit a MCMCglmm model for each of the response variables specified in the formula list. At the first iteration, it will fill in missing values for any predictors using either their mean (for continuous variables) or randomly sampled values (for categorical variables). Models will then be fit using this complete predictor set and predictions will be made for missing values of the response variable. In the next run, those predictions will be used to ‘fill’ in missing values in any predictors instead. This cycle will continue until the predicted values converge. This process is repeated for the number of iterations specified by the runs argument. All the models will include the phylogenetic random effect specified in the ran_phylo_form argument.
The second way to use BACE is to just provide a model with your main variables. This may be because you have an a priori model in mind, as such, you can also use bace_imp by simply feeding in a single formula as below. Here, BACE will automatically build models for the other variables based on their types and the variables present in the dataset. It will impute missing values in the same way as described above, but using all other variables as predictors. You can also specify interactions if you wish. We can run BACE in this way as follows (this time using verbose = TRUE to see more output during the imputation process):
# Fit BACE with a single formula
bace_impute2 <- bace_imp(fixformula = "y ~ x1 + x2",
ran_phylo_form = "~ 1 |species",
phylo = tree,
data = data,
runs = 5, nitt = 20000, burnin = 5000, thin = 10) Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 993.0
Variance components: 878.1
Convergence Assessment:
Check : 5 parameters (33.3%)
Fixed : 1 parameters (6.7%)
Good : 9 parameters (60.0%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[CH] y : 3/ 5 parameters good
Check convergence: (Intercept), species
[OK] x1 : 5/ 5 parameters good
[CH] x2 : 1/ 4 parameters good (1 fixed)
Check convergence: (Intercept), y, species
========================================
As you can see from the printout above, BACE provides a summary of the imputation process, including the number of models fitted, parameters estimated, and effective sample sizes which evaluate how well the MCMC chains for parameters are mixing. This is important to check to ensure that the imputation is performing well.
BACE returns a list of models for each variable that was imputed. You can access these models and evaluate them as you would any MCMCglmm model. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable y as follows:
# Access the model for the response variable 'y'
model_y <- bace_impute2$models[["y"]]
# View summary of the model
summary(model_y)
Iterations = 5001:19991
Thinning interval = 10
Sample size = 1500
DIC: 438.3994
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.02665 0.000295 0.08646 134.6
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.8743 0.682 1.073 1500
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.01735 -0.24912 0.29627 1492 0.901
x1 -0.16399 -0.25429 -0.07381 1500 <7e-04 ***
x2.L 0.19051 -0.08472 0.42757 1500 0.152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Plot trace and density plots for model parameters
plot(model_y)We can see here that the phylogenetic variance estimate (i.e., species random effect) is mixing poorly. We should go back and re-run the imputation with more iterations or possibly adjust our prior to ensure better mixing for this parameter.