We provide a brief introduction to the BACE package for Bayesian phylogenetic and correlated trait imputation. The BACE package allows users to impute missing data in comparative datasets using Bayesian methods that account for phylogenetic relationships among species.
2 Installation
You can install the BACE package from GitHub using the following command:
Code
# Install from github and load#install.packages("pacman") pacman::p_load(devtools)devtools::install_github("daniel1noble/BACE")
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/private/var/folders/48/07fr3p6n4nv8gq_s3kl6vj2c0000gn/T/RtmpOVx4cy/remotesd45c5c94aa65/daniel1noble-BACE-c09d7ad/DESCRIPTION’ ... OK
* preparing ‘BACE’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Removed empty directory ‘BACE/.vscode’
Removed empty directory ‘BACE/dev’
Removed empty directory ‘BACE/ms’
Removed empty directory ‘BACE/vignettes’
* building ‘BACE_0.0.0.9000.tar.gz’
If you encounter any issues during installation, please ensure that you have the required dependencies installed.
3 Documentation
The main functions in the BACE package are bace() and bace_imp(), which performs Bayesian imputation using chained equations. You can access the documentation for these functions using the following commands:
Code
?bace
As you will see, the function allows you to specify fixed effects, random effects, phylogenetic tree, and other parameters for the imputation process. At the bare minimum, you need to provide a formula for the fixed effects, a formula for the random effects (including phylogeny), a phylogenetic tree, and the dataset with missing values.
At present, BACE only works with intercept-only phylogenetic and species-level random effects, but we are working on adding support for other types of random effects in the future. For most comparative datasets, this will be sufficient for modeling the phylogenetic and species-level variation in the data, but not always.
4 Example Usage
4.1 Simulating Data
We’ll first simulate a dataset with missing values and a phylogenetic tree. We can do this using the sim_bace() function.
Code
# Set seed for reproducibilityset.seed(123)# Simulate data. Note that threshold4 indicates an ordinal variable with 4 levels, and multinomial3 indicates a categorical variable with 3 levels. You can replace the numbers sim_data <-sim_bace(response_type ="poisson",predictor_types =c("gaussian", "binary", "threshold4", "multinomial3"),phylo_signal =c(0.55, 0.75, 0.8, 0.7, 0.7),missingness =c(0.2, 0.3, 0.4, 0.2, 0.5),n_cases =100,n_species =100)
Generated default beta_matrix with pre-set sparsity
We can then view the simulated data and phylogenetic tree because the sim_bace() function returns a list containing both objects. It also has a lot of other useful information about the simulation users can explore.
Code
# Object contains both the simulated data and phylogeny data <- sim_data$data tree <- sim_data$tree# View first few rows of the datahead(data)
species y x1 x2 x3 x4
1 pJUhd 1 NA <NA> 2 A
2 h62z2 0 NA <NA> <NA> B
3 69Haj 0 -3.045276 <NA> 2 B
4 gv77g 10 NA <NA> 3 <NA>
5 Arhl9 13 -1.785486 <NA> 4 <NA>
6 wLS5r 2 NA 1 2 C
Here, we can see that the dataset contains lots of missing values across different variable types. Of course, normally, we would need to exclude these cases to use a complete dataset. However, with BACE we simply feed in the entire dataset, specify the models we want to use for imputation (or the real model of interest), and impute the missing data.
4.2 Using BACE for Imputation
Now that we have the data and the phylogenetic tree we can do the imputation is to use the bace() or bace_imp() function.
Caution
It’s very important to ensure that all variables are correctly classified in the dataset (e.g., factors for categorical variables, numeric for continuous variables, etc.) before running the imputation. BACE will fit different models for different variable types, and so, if they have not been correctly classified you will be modeling the variable incorrectly.
If you want to see what BACE has classified each variable as, you can use the extract the types using the following command:
Code
bace_types <- bace_impute$types
Importantly, users also need to think carefully about the distributions of the variables themselves and transform them if necessary prior to running bace_imp (e.g., log-transforming skewed continuous variables).
There are two ways you can use bace/bace_imp. The first is to specify a list of all the formulas for each variable you would like to fit. It is important here that, if you provide a list, that any variable with missing data is found as a response variable. This must be the case because BACE will not automatically build models for other variables if you provide a list of formulas (but see below).
For now, we’ll focus on the simplified bace workflow because it does the full process for users, but later we will show how different components of the workflow can be run separately for more control over the imputation process.
4.3 Simple BACE Workflow
The bace() function is the main function to conduct a BACE analysis running each of the core steps: 1) initial imputations to reach convergence in imputed data; 2) convergence assessment to ensure that imputed data has reached a stable state; 3) imputing missing data 10 times using the final model, re-fitting and estimating parameters and then pooling posteriors across the final runs. If convergence is not achieved, it will automatically restart the initial steps to allow chains to converge before running the final set of models. You can also specify the number of runs, MCMC parameters, and whether to plot the results. The final output will include pooled models that account for imputation uncertainty.
=======================================================
BACE: Bayesian Analysis with Chained Equations
=======================================================
Step 1: Initial imputation for convergence assessment
-------------------------------------------------------
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 5242.5
Variance components: 1980.2
Convergence Assessment:
Acceptable : 3 parameters (20.0%)
Check : 3 parameters (20.0%)
Fixed : 1 parameters (6.7%)
Good : 8 parameters (53.3%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[CH] y : 2/ 5 parameters good
Check convergence: x2.L, species, units
[OK] x1 : 4/ 5 parameters good
[~] x2 : 2/ 4 parameters good (1 fixed)
========================================
Step 2: Convergence assessment
-------------------------------------------------------
Convergence not achieved. Running additional iterations...
Attempt 2 of 3
-------------------------------------------------------
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
Run 16, Imputed variable: y
Run 16, Imputed variable: x1
Run 16, Imputed variable: x2
Run 17, Imputed variable: y
Run 17, Imputed variable: x1
Run 17, Imputed variable: x2
Run 18, Imputed variable: y
Run 18, Imputed variable: x1
Run 18, Imputed variable: x2
Run 19, Imputed variable: y
Run 19, Imputed variable: x1
Run 19, Imputed variable: x2
Run 20, Imputed variable: y
Run 20, Imputed variable: x1
Run 20, Imputed variable: x2
Run 21, Imputed variable: y
Run 21, Imputed variable: x1
Run 21, Imputed variable: x2
Run 22, Imputed variable: y
Run 22, Imputed variable: x1
Run 22, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 6010.8
Variance components: 2081.7
Convergence Assessment:
Acceptable : 4 parameters (26.7%)
Check : 2 parameters (13.3%)
Fixed : 1 parameters (6.7%)
Good : 8 parameters (53.3%)
Overall: ACCEPTABLE - Most parameters converged adequately
Convergence by Response Variable:
----------------------------------------
[~] y : 3/ 5 parameters good
Check convergence: species
[OK] x1 : 4/ 5 parameters good
[~] x2 : 1/ 4 parameters good (1 fixed)
Check convergence: x1
========================================
Convergence not achieved. Running additional iterations...
Attempt 3 of 3
-------------------------------------------------------
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
Run 16, Imputed variable: y
Run 16, Imputed variable: x1
Run 16, Imputed variable: x2
Run 17, Imputed variable: y
Run 17, Imputed variable: x1
Run 17, Imputed variable: x2
Run 18, Imputed variable: y
Run 18, Imputed variable: x1
Run 18, Imputed variable: x2
Run 19, Imputed variable: y
Run 19, Imputed variable: x1
Run 19, Imputed variable: x2
Run 20, Imputed variable: y
Run 20, Imputed variable: x1
Run 20, Imputed variable: x2
Run 21, Imputed variable: y
Run 21, Imputed variable: x1
Run 21, Imputed variable: x2
Run 22, Imputed variable: y
Run 22, Imputed variable: x1
Run 22, Imputed variable: x2
Run 23, Imputed variable: y
Run 23, Imputed variable: x1
Run 23, Imputed variable: x2
Run 24, Imputed variable: y
Run 24, Imputed variable: x1
Run 24, Imputed variable: x2
Run 25, Imputed variable: y
Run 25, Imputed variable: x1
Run 25, Imputed variable: x2
Run 26, Imputed variable: y
Run 26, Imputed variable: x1
Run 26, Imputed variable: x2
Run 27, Imputed variable: y
Run 27, Imputed variable: x1
Run 27, Imputed variable: x2
Run 28, Imputed variable: y
Run 28, Imputed variable: x1
Run 28, Imputed variable: x2
Run 29, Imputed variable: y
Run 29, Imputed variable: x1
Run 29, Imputed variable: x2
Run 30, Imputed variable: y
Run 30, Imputed variable: x1
Run 30, Imputed variable: x2
Run 31, Imputed variable: y
Run 31, Imputed variable: x1
Run 31, Imputed variable: x2
Run 32, Imputed variable: y
Run 32, Imputed variable: x1
Run 32, Imputed variable: x2
Run 33, Imputed variable: y
Run 33, Imputed variable: x1
Run 33, Imputed variable: x2
Run 34, Imputed variable: y
Run 34, Imputed variable: x1
Run 34, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 5553.1
Variance components: 2083.8
Convergence Assessment:
Acceptable : 2 parameters (13.3%)
Check : 2 parameters (13.3%)
Fixed : 1 parameters (6.7%)
Good : 10 parameters (66.7%)
Overall: ACCEPTABLE - Most parameters converged adequately
Convergence by Response Variable:
----------------------------------------
[CH] y : 3/ 5 parameters good
Check convergence: species, units
[OK] x1 : 5/ 5 parameters good
[~] x2 : 2/ 4 parameters good (1 fixed)
========================================
WARNING: Convergence not achieved after 3 attempts
Consider increasing 'runs' or 'max_attempts' parameters
Proceeding with final imputation...
-------------------------------------------------------
=== Running 10 final imputation iterations ===
Run 1/10 - Imputed variable: y
Run 1/10 - Imputed variable: x1
Run 1/10 - Imputed variable: x2
Run 2/10 - Imputed variable: y
Run 2/10 - Imputed variable: x1
Run 2/10 - Imputed variable: x2
Run 3/10 - Imputed variable: y
Run 3/10 - Imputed variable: x1
Run 3/10 - Imputed variable: x2
Run 4/10 - Imputed variable: y
Run 4/10 - Imputed variable: x1
Run 4/10 - Imputed variable: x2
Run 5/10 - Imputed variable: y
Run 5/10 - Imputed variable: x1
Run 5/10 - Imputed variable: x2
Run 6/10 - Imputed variable: y
Run 6/10 - Imputed variable: x1
Run 6/10 - Imputed variable: x2
Run 7/10 - Imputed variable: y
Run 7/10 - Imputed variable: x1
Run 7/10 - Imputed variable: x2
Run 8/10 - Imputed variable: y
Run 8/10 - Imputed variable: x1
Run 8/10 - Imputed variable: x2
Run 9/10 - Imputed variable: y
Run 9/10 - Imputed variable: x1
Run 9/10 - Imputed variable: x2
Run 10/10 - Imputed variable: y
Run 10/10 - Imputed variable: x1
Run 10/10 - Imputed variable: x2
=== Final imputation complete ===
Saved 10 imputed datasets with corresponding models
Pooling posteriors with sampling: 1000 samples per imputation
=======================================================
BACE Analysis Complete!
=======================================================
Convergence achieved: FALSE
Number of attempts: 3
Final imputations: 10
Access results via:
- $pooled_models: Pooled posterior distributions
- $imputed_datasets: List of 10 imputed datasets
- $convergence: Convergence diagnostics
- $final_results: Full final imputation results
As we can see bace provides a comprehensive overview of the imputation process including each of the core steps. This includes the initial set of runs to allow data to converge, followed by convergence assessment checks both across runs and within parameters of models. Note that, if convergence is not achieved, the function will automatically restart the initial steps to allow chains to converge before running the final set of models. However, users should be aware that sometimes convergence fail yet convergence is likely achieved. If you think this is the case than you can do initial runs and skip the convergence assessment (see below). Note, it is also not always the case that parameter checks signal problems with the models. Users will need to explore final models carefully before adjusting MCMC parameters or the number of runs. More on this below.
It’s worth highlighting a few key arguments within bace.
First, the skip_conv argument allows users to skip the convergence checks and just run the imputation for the number of runs specified. This is not recommended, but it can be useful if you have a large dataset and want to run the imputation for a long time without checking convergence.
Second, the sample_size argument allows users to specify the number of posterior samples to pool across final runs. This can be useful if you have a large dataset and want to reduce the size of the final bace object.
Third, the species argument allows users to specify whether to include a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species).
Once the imputation is complete, you can access the final pooled models and evaluate them as you would any MCMCglmm model fore each of the models fit. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable y as follows:
Code
# Now have a look at the final models and the pooled resultsy_model <- bace_final1$pooled_models$models$ysummary(y_model)
+--------------------------------------------------------------+
| BACE Pooled MCMCglmm Summary (Imputation Uncertainty Included)|
+--------------------------------------------------------------+
Pooled from 10 imputations
Total posterior samples: 10000
(= 1000 sampled per imputation x 10 imputations)
Original samples per imputation: 9000
Note: Posterior summaries account for both estimation and imputation uncertainty.
--------------------------------------------------------------
Iterations = 1:10000
Thinning interval = 1
Sample size = 10000
DIC: 223.8593
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 1.33 0.003282 2.216 615.2
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.385 0.1591 0.6848 223.6
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.02102 -0.94810 0.98373 10000 0.9580
x1 -0.15900 -0.43023 0.11108 1360 0.2440
x2.L 0.36599 0.14024 0.61457 10000 0.0028 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
plot(y_model)
Code
# We can now extract posteriors in the same way we would use a normal MCMCglmm model, but now we have the pooled results that account for imputation uncertainty. For example, we can extract the posterior samples for the fixed effects as follows:hist(y_model$Sol[, "x1"], main ="Posterior distribution for x1", xlab ="x1 coefficient")
Here, we see convergence wasn’t achieved with this particular dataset after 3 attempts, and we can see in the plots of the MCMCglmm models that some of the variance estimates are not quite converged yet. It’s best to adjust bace parameters to allow for longer runtimes and better mixing of the MCMC chains. For now, however, we will ignore the convergence issues for the sake of showing the functionality.
Another important feature is that users can also access all the final datasets with imputed values for each run (as a list). This can be useful for checking the imputed values (i.e., cross-validation) and exploring the imputation process in more detail. You can access the imputed datasets as follows:
Code
# Access the imputed datasets for each run. imputed_datasets <- bace_final1$imputed_datasets
For example, we can run BACE as follows. Note that we specify a formula for each variable in the dataset that has missing data.
Code
# First check that all variables are correctly classified. If using sim_bace() this is done automatically, but always worth checking.str(data)# Perform BACE imputation. Here we specify models for each variable in the dataset.bace_impute <-bace_imp(fixformula =list("y ~ x1 + x2","x1 ~ x2","x2 ~ x4","x3 ~ x1 + x2","x4 ~ x1 + x2"),ran_phylo_form ="~ 1 |species",phylo = tree,data = data,runs =5, verbose =FALSE)
What BACE will do here is fit a MCMCglmm model for each of the response variables specified in the formula list. At the first iteration, it will fill in missing values for any predictors using either their mean (for continuous variables) or randomly sampled values (for categorical variables). Models will then be fit using this complete predictor set and predictions will be made for missing values of the response variable.
In the next run, model predictions will be used to ‘fill’ in missing values in any predictors instead. This cycle will continue until the predicted values converge. This process is repeated for the number of iterations specified by the runs argument. All the models will include the phylogenetic random effect specified in the ran_phylo_form argument.
The second way to use BACE is to just provide a model with your main variables. This may be because you have an a priori model in mind. As such, you can also simply feeding in a single formula as below. Here, BACE will automatically build models for the other variables based on their types and the variables present in the dataset. It will impute missing values in the same way as described above, but using all other variables as predictors. You can also specify interactions if you wish. We can run BACE in this way as follows (this time using verbose = TRUE to see more output during the imputation process):
Code
# Fit BACE with a single formulabace_impute2 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree,data = data,runs =5, nitt =20000, burnin =5000, thin =10)
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 1033.7
Variance components: 318.4
Convergence Assessment:
Acceptable : 1 parameters (6.7%)
Check : 7 parameters (46.7%)
Fixed : 1 parameters (6.7%)
Good : 6 parameters (40.0%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[CH] y : 1/ 5 parameters good
Check convergence: x1, x2.L, species, units
[OK] x1 : 4/ 5 parameters good
[CH] x2 : 1/ 4 parameters good (1 fixed)
Check convergence: x1, y, species
========================================
As you can see from the printout above, BACE provides a summary of the imputation process, including the number of models fitted, parameters estimated, and effective sample sizes which evaluate how well the MCMC chains for parameters are mixing. This is important to check to ensure that the imputation is performing well. Note, it is not always the case that these checks signal problems with the models. Users will need to explore final models carefully before adjusting MCMC parameters or the number of runs. More on this below.
4.4 Evaluating Final Models
BACE returns a list of models for each variable that was imputed. You can access these models and evaluate them as you would any MCMCglmm model. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable y as follows:
Code
# Access the model for the response variable 'y'model_y <- bace_impute2$models[["y"]] # View summary of the modelsummary(model_y)
Iterations = 5001:19991
Thinning interval = 10
Sample size = 1500
DIC: 214.6291
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.4175 0.000383 1.954 35.31
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.7966 0.2219 1.189 44.28
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) -0.01617 -0.75466 0.55179 2273.9 0.8947
x1 -0.05025 -0.20268 0.09815 193.2 0.5347
x2.L 0.31769 -0.01604 0.70152 330.9 0.0747 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Plot trace and density plots for model parametersplot(model_y)
We can see here that the phylogenetic variance estimate (i.e., species random effect) is mixing poorly. We should go back and re-run the imputation with more iterations or possibly adjust our prior to ensure better mixing for this parameter.
5 Phylogenetic and Non-Phylogenetic Random Effects
BACE allows users to include both phylogenetic and non-phylogenetic random effects in the imputation models. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species). In this case, users can specify a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is done by adjusting the species = FALSE argument in bacse_imp, setting it to TRUE.
Code
# Simulate data with multiple replicates per speciesset.seed(123)sim_data2 <-sim_bace(response_type ="poisson",predictor_types =c("gaussian", "binary", "threshold4", "multinomial3"),phylo_signal =c(0.55, 0.75, 0.8, 0.7, 0.7),missingness =c(0.2, 0.3, 0.4, 0.2, 0.5),rr =TRUE,n_cases =100,n_species =20,rr_form =list(species =c("x1")))# Object contains both the simulated data and phylogeny data2 <- sim_data2$data tree2 <- sim_data2$tree# View first few rows of the datahead(data2)
species y x1 x2 x3 x4
1 P9vI3 6 -0.7731097 1 4 C
2 eJBm5 NA NA 1 <NA> <NA>
3 P9vI3 1 NA <NA> 4 C
4 JE8PP 1 -2.9886121 0 <NA> <NA>
5 eJBm5 1 NA 1 3 <NA>
6 sA8eL 12 NA <NA> 2 <NA>
Code
# We can also make a table of the number of cases per species to confirm that there are multiple replicates per speciesdim(data2)
[1] 100 6
Code
length(unique(data2$species))
[1] 20
We now have a dataset with 100 cases but only 20 species, meaning that there are multiple replicates per species. We can run BACE on this dataset while including a non-phylogenetic random effect for species identity as follows:
Species effect decomposition enabled: 20 out of 20 species (100%) have replicated observations.
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 938.4
Variance components: 430.0
Convergence Assessment:
Check : 8 parameters (44.4%)
Fixed : 1 parameters (5.6%)
Good : 9 parameters (50.0%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[OK] y : 5/ 6 parameters good
Check convergence: species
[CH] x1 : 4/ 6 parameters good
Check convergence: species, species2
[CH] x2 : 0/ 5 parameters good (1 fixed)
Check convergence: (Intercept), x1, y, species, species2
========================================
Code
# We can view the final model for y to check that both random effects are includedmodel_y2 <- bace_impute3$models[["y"]]summary(model_y2)
Iterations = 5001:19991
Thinning interval = 10
Sample size = 1500
DIC: 201.2756
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.433 0.0001999 1.896 144
~species2
post.mean l-95% CI u-95% CI eff.samp
species2 0.4094 0.0004199 0.9779 248.5
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.5918 0.4049 0.8516 1282
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) -0.28377 -1.07324 0.58138 1372 0.3987
x1 -0.03534 -0.21281 0.13330 1500 0.6560
x2.L 0.63330 0.02089 1.19010 1500 0.0307 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that in this case we get a message indicating that both phylogenetic and non-phylogenetic random effects are being included in the models because we probably can decompose the variances. If you try and fit a dataset without multiple replicates per species while setting species = TRUE you will get an error because the model cannot be fit given the two parameters are not estimable.
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 961.9
Variance components: 451.3
Convergence Assessment:
Check : 8 parameters (44.4%)
Fixed : 1 parameters (5.6%)
Good : 9 parameters (50.0%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[OK] y : 5/ 6 parameters good
Check convergence: species
[CH] x1 : 4/ 6 parameters good
Check convergence: species, species2
[CH] x2 : 0/ 5 parameters good (1 fixed)
Check convergence: (Intercept), x1, y, species, species2
========================================
5.1 Adjusting MCMC Parameters For Convergence
As we can see from the model summary above, some parameters may have low effective sample sizes (ESS) which indicates poor mixing of the MCMC chains. This can be a sign that the imputation is not performing well and that more iterations are needed to achieve convergence. Lets have a look:
Iterations = 5001:19991
Thinning interval = 10
Sample size = 1500
DIC: 11.4237
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 225.1 0.000339 626.5 49.53
~species2
post.mean l-95% CI u-95% CI eff.samp
species2 80.38 0.0004792 329 13.91
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 1 1 1 0
Location effects: x2 ~ x1 + y
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 7.8610 -10.2228 26.6525 79.131 0.3400
x1 -3.3436 -6.7864 -0.3788 57.443 0.0253 *
y 3.3022 1.3582 5.5662 5.484 <7e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see x1 and x2 models have poor mixing and low ESS for some parameters. Obviously, depending on the model complexity and data size etc, some models will need longer runtimes. This is likely because the imputation process has not yet converged. To address this, we can increase the number of iterations (i.e., nitt), increase the burn-in period (i.e., burnin), and/or adjust the thinning interval (i.e., thin) to improve mixing and convergence of the MCMC chains.
Users can adjust the MCMC parameters (i.e., nitt, thin, and burnin) for each model by providing a list of values for each parameter that corresponds to the order of the formulas provided in the fixformula argument. For example, if you have 5 formulas in your fixformula list, you can provide a list of 5 values for each MCMC parameter as follows:
Code
# Define MCMC parameters for each model nitt_list <-list(60000, 60000, 105000) thin_list <-list(10, 10, 10)# Run BACE with model-specific MCMC parametersbace_impute5 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree2,data = data2,runs =5, nitt = nitt_list, thin = thin_list, burnin =5000, species =TRUE)
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 3201.9
Variance components: 1566.4
Convergence Assessment:
Acceptable : 1 parameters (5.6%)
Check : 11 parameters (61.1%)
Fixed : 1 parameters (5.6%)
Good : 5 parameters (27.8%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[CH] y : 2/ 6 parameters good
Check convergence: (Intercept), x1, species, species2
[CH] x1 : 3/ 6 parameters good
Check convergence: (Intercept), species2
[CH] x2 : 0/ 5 parameters good (1 fixed)
Check convergence: (Intercept), x1, y, species, species2
========================================
As you can imagine, longer runtimes will mean that bace_imp will need to take longer. The diagnostics still suggest possible issues with mixing for some parameters, so we may need to further increase the number of iterations or adjust the priors to improve mixing, however, it’s still good to look at the final models to check the model visually.
5.2 Assessing Model Convergence
We need to check whether the numbers of runs we’ve done has lead to convergence of the imputation process. We can do this by looking at the trace and density plots for the parameters in the models. For example, we can check the model for y as follows:
Code
# Assess whether number of runs was enough by looking at the stability of imputed points across runs. Adjust runs if neededconverge <-assess_convergence(bace_impute2, method ="summary")plot(converge)
Code
# Now that we know that we can run BACE longer# Fit BACE with a single formulabace_impute2_2 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree,data = data,runs =15, nitt =100000, burnin =10000, thin =10)
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 5785.2
Variance components: 1996.7
Convergence Assessment:
Acceptable : 1 parameters (6.7%)
Check : 4 parameters (26.7%)
Fixed : 1 parameters (6.7%)
Good : 9 parameters (60.0%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[CH] y : 3/ 5 parameters good
Check convergence: species, units
[OK] x1 : 4/ 5 parameters good
Check convergence: species
[~] x2 : 2/ 4 parameters good (1 fixed)
Check convergence: y
========================================
Code
# Check again for convergenceconverge <-assess_convergence(bace_impute2_2, method ="summary")plot(converge)
5.3 Simple BACE Workflow
Now that you are aware of the different components of the BACE workflow, you can put it all together in a simple workflow. This will involve simulating data, running the imputation, checking model fit and convergence, and then rerunning the model say, 10 times, and simply pooling the results to account for imputation uncertainty.
=======================================================
BACE: Bayesian Analysis with Chained Equations
=======================================================
Step 1: Initial imputation for convergence assessment
-------------------------------------------------------
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 5469.5
Variance components: 2585.4
Convergence Assessment:
Acceptable : 2 parameters (11.1%)
Check : 7 parameters (38.9%)
Fixed : 1 parameters (5.6%)
Good : 8 parameters (44.4%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[OK] y : 5/ 6 parameters good
[CH] x1 : 3/ 6 parameters good
Check convergence: (Intercept), species2
[CH] x2 : 0/ 5 parameters good (1 fixed)
Check convergence: (Intercept), x1, y, species, species2
========================================
Step 2: Convergence assessment
-------------------------------------------------------
Skipping convergence retry (skip_conv = TRUE)
Proceeding with final imputation...
-------------------------------------------------------
=== Running 10 final imputation iterations ===
Run 1/10 - Imputed variable: y
Run 1/10 - Imputed variable: x1
Run 1/10 - Imputed variable: x2
Run 2/10 - Imputed variable: y
Run 2/10 - Imputed variable: x1
Run 2/10 - Imputed variable: x2
Run 3/10 - Imputed variable: y
Run 3/10 - Imputed variable: x1
Run 3/10 - Imputed variable: x2
Run 4/10 - Imputed variable: y
Run 4/10 - Imputed variable: x1
Run 4/10 - Imputed variable: x2
Run 5/10 - Imputed variable: y
Run 5/10 - Imputed variable: x1
Run 5/10 - Imputed variable: x2
Run 6/10 - Imputed variable: y
Run 6/10 - Imputed variable: x1
Run 6/10 - Imputed variable: x2
Run 7/10 - Imputed variable: y
Run 7/10 - Imputed variable: x1
Run 7/10 - Imputed variable: x2
Run 8/10 - Imputed variable: y
Run 8/10 - Imputed variable: x1
Run 8/10 - Imputed variable: x2
Run 9/10 - Imputed variable: y
Run 9/10 - Imputed variable: x1
Run 9/10 - Imputed variable: x2
Run 10/10 - Imputed variable: y
Run 10/10 - Imputed variable: x1
Run 10/10 - Imputed variable: x2
=== Final imputation complete ===
Saved 10 imputed datasets with corresponding models
Pooling posteriors with sampling: 1000 samples per imputation
=======================================================
BACE Analysis Complete!
=======================================================
Convergence achieved: FALSE
Number of attempts: 1
Final imputations: 10
Access results via:
- $pooled_models: Pooled posterior distributions
- $imputed_datasets: List of 10 imputed datasets
- $convergence: Convergence diagnostics
- $final_results: Full final imputation results
Code
# Now have a look at the final models and the pooled resultsy_model <- bace_final$pooled_models$models$ysummary(y_model)
+--------------------------------------------------------------+
| BACE Pooled MCMCglmm Summary (Imputation Uncertainty Included)|
+--------------------------------------------------------------+
Pooled from 10 imputations
Total posterior samples: 10000
(= 1000 sampled per imputation x 10 imputations)
Original samples per imputation: 9000
Note: Posterior summaries account for both estimation and imputation uncertainty.
--------------------------------------------------------------
Iterations = 1:10000
Thinning interval = 1
Sample size = 10000
DIC: 232.4397
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.4126 0.0001846 1.675 10000
~species2
post.mean l-95% CI u-95% CI eff.samp
species2 0.4637 0.0002245 1.026 10000
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.5012 0.3439 0.6629 10000
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) -0.3347 -1.1687 0.4987 10000 0.3160
x1 -0.1314 -0.4450 0.1802 10000 0.4052
x2.L 0.7096 0.2680 1.1598 8643 0.0018 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
plot(y_model)
Code
# We can now extract posteriors in the same way we would use a normal MCMCglmm model, but now we have the pooled results that account for imputation uncertainty. For example, we can extract the posterior samples for the fixed effects as follows:hist(y_model$Sol[, "x1"], main ="Posterior distribution for x1", xlab ="x1 coefficient")
Source Code
---title: "BACE: Bayesian Phylogenetic and Correlated Trait Imputation"date: "`r Sys.Date()`"author: "Daniel Noble, Szymek Drobniak, Shinichi Nakagawa"bibliography: ../bib/BACE.bibformat: html: toc: true toc-location: left toc-depth: 3 toc-title: "**Table of Contents**" output-file: "index.html" theme: cosmo embed-resources: true code-fold: show code-tools: true number-sections: true fontsize: "12" code-overflow: wrapcrossref: fig-title: Figure # (default is "Figure") tbl-title: Table # (default is "Table") title-delim: — # (default is ":") fig-prefix: Fig. # (default is "Figure") tbl-prefix: Tab. # (default is "Table")editor_options: chunk_output_type: console---# IntroductionWe provide a brief introduction to the *BACE* package for Bayesian phylogenetic and correlated trait imputation. The *BACE* package allows users to impute missing data in comparative datasets using Bayesian methods that account for phylogenetic relationships among species.# InstallationYou can install the *BACE* package from GitHub using the following command:```{r}#| label: install_bace#| echo: true#| message: false#| warning: false#| eval: true# Install from github and load#install.packages("pacman") pacman::p_load(devtools)devtools::install_github("daniel1noble/BACE") pacman::p_load(BACE, ape, phytools, MCMCglmm, dplyr, magrittr)```If you encounter any issues during installation, please ensure that you have the required dependencies installed.# DocumentationThe main functions in the *BACE* package are `bace()` and `bace_imp()`, which performs Bayesian imputation using chained equations. You can access the documentation for these functions using the following commands:```{r}#| label: bace_imp_doc#| echo: true?bace```As you will see, the function allows you to specify fixed effects, random effects, phylogenetic tree, and other parameters for the imputation process. At the bare minimum, you need to provide a formula for the fixed effects, a formula for the random effects (including phylogeny), a phylogenetic tree, and the dataset with missing values. At present, *BACE* only works with intercept-only phylogenetic and species-level random effects, but we are working on adding support for other types of random effects in the future. For most comparative datasets, this will be sufficient for modeling the phylogenetic and species-level variation in the data, but not always.# Example Usage## Simulating DataWe'll first simulate a dataset with missing values and a phylogenetic tree. We can do this using the `sim_bace()` function.```{r}#| label: simulate_data#| echo: true# Set seed for reproducibilityset.seed(123)# Simulate data. Note that threshold4 indicates an ordinal variable with 4 levels, and multinomial3 indicates a categorical variable with 3 levels. You can replace the numbers sim_data <-sim_bace(response_type ="poisson",predictor_types =c("gaussian", "binary", "threshold4", "multinomial3"),phylo_signal =c(0.55, 0.75, 0.8, 0.7, 0.7),missingness =c(0.2, 0.3, 0.4, 0.2, 0.5),n_cases =100,n_species =100)```We can then view the simulated data and phylogenetic tree because the `sim_bace()` function returns a list containing both objects. It also has a lot of other useful information about the simulation users can explore.```{r}#| label: view_simulated_data#| echo: true# Object contains both the simulated data and phylogeny data <- sim_data$data tree <- sim_data$tree# View first few rows of the datahead(data)```Here, we can see that the dataset contains lots of missing values across different variable types. Of course, normally, we would need to exclude these cases to use a complete dataset. However, with *BACE* we simply feed in the entire dataset, specify the models we want to use for imputation (or the real model of interest), and impute the missing data. ## Using *BACE* for ImputationNow that we have the data and the phylogenetic tree we can do the imputation is to use the `bace()` or `bace_imp()` function. :::::: {.callout-caution}It's very important to ensure that all variables are correctly classified in the dataset (e.g., factors for categorical variables, numeric for continuous variables, etc.) before running the imputation. *BACE* will fit different models for different variable types, and so, if they have not been correctly classified you will be modeling the variable incorrectly. If you want to see what *BACE* has classified each variable as, you can use the extract the types using the following command:```{r}#| label: check_variable_types#| echo: true #| eval: falsebace_types <- bace_impute$types```Importantly, users also need to think carefully about the distributions of the variables themselves and transform them if necessary prior to running `bace_imp` (e.g., log-transforming skewed continuous variables).:::There are two ways you can use `bace/bace_imp`. The first is to specify a list of all the formulas for each variable you would like to fit. It is important here that, if you provide a list, that any variable with missing data is found as a response variable. This must be the case because *BACE* will not automatically build models for other variables if you provide a list of formulas (but see below). For now, we'll focus on the simplified `bace` workflow because it does the full process for users, but later we will show how different components of the workflow can be run separately for more control over the imputation process.## Simple BACE WorkflowThe `bace()` function is the main function to conduct a *BACE* analysis running each of the core steps: 1) initial imputations to reach convergence in imputed data; 2) convergence assessment to ensure that imputed data has reached a stable state; 3) imputing missing data 10 times using the final model, re-fitting and estimating parameters and then pooling posteriors across the final runs. If convergence is not achieved, it will automatically restart the initial steps to allow chains to converge before running the final set of models. You can also specify the number of runs, MCMC parameters, and whether to plot the results. The final output will include pooled models that account for imputation uncertainty.```{r}#| label: simple_bace_workflow1#| echo: true#| eval: true#| warning: false#| message: false# Run the full bace workflow:bace_final1 <-bace(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree,data = data,species =FALSE,runs =15, nitt =100000, burnin =10000, thin =10, plot =TRUE, skip_conv =FALSE, sample_size =1000)```As we can see `bace` provides a comprehensive overview of the imputation process including each of the core steps. This includes the initial set of runs to allow data to converge, followed by convergence assessment checks both across runs and within parameters of models. Note that, if convergence is not achieved, the function will automatically restart the initial steps to allow chains to converge before running the final set of models. However, users should be aware that sometimes convergence fail yet convergence is likely achieved. If you think this is the case than you can do initial runs and skip the convergence assessment (see below). Note, it is also not always the case that parameter checks signal problems with the models. Users will need to explore final models carefully before adjusting MCMC parameters or the number of runs. More on this below.It's worth highlighting a few key arguments within `bace`. - First, the `skip_conv` argument allows users to skip the convergence checks and just run the imputation for the number of runs specified. This is not recommended, but it can be useful if you have a large dataset and want to run the imputation for a long time without checking convergence. - Second, the `sample_size` argument allows users to specify the number of posterior samples to pool across final runs. This can be useful if you have a large dataset and want to reduce the size of the final `bace` object.- Third, the `species` argument allows users to specify whether to include a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species). Once the imputation is complete, you can access the final pooled models and evaluate them as you would any `MCMCglmm` model fore each of the models fit. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable `y` as follows:```{r}#| label: simple_bace_workflow2#| echo: true#| eval: true#| warning: false# Now have a look at the final models and the pooled resultsy_model <- bace_final1$pooled_models$models$ysummary(y_model)plot(y_model) # We can now extract posteriors in the same way we would use a normal MCMCglmm model, but now we have the pooled results that account for imputation uncertainty. For example, we can extract the posterior samples for the fixed effects as follows:hist(y_model$Sol[, "x1"], main ="Posterior distribution for x1", xlab ="x1 coefficient")```Here, we see convergence wasn't achieved with this particular dataset after 3 attempts, and we can see in the plots of the `MCMCglmm models` that some of the variance estimates are not quite converged yet. It's best to adjust `bace` parameters to allow for longer runtimes and better mixing of the MCMC chains. For now, however, we will ignore the convergence issues for the sake of showing the functionality.Another important feature is that users can also access all the final datasets with imputed values for each run (as a list). This can be useful for checking the imputed values (i.e., cross-validation) and exploring the imputation process in more detail. You can access the imputed datasets as follows:```{r}#| label: access_imputed_datasets#| echo: true#| eval: true#| warning: false# Access the imputed datasets for each run. imputed_datasets <- bace_final1$imputed_datasets```------For example, we can run *BACE* as follows. Note that we specify a formula for each variable in the dataset that has missing data.```{r}#| label: bace_imputation#| echo: true #| eval: false# First check that all variables are correctly classified. If using sim_bace() this is done automatically, but always worth checking.str(data)# Perform BACE imputation. Here we specify models for each variable in the dataset.bace_impute <-bace_imp(fixformula =list("y ~ x1 + x2","x1 ~ x2","x2 ~ x4","x3 ~ x1 + x2","x4 ~ x1 + x2"),ran_phylo_form ="~ 1 |species",phylo = tree,data = data,runs =5, verbose =FALSE)```What *BACE* will do here is fit a `MCMCglmm` model for each of the response variables specified in the formula list. At the first iteration, it will fill in missing values for any predictors using either their mean (for continuous variables) or randomly sampled values (for categorical variables). Models will then be fit using this complete predictor set and predictions will be made for missing values of the response variable. In the next run, model predictions will be used to 'fill' in missing values in any predictors instead. This cycle will continue until the predicted values converge. This process is repeated for the number of iterations specified by the `runs` argument. All the models will include the phylogenetic random effect specified in the `ran_phylo_form` argument.The second way to use *BACE* is to just provide a model with your main variables. This may be because you have an *a priori* model in mind. As such, you can also simply feeding in a single formula as below. Here, *BACE* will automatically build models for the other variables based on their types and the variables present in the dataset. It will impute missing values in the same way as described above, but using all other variables as predictors. You can also specify interactions if you wish. We can run *BACE* in this way as follows (this time using verbose = TRUE to see more output during the imputation process):```{r}#| label: bace_imputation_simple#| echo: true# Fit BACE with a single formulabace_impute2 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree,data = data,runs =5, nitt =20000, burnin =5000, thin =10) ```As you can see from the printout above, *BACE* provides a summary of the imputation process, including the number of models fitted, parameters estimated, and effective sample sizes which evaluate how well the MCMC chains for parameters are mixing. This is important to check to ensure that the imputation is performing well. Note, it is not always the case that these checks signal problems with the models. Users will need to explore final models carefully before adjusting MCMC parameters or the number of runs. More on this below.## Evaluating Final Models*BACE* returns a list of models for each variable that was imputed. You can access these models and evaluate them as you would any `MCMCglmm` model. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable `y` as follows:```{r}#| label: evaluate_models#| echo: true# Access the model for the response variable 'y'model_y <- bace_impute2$models[["y"]] # View summary of the modelsummary(model_y)# Plot trace and density plots for model parametersplot(model_y)```We can see here that the phylogenetic variance estimate (i.e., `species` random effect) is mixing poorly. We should go back and re-run the imputation with more iterations or possibly adjust our prior to ensure better mixing for this parameter.# Phylogenetic and Non-Phylogenetic Random EffectsBACE allows users to include both phylogenetic and non-phylogenetic random effects in the imputation models. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species). In this case, users can specify a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is done by adjusting the `species = FALSE` argument in `bacse_imp`, setting it to TRUE.```{r}#| label: non_phylo_random_effects#| echo: true#| eval: true#| warning: false#| message: false# Simulate data with multiple replicates per speciesset.seed(123)sim_data2 <-sim_bace(response_type ="poisson",predictor_types =c("gaussian", "binary", "threshold4", "multinomial3"),phylo_signal =c(0.55, 0.75, 0.8, 0.7, 0.7),missingness =c(0.2, 0.3, 0.4, 0.2, 0.5),rr =TRUE,n_cases =100,n_species =20,rr_form =list(species =c("x1")))# Object contains both the simulated data and phylogeny data2 <- sim_data2$data tree2 <- sim_data2$tree# View first few rows of the datahead(data2) # We can also make a table of the number of cases per species to confirm that there are multiple replicates per speciesdim(data2)length(unique(data2$species))```We now have a dataset with 100 cases but only 20 species, meaning that there are multiple replicates per species. We can run *BACE* on this dataset while including a non-phylogenetic random effect for species identity as follows:```{r}#| label: bace_non_phylo_random_effects#| echo: true#| eval: truebace_impute3 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree2,data = data2,runs =5, nitt =20000, burnin =5000, thin =10, species =TRUE) # We can view the final model for y to check that both random effects are includedmodel_y2 <- bace_impute3$models[["y"]]summary(model_y2) ```Note that in this case we get a message indicating that both phylogenetic and non-phylogenetic random effects are being included in the models because we probably can decompose the variances. If you try and fit a dataset without multiple replicates per species while setting `species = TRUE` you will get an error because the model cannot be fit given the two parameters are not estimable.```{r}#| label: bace_non_phylo_random_effects_error#| echo: true#| eval: true#| warning: falsebace_impute4 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree2,data = data2,runs =5, nitt =20000, burnin =5000, thin =10, species =TRUE)```## Adjusting MCMC Parameters For ConvergenceAs we can see from the model summary above, some parameters may have low effective sample sizes (ESS) which indicates poor mixing of the MCMC chains. This can be a sign that the imputation is not performing well and that more iterations are needed to achieve convergence. Lets have a look:```{r}#| label: check_ess#| echo: true#| eval: truemodel_x1 <- bace_impute3$models[["x1"]]summary(model_x1) model_x2 <- bace_impute3$models[["x2"]]summary(model_x2) ```As we can see x1 and x2 models have poor mixing and low ESS for some parameters. Obviously, depending on the model complexity and data size etc, some models will need longer runtimes. This is likely because the imputation process has not yet converged. To address this, we can increase the number of iterations (i.e., `nitt`), increase the burn-in period (i.e., `burnin`), and/or adjust the thinning interval (i.e., `thin`) to improve mixing and convergence of the MCMC chains. Users can adjust the MCMC parameters (i.e., `nitt`, `thin`, and `burnin`) for each model by providing a list of values for each parameter that corresponds to the order of the formulas provided in the `fixformula` argument. For example, if you have 5 formulas in your `fixformula` list, you can provide a list of 5 values for each MCMC parameter as follows:```{r}#| label: mcmc_parameters#| echo: true#| eval: true#| warning: false#| message: false# Define MCMC parameters for each model nitt_list <-list(60000, 60000, 105000) thin_list <-list(10, 10, 10)# Run BACE with model-specific MCMC parametersbace_impute5 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree2,data = data2,runs =5, nitt = nitt_list, thin = thin_list, burnin =5000, species =TRUE)```As you can imagine, longer runtimes will mean that `bace_imp` will need to take longer. The diagnostics still suggest possible issues with mixing for some parameters, so we may need to further increase the number of iterations or adjust the priors to improve mixing, however, it's still good to look at the final models to check the model visually.## Assessing Model ConvergenceWe need to check whether the numbers of runs we've done has lead to convergence of the imputation process. We can do this by looking at the trace and density plots for the parameters in the models. For example, we can check the model for `y` as follows:```{r}#| label: check_convergence#| echo: true#| eval: true#| warning: false#| message: false# Assess whether number of runs was enough by looking at the stability of imputed points across runs. Adjust runs if neededconverge <-assess_convergence(bace_impute2, method ="summary")plot(converge)# Now that we know that we can run BACE longer# Fit BACE with a single formulabace_impute2_2 <-bace_imp(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree,data = data,runs =15, nitt =100000, burnin =10000, thin =10) # Check again for convergenceconverge <-assess_convergence(bace_impute2_2, method ="summary")plot(converge)```## Simple BACE WorkflowNow that you are aware of the different components of the *BACE* workflow, you can put it all together in a simple workflow. This will involve simulating data, running the imputation, checking model fit and convergence, and then rerunning the model say, 10 times, and simply pooling the results to account for imputation uncertainty.```{r}#| label: simple_bace_workflow#| echo: true#| eval: true#| warning: false#| message: false# Run the full bace workflow:bace_final <-bace(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree2,data = data2,species =TRUE,runs =15, nitt =100000, burnin =10000, thin =10, plot =TRUE, skip_conv =TRUE, sample_size =1000)# Now have a look at the final models and the pooled resultsy_model <- bace_final$pooled_models$models$ysummary(y_model)plot(y_model) # We can now extract posteriors in the same way we would use a normal MCMCglmm model, but now we have the pooled results that account for imputation uncertainty. For example, we can extract the posterior samples for the fixed effects as follows:hist(y_model$Sol[, "x1"], main ="Posterior distribution for x1", xlab ="x1 coefficient")```