We provide a brief introduction to the BACE package for Bayesian phylogenetic and correlated trait imputation. The BACE package allows users to impute missing data in comparative datasets using Bayesian methods that account for phylogenetic relationships among species.
2 When to use Phylo-BACE
General-purpose multiple-imputation tools such as mice, Amelia and missForest assume that rows are exchangeable: the missing value for row i is imputed using the marginal distribution of the column and the observed covariates of row i, with no structural relationship between rows. For comparative biology datasets this assumption is wrong. Closely related species tend to share trait values because they share evolutionary history, and throwing that information away means we impute a missing mouse trait as though the nearest source of information were the grand mean across 500 mammals rather than the handful of other mouse-like rodents in the tree.
Phylo-BACE keeps that information by fitting every per-variable imputation model as a Bayesian phylogenetic mixed model via MCMCglmm, with a phylogenetic random effect that borrows information between related taxa. Imputation is done via chained equations — one conditional model per variable with missing data — and final inference uses multiple independent imputations whose posterior draws are combined (see Interpreting the pooled output below).
Phylo-BACE is the right tool when:
Your rows are species (or populations nested within species) and you have a phylogenetic tree that covers them.
Your traits are likely to show phylogenetic signal — i.e. closely related species have correlated values. If you are unsure, fit a single-variable phylogenetic model to a complete trait first (e.g. phytools::phylosig() or an MCMCglmm fit with a phylo random effect) and check whether the phylogenetic variance component is non-trivial.
You want posterior inference on downstream models rather than point imputations followed by ad-hoc variance adjustments.
Phylo-BACE is probably the wrong tool when:
Rows are independent observational units with no phylogenetic structure — generic MI tools are faster and simpler.
The trait of interest has no detectable phylogenetic signal. The phylogenetic random effect is then noise that slows down MCMC without helping.
Missingness is rare and plausibly MCAR — a complete-case analysis may be defensible and much cheaper.
You need sub-second imputations in a production pipeline. Each chained-equations iteration runs an MCMC fit, so runtimes are measured in minutes, not milliseconds.
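The phylogenetic-signal check suggested above can be sketched as follows. This is a minimal sketch, assuming phytools is installed; trait is a hypothetical complete, named numeric vector whose names match the tree's tip labels.

```r
library(phytools)

# Pagel's lambda for a single complete trait; 'trait' and 'tree' are
# placeholders for your own data and ape phylo object.
sig <- phylosig(tree, trait, method = "lambda", test = TRUE)
sig$lambda  # near 0 -> little signal; near 1 -> strong signal
sig$P       # likelihood-ratio test against lambda = 0
```

If lambda is close to zero, the phylogenetic random effect is unlikely to help and a generic MI tool may be the better choice.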
For a broader discussion of the data-gap problem and imputation in comparative biology we recommend Conde et al. (2019) and Pottier et al. (2024); for Rubin’s-rules-based approaches to related problems see Nakagawa and de Villemereuil (2018).
3 Installation
You can install the BACE package from GitHub using the following command:
Code
# Install from GitHub and load
# install.packages("pacman")
pacman::p_load(devtools)
devtools::install_github("daniel1noble/BACE")
The downloaded binary packages are in
/var/folders/48/07fr3p6n4nv8gq_s3kl6vj2c0000gn/T//RtmpJYJUfd/downloaded_packages
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/private/var/folders/48/07fr3p6n4nv8gq_s3kl6vj2c0000gn/T/RtmpJYJUfd/remotes4640d85796c/daniel1noble-BACE-a707dec/DESCRIPTION’ ... OK
* preparing ‘BACE’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Removed empty directory ‘BACE/.vscode’
Removed empty directory ‘BACE/dev’
Removed empty directory ‘BACE/ms’
Removed empty directory ‘BACE/vignettes’
* building ‘BACE_0.0.0.9000.tar.gz’
If you encounter any issues during installation, please ensure that you have the required dependencies installed.
4 Documentation
The main functions in the BACE package are bace() and bace_imp(), which perform Bayesian imputation using chained equations. You can access the documentation for these functions using the following commands:
Code
?bace
?bace_imp
As you will see, the function allows you to specify fixed effects, random effects, phylogenetic tree, and other parameters for the imputation process. At the bare minimum, you need to provide a formula for the fixed effects, a formula for the random effects (including phylogeny), a phylogenetic tree, and the dataset with missing values.
At present, BACE only supports intercept-only phylogenetic and species-level random effects, but we are working on adding support for other types of random effects in the future. For most comparative datasets this will be sufficient to model the phylogenetic and species-level variation in the data, though not always.
5 The BACE workflow at a glance
bace() is the top-level wrapper. Internally it glues together four steps that you can also run individually when you need finer control:
Initial chained imputation — bace_imp(). For each variable with missing data, fit an MCMCglmm phylogenetic mixed model, impute the missing cells, move on to the next variable, and repeat for runs iterations. At each iteration the working dataset is updated on the raw scale so that later draws condition on the most recent imputations for every other variable. The first iteration warm-starts predictor NAs with marginal means (for continuous variables) or empirical draws (for categorical variables); subsequent iterations use the chained draws.
Convergence assessment of the chained loop — assess_convergence(). Check whether the sequence of imputed datasets has stabilised. This is not within-chain MCMC convergence — see Two kinds of convergence below.
Final imputation runs — bace_final_imp(). Starting from the converged dataset, launch n_final independent MCMCglmm fits per variable. Each run produces one complete imputed dataset and one per-variable posterior. Predictions are drawn from the posterior predictive distribution (not a point estimate), so the returned imputed datasets differ from run to run.
Posterior pooling — pool_posteriors(). Concatenate the per-imputation MCMC chains into a single set of posterior draws per parameter. These stacked draws approximate the marginal posterior of the analysis model integrating over the imputation distribution.
bace() runs all four steps and additionally retries step 1 with more chained iterations if step 2 fails (up to max_attempts times). The rest of this vignette walks through the workflow first with bace() and then shows how to run each step individually.
6 Example Usage
6.1 Simulating Data
We’ll first simulate a dataset with missing values and a phylogenetic tree. We can do this using the sim_bace() function.
Code
# Set seed for reproducibility
set.seed(123)

# Simulate data. Note that threshold4 indicates an ordinal variable with 4
# levels, and multinomial3 indicates a categorical variable with 3 levels.
# You can replace the numbers.
sim_data <- sim_bace(
  response_type   = "poisson",
  predictor_types = c("gaussian", "binary", "threshold4", "multinomial3"),
  phylo_signal    = c(0.55, 0.75, 0.8, 0.7, 0.7),
  missingness     = c(0.2, 0.3, 0.4, 0.2, 0.5),
  n_cases         = 100,
  n_species       = 100
)
Generated default beta_matrix with pre-set sparsity
We can then view the simulated data and phylogenetic tree, because the sim_bace() function returns a list containing both objects. The list also holds a lot of other useful information about the simulation that users can explore.
Code
# Object contains both the simulated data and phylogeny
data <- sim_data$data
tree <- sim_data$tree

# View first few rows of the data
head(data)
species y x1 x2 x3 x4
1 pJUhd 1 NA <NA> 2 A
2 h62z2 0 NA <NA> <NA> B
3 69Haj 0 -3.045276 <NA> 2 B
4 gv77g 10 NA <NA> 3 <NA>
5 Arhl9 13 -1.785486 <NA> 4 <NA>
6 wLS5r 2 NA 1 2 C
Here, we can see that the dataset contains lots of missing values across different variable types. Normally we would need to exclude these cases to obtain a complete dataset. With BACE, however, we simply feed in the entire dataset, specify the models we want to use for imputation (or the real model of interest), and impute the missing data.
6.2 Using BACE for Imputation
Now that we have the data and the phylogenetic tree, we can perform the imputation using the bace() or bace_imp() function.
Caution
It’s very important to ensure that all variables are correctly classified in the dataset (e.g., factors for categorical variables, numeric for continuous variables) before running the imputation. BACE fits different models for different variable types, so if a variable has not been correctly classified it will be modeled incorrectly.
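The reclassification step can be sketched with plain base R. The toy data frame and column names below are illustrative, not from the vignette's simulated dataset:

```r
# Toy data frame standing in for a comparative dataset (column names illustrative)
dat <- data.frame(
  x1 = c("1.2", "3.4", "0.7"),   # continuous, but read in as character
  x2 = c(0, 1, 0),               # binary, but stored as numeric
  x3 = c(1, 3, 2)                # ordinal, with numeric level codes
)

dat$x1 <- as.numeric(dat$x1)               # continuous -> numeric
dat$x2 <- as.factor(dat$x2)                # binary/categorical -> factor
dat$x3 <- factor(dat$x3, ordered = TRUE)   # ordinal -> ordered factor

str(dat)  # classes now match the intended model types
```

Running str() before and after conversion is a quick way to confirm every column will be modeled as intended.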
If you want to see what BACE has classified each variable as, you can inspect the types element of any fitted bace_imp() or bace() object. For example, after running the simple workflow below you can do:
Code
# After running bace() (see next section), inspect the inferred variable types
bace_final1$initial_results$types
Importantly, users also need to think carefully about the distributions of the variables themselves and transform them if necessary prior to running bace_imp (e.g., log-transforming skewed continuous variables).
There are two ways you can use bace/bace_imp. The first is to specify a list of formulas, one for each variable you would like to fit. If you provide a list, every variable with missing data must appear as a response variable somewhere in that list, because BACE will not automatically build models for variables you leave out (but see below).
For now, we’ll focus on the simplified bace workflow because it does the full process for users, but later we will show how different components of the workflow can be run separately for more control over the imputation process.
6.3 Simple BACE Workflow
The bace() function is the main function for conducting a BACE analysis, running each of the core steps: 1) initial imputations to reach convergence in the imputed data; 2) convergence assessment to ensure the imputed data have reached a stable state; 3) imputing missing data 10 times with the final model, re-fitting and estimating parameters, and then pooling posteriors across the final runs. If convergence is not achieved, bace() automatically restarts the initial steps to allow the chains to converge before running the final set of models. You can also specify the number of runs, MCMC parameters, and whether to plot the results. The final output includes pooled models that account for imputation uncertainty.
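The output below was produced by a call to bace() that is not shown in the rendered vignette. A sketch of what it likely looked like, reusing the data, tree and formula from the rest of the examples; the argument values here are assumptions, not taken from the source:

```r
# Hypothetical reconstruction of the missing call -- argument values assumed.
# The printed banner implies the default of 10 final imputations.
bace_final1 <- bace(
  fixformula     = "y ~ x1 + x2",   # assumed; matches the pooled model shown later
  ran_phylo_form = "~ 1 | species",
  phylo = tree,
  data  = data
)
```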
=======================================================
BACE: Bayesian Analysis with Chained Equations
=======================================================
Step 1: Initial imputation for convergence assessment
-------------------------------------------------------
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 5680.4
Variance components: 2032.9
Convergence Assessment:
Acceptable : 3 parameters (20.0%)
Check : 3 parameters (20.0%)
Fixed : 1 parameters (6.7%)
Good : 8 parameters (53.3%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[~] y : 3/ 5 parameters good
Check convergence: species
[OK] x1 : 4/ 5 parameters good
[CH] x2 : 1/ 4 parameters good (1 fixed)
Check convergence: x1, species
========================================
Step 2: Convergence assessment
-------------------------------------------------------
Step 3: Final imputation runs for posterior pooling
-------------------------------------------------------
=== Running 10 final imputation iterations ===
Run 1/10 - Imputed variable: y
Run 1/10 - Imputed variable: x1
Run 1/10 - Imputed variable: x2
Run 2/10 - Imputed variable: y
Run 2/10 - Imputed variable: x1
Run 2/10 - Imputed variable: x2
Run 3/10 - Imputed variable: y
Run 3/10 - Imputed variable: x1
Run 3/10 - Imputed variable: x2
Run 4/10 - Imputed variable: y
Run 4/10 - Imputed variable: x1
Run 4/10 - Imputed variable: x2
Run 5/10 - Imputed variable: y
Run 5/10 - Imputed variable: x1
Run 5/10 - Imputed variable: x2
Run 6/10 - Imputed variable: y
Run 6/10 - Imputed variable: x1
Run 6/10 - Imputed variable: x2
Run 7/10 - Imputed variable: y
Run 7/10 - Imputed variable: x1
Run 7/10 - Imputed variable: x2
Run 8/10 - Imputed variable: y
Run 8/10 - Imputed variable: x1
Run 8/10 - Imputed variable: x2
Run 9/10 - Imputed variable: y
Run 9/10 - Imputed variable: x1
Run 9/10 - Imputed variable: x2
Run 10/10 - Imputed variable: y
Run 10/10 - Imputed variable: x1
Run 10/10 - Imputed variable: x2
=== Final imputation complete ===
Saved 10 imputed datasets with corresponding models
Step 4: Pooling posteriors across imputations
-------------------------------------------------------
Using posterior sampling: 1000 samples per imputation
Posterior pooling complete!
Pooled 3 variable model(s) across 10 imputations
=======================================================
BACE Analysis Complete!
=======================================================
Convergence achieved: TRUE
Number of attempts: 1
Final imputations: 10
Access results via:
- $pooled_models: Pooled posterior distributions
- $imputed_datasets: List of 10 imputed datasets
- $convergence: Convergence diagnostics
- $final_results: Full final imputation results
As we can see, bace provides a comprehensive overview of the imputation process, including each of the core steps: the initial set of runs to allow the imputed data to converge, followed by convergence checks both across runs and within model parameters. If convergence is not achieved, the function automatically restarts the initial steps to allow the chains to converge before running the final set of models. However, users should be aware that the convergence check can sometimes fail even though convergence has likely been achieved. If you think this is the case, you can run the initial imputations and skip the convergence assessment (see below). Note also that the parameter checks do not always signal genuine problems with the models. Users will need to explore the final models carefully before adjusting MCMC parameters or the number of runs. More on this below.
It’s worth highlighting a few key arguments within bace.
First, the skip_conv argument allows users to skip the convergence checks and just run the imputation for the number of runs specified. This is not recommended, but it can be useful if you have a large dataset and want to run the imputation for a long time without checking convergence.
Second, the sample_size argument allows users to specify the number of posterior samples to pool across final runs. This can be useful if you have a large dataset and want to reduce the size of the final bace object.
Third, the species argument allows users to specify whether to include a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species).
Once the imputation is complete, you can access the final pooled models and evaluate them as you would any MCMCglmm model, for each of the models fit. This is important for checking model fit and convergence, along with downstream inference. For example, you can check the summary of the model for the response variable y as follows:
Code
# Extract the pooled model for the response variable using get_pooled_model()
y_model <- get_pooled_model(bace_final1, variable = "y")
summary(y_model)
+----------------------------------------------------------------+
| BACE Pooled MCMCglmm summary (stacked per-imputation draws) |
+----------------------------------------------------------------+
Stacked from 10 imputations
Total posterior draws: 10000
(= 1000 sampled per imputation x 10 imputations)
Original samples per imputation: 9000
Posterior summaries below are computed from stacked draws that
approximate p(theta | Y_obs) (Monte Carlo integration over imputed
values). Validity requires each per-imputation chain to have
converged on its own; check a subset via
plot(final$all_models[[1]]$<var>) and coda::effectiveSize().
----------------------------------------------------------------
Iterations = 1:10000
Thinning interval = 1
Sample size = 10000
DIC: 244.89
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.8586 0.0002765 1.766 9718
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.5298 0.2608 0.905 10000
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.02624 -0.79977 0.84295 10424 0.9424
x1 -0.12587 -0.38693 0.13724 9540 0.3508
x2.L 0.46222 0.18914 0.72654 10334 0.0012 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
plot(y_model)
Code
# We can now extract posteriors in the same way we would use a normal
# MCMCglmm model, but now we have the pooled results that account for
# imputation uncertainty. For example, we can extract the posterior
# samples for the fixed effects as follows:
hist(y_model$Sol[, "x1"],
     main = "Posterior distribution for x1",
     xlab = "x1 coefficient")
Code
# You can also extract all pooled models at once
all_models <- get_pooled_model(bace_final1)
names(all_models)
[1] "y" "x1" "x2"
Here, although the across-run convergence check passed on the first attempt, the diagnostics flagged several parameters, and we can see in the plots of the MCMCglmm models that some of the variance estimates are not quite converged yet. It’s best to adjust bace parameters to allow for longer runtimes and better mixing of the MCMC chains. For now, however, we will ignore the convergence issues for the sake of showing the functionality.
Another important feature is that users can also access all the final datasets with imputed values for each run. This can be useful for checking the imputed values (i.e., cross-validation) and exploring the imputation process in more detail. You can access the imputed datasets using the get_imputed_data() function, which returns them as either a list or a single stacked data frame:
Code
# Access the imputed datasets as a list (one data frame per imputation run)
imputed_datasets <- get_imputed_data(bace_final1, format = "list")
length(imputed_datasets)

# Or stack all imputed datasets into one data frame with an .imputation column
imputed_df <- get_imputed_data(bace_final1, format = "data.frame")
table(imputed_df$.imputation)
It is worth being precise about what the pooled posterior returned by bace() actually is, because the terminology in the multiple-imputation literature can trip people up.
pool_posteriors() concatenates the MCMC draws from the n_final per-imputation fits into a single set of draws per parameter. Each per-imputation chain targets the conditional posterior \(p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}^{(m)})\), where \(Y_{\text{mis}}^{(m)}\) is the m-th complete imputed dataset. Stacking these chains is a Monte Carlo approximation to the marginal posterior \[
p(\theta \mid Y_{\text{obs}}) = \int p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}})\, p(Y_{\text{mis}} \mid Y_{\text{obs}})\, dY_{\text{mis}},
\] which integrates over uncertainty in the imputed values as well as over parameter uncertainty. Posterior summaries — means, quantiles, HPD intervals — computed from the stacked draws therefore reflect both sources of uncertainty simultaneously. This is the standard Bayesian-MI combiner; see Zhou and Reiter (2010) and Chapter 18 of Gelman et al. (2013).
Two practical notes follow from this:
This is not Rubin’s rules. Rubin’s rules combine scalar point estimates and variances from each imputation using a specific within/between-imputation variance decomposition, and are the right tool when the per-imputation analysis produces frequentist estimates rather than full posterior draws. When you have posterior draws, stacking is both simpler and more faithful. You do not need, and should not apply, Rubin’s formulas on top of the bace() output.
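The contrast between stacking and Rubin's rules can be made concrete with a toy example in plain R, independent of any BACE objects. The simulated "imputations" below are illustrative only:

```r
set.seed(1)
m <- 5  # number of imputations
# Each 'imputation' yields 1000 posterior draws of one parameter; the
# per-imputation posterior means differ, mimicking between-imputation spread.
draws_per_imp <- lapply(seq_len(m), function(i) rnorm(1000, mean = rnorm(1, 0, 0.2)))

# Stacking: concatenate the draws. Summaries of 'stacked' already reflect
# both within- and between-imputation uncertainty.
stacked <- unlist(draws_per_imp)
mean(stacked)                        # pooled posterior mean
quantile(stacked, c(0.025, 0.975))   # pooled 95% interval

# Rubin's rules would instead combine scalar estimates and variances:
qbar <- mean(sapply(draws_per_imp, mean))   # pooled point estimate
ubar <- mean(sapply(draws_per_imp, var))    # within-imputation variance
b    <- var(sapply(draws_per_imp, mean))    # between-imputation variance
tot  <- ubar + (1 + 1 / m) * b              # Rubin total variance
```

With equal draws per imputation, the stacked mean equals the Rubin point estimate; the difference is that stacking gives you a full posterior rather than a point estimate plus a variance formula.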
Validity depends on per-imputation MCMC convergence. Because each per-imputation chain is supposed to target its own conditional posterior, the stacked draws inherit any bias from chains that have not converged. Before trusting pooled summaries, spot-check a handful of the per-imputation fits with standard diagnostics. For example:
Code
# Inspect MCMC diagnostics on the first per-imputation fit for 'y'
first_fit <- bace_final1$final_results$all_models[[1]]$y
plot(first_fit)
coda::effectiveSize(first_fit$Sol)
coda::autocorr.diag(first_fit$Sol)
If these look poor, increase nitt and/or burnin and re-run bace() rather than simply trusting the pooled banner.
7 Advanced: running the workflow step by step
For most users bace() is the right entry point — it runs the four-step workflow end-to-end and retries if the chained loop has not converged. But there are cases where you want finer control: you may want to skip convergence assessment, inspect intermediate objects, run more or fewer final imputations than the default, or substitute your own pooling step. Here we show how to call bace_imp(), bace_final_imp() and pool_posteriors() directly.
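The four-step pipeline can be sketched as follows. The function names come from this vignette, but the argument names for assess_convergence(), bace_final_imp() and pool_posteriors() are assumptions; check each function's help page for the real signatures.

```r
# Hypothetical sketch of running the four steps individually.
# Only bace_imp()'s arguments are shown elsewhere in this vignette;
# the downstream calls below assume they take the previous step's result.
init   <- bace_imp(fixformula     = "y ~ x1 + x2",
                   ran_phylo_form = "~ 1 | species",
                   phylo = tree, data = data, runs = 10)
conv   <- assess_convergence(init)   # across-run stability of imputed data
final  <- bace_final_imp(init)       # n_final independent fits per variable
pooled <- pool_posteriors(final)     # stack the per-imputation posteriors
```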
7.1 Controlling the imputation models with a list of formulas
The first thing you might want to control is which model is fit for each variable with missing data. bace() and bace_imp() both accept a list of formulas, one per variable with missing data. This is useful when you want different predictors for different variables (for example, when a variable should only be imputed from a subset of the others, or when you want to include an interaction in one model but not another).
Note that if you pass a list, every variable with missing data must appear as a response somewhere in the list — BACE will not auto-generate models for variables you have left out.
Code
# First check that all variables are correctly classified. If using
# sim_bace() this is done automatically, but always worth checking.
str(data)

# Perform BACE imputation. Here we specify models for each variable in the dataset.
bace_impute <- bace_imp(
  fixformula = list(
    "y ~ x1 + x2",
    "x1 ~ x2",
    "x2 ~ x4",
    "x3 ~ x1 + x2",
    "x4 ~ x1 + x2"
  ),
  ran_phylo_form = "~ 1 | species",
  phylo   = tree,
  data    = data,
  runs    = 5,
  verbose = FALSE
)
What BACE will do here is fit an MCMCglmm model for each of the response variables specified in the formula list. At the first iteration, it will fill in missing values for any predictors using either their mean (for continuous variables) or randomly sampled values (for categorical variables). Models are then fit using this complete predictor set and predictions are made for the missing values of the response variable.
In the next run, model predictions are used instead to fill in missing values in the predictors. This cycle is repeated for the number of iterations specified by the runs argument, by which point the imputed values should have stabilised. All models include the phylogenetic random effect specified in the ran_phylo_form argument.
The second way to use BACE is to provide a single model with your main variables, for example because you have an a priori model in mind. As such, you can simply feed in a single formula, as below. Here, BACE will automatically build models for the other variables based on their types and the variables present in the dataset. It will impute missing values in the same way as described above, but using all other variables as predictors. You can also specify interactions if you wish. We can run BACE in this way as follows (this time using verbose = TRUE to see more output during the imputation process):
Code
# Fit BACE with a single formula
bace_impute2 <- bace_imp(
  fixformula     = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo  = tree,
  data   = data,
  runs   = 5,
  nitt   = 20000,
  burnin = 5000,
  thin   = 10
)
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 1217.4
Variance components: 310.1
Convergence Assessment:
Acceptable : 2 parameters (13.3%)
Check : 5 parameters (33.3%)
Fixed : 1 parameters (6.7%)
Good : 7 parameters (46.7%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[CH] y : 2/ 5 parameters good
Check convergence: x2.L, species, units
[~] x1 : 3/ 5 parameters good
Check convergence: y
[~] x2 : 2/ 4 parameters good (1 fixed)
Check convergence: species
========================================
As you can see from the printout above, BACE provides a summary of the imputation process, including the number of models fitted, the parameters estimated, and effective sample sizes, which indicate how well the MCMC chains for the parameters are mixing. This is important to check to ensure that the imputation is performing well. Note that these checks do not always signal genuine problems with the models; users will need to explore the final models carefully before adjusting MCMC parameters or the number of runs. More on this below.
7.2 Evaluating Final Models
BACE returns a list of models for each variable that was imputed. You can access these models and evaluate them as you would any MCMCglmm model. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable y as follows:
Code
# Access the model for the response variable 'y' fit on the LAST chained
# iteration. bace_imp() stores these as $models_last_run (one MCMCglmm
# object per response variable with a formula).
model_y <- bace_impute2$models_last_run[["y"]]

# View summary of the model
summary(model_y)
Iterations = 5001:19991
Thinning interval = 10
Sample size = 1500
DIC: 211.5929
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.5158 0.000266 1.928 53.94
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.7228 0.2394 1.148 44.87
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 0.007307 -0.711009 0.730705 1378.7 0.979
x1 -0.023955 -0.301169 0.244285 1500.0 0.852
x2.L 0.423303 0.094137 0.744851 729.4 0.012 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Plot trace and density plots for model parameters
plot(model_y)
We can see here that the phylogenetic variance estimate (i.e., species random effect) is mixing poorly. We should go back and re-run the imputation with more iterations or possibly adjust our prior to ensure better mixing for this parameter.
8 Phylogenetic and Non-Phylogenetic Random Effects
BACE allows users to include both phylogenetic and non-phylogenetic random effects in the imputation models. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species). In this case, users can specify a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is done by setting the species argument of bace_imp() (FALSE by default) to TRUE.
Code
# Simulate data with multiple replicates per species
set.seed(123)
sim_data2 <- sim_bace(
  response_type   = "poisson",
  predictor_types = c("gaussian", "binary", "threshold4", "multinomial3"),
  phylo_signal    = c(0.55, 0.75, 0.8, 0.7, 0.7),
  missingness     = c(0.2, 0.3, 0.4, 0.2, 0.5),
  rr        = TRUE,
  n_cases   = 100,
  n_species = 20,
  rr_form   = list(species = c("x1"))
)

# Object contains both the simulated data and phylogeny
data2 <- sim_data2$data
tree2 <- sim_data2$tree

# View first few rows of the data
head(data2)
species y x1 x2 x3 x4
1 P9vI3 6 -0.7731097 1 4 C
2 eJBm5 NA NA 1 <NA> <NA>
3 P9vI3 1 NA <NA> 4 C
4 JE8PP 1 -2.9886121 0 <NA> <NA>
5 eJBm5 1 NA 1 3 <NA>
6 sA8eL 12 NA <NA> 2 <NA>
Code
# We can also make a table of the number of cases per species to confirm
# that there are multiple replicates per species
dim(data2)
[1] 100 6
Code
length(unique(data2$species))
[1] 20
We now have a dataset with 100 cases but only 20 species, meaning that there are multiple replicates per species. We can run BACE on this dataset while including a non-phylogenetic random effect for species identity as follows:
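The call that produced the output below is not shown in the rendered vignette. A sketch, assuming the same arguments as the earlier bace_imp() example but with species = TRUE; the argument values are assumptions:

```r
# Hypothetical reconstruction -- argument values assumed, not from the source
bace_impute3 <- bace_imp(
  fixformula     = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo   = tree2,
  data    = data2,
  runs    = 5,
  species = TRUE   # adds a non-phylogenetic species random effect
)
```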
Species effect decomposition enabled: 20 out of 20 species (100%) have replicated observations.
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 931.6
Variance components: 409.8
Convergence Assessment:
Check : 9 parameters (50.0%)
Fixed : 1 parameters (5.6%)
Good : 8 parameters (44.4%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[CH] y : 4/ 6 parameters good
Check convergence: species, species2
[CH] x1 : 4/ 6 parameters good
Check convergence: species, species2
[CH] x2 : 0/ 5 parameters good (1 fixed)
Check convergence: (Intercept), x1, y, species, species2
========================================
Code
# We can view the final model for y to check that both random effects are included
model_y2 <- bace_impute3$models_last_run[["y"]]
summary(model_y2)
Iterations = 5001:19991
Thinning interval = 10
Sample size = 1500
DIC: 201.5099
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.4993 0.0002282 1.931 132.8
~species2
post.mean l-95% CI u-95% CI eff.samp
species2 0.3634 0.0003098 0.8881 187.1
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.5901 0.3869 0.8028 1280
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) -0.32426 -1.23494 0.48719 1500 0.3907
x1 -0.17011 -0.48431 0.18143 1500 0.3000
x2.L 0.65040 0.06644 1.26101 1361 0.0307 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that in this case we get a message indicating that both phylogenetic and non-phylogenetic random effects are included in the models, because with replicated observations per species the two variance components can be decomposed. If you try to fit a dataset without multiple replicates per species while setting species = TRUE, you will get an error because the two variance components are not separately estimable.
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 998.5
Variance components: 471.1
Convergence Assessment:
Acceptable : 1 parameters (5.6%)
Check : 6 parameters (33.3%)
Fixed : 1 parameters (5.6%)
Good : 10 parameters (55.6%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[OK] y : 5/ 6 parameters good
[CH] x1 : 4/ 6 parameters good
Check convergence: species, species2
[CH] x2 : 1/ 5 parameters good (1 fixed)
Check convergence: x1, y, species, species2
========================================
8.1 Adjusting MCMC Parameters For Convergence
As we can see from the diagnostics summaries above, some parameters have low effective sample sizes (ESS), which indicates poor mixing of the MCMC chains. This can be a sign that the imputation is not performing well and that more iterations are needed to achieve convergence. Let's have a look at the model for x2:
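The summary that follows is for the last-run x2 model. A sketch of how to extract it, assuming the fitted object is named bace_impute3 as in the species example above:

```r
# Pull out the last-run model for x2, which the diagnostics flagged
# (object name bace_impute3 assumed from the earlier example)
model_x2 <- bace_impute3$models_last_run[["x2"]]
summary(model_x2)
```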
Iterations = 5001:19991
Thinning interval = 10
Sample size = 1500
DIC: 12.67915
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 1477 0.1075 3824 7.834
~species2
post.mean l-95% CI u-95% CI eff.samp
species2 78.54 0.0001735 527.2 12.68
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 1 1 1 0
Location effects: x2 ~ x1 + y
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 11.088 -32.021 54.202 252.345 0.508
x1 -12.725 -28.264 2.008 9.018 0.125
y 12.658 3.980 21.712 4.382 <7e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the x1 and x2 models have poor mixing and low ESS for some parameters, which suggests the imputation process has not yet converged. Depending on model complexity, data size and so on, some models will simply need longer runtimes. To address this, we can increase the number of iterations (i.e., nitt), increase the burn-in period (i.e., burnin), and/or adjust the thinning interval (i.e., thin) to improve mixing and convergence of the MCMC chains.
Users can adjust the MCMC parameters (i.e., nitt, thin, and burnin) for each model by providing a list of values, one per model, in the same order as the formulas provided in the fixformula argument. Here three imputation models are fitted (y, x1 and x2), so we provide a list of three values for each MCMC parameter as follows:
Code
# Define MCMC parameters for each model
nitt_list <- list(60000, 60000, 105000)
thin_list <- list(10, 10, 10)

# Run BACE with model-specific MCMC parameters
bace_impute5 <- bace_imp(
  fixformula     = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo   = tree2,
  data    = data2,
  runs    = 5,
  nitt    = nitt_list,
  thin    = thin_list,
  burnin  = 5000,
  species = TRUE
)
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 3065.7
Variance components: 1528.8
Convergence Assessment:
Acceptable : 4 parameters (22.2%)
Check : 5 parameters (27.8%)
Fixed : 1 parameters (5.6%)
Good : 8 parameters (44.4%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[~] y : 4/ 6 parameters good
Check convergence: (Intercept)
[~] x1 : 4/ 6 parameters good
[CH] x2 : 0/ 5 parameters good (1 fixed)
Check convergence: (Intercept), x1, y, species
========================================
As you can imagine, longer runs mean that bace_imp takes longer to complete. The diagnostics still suggest possible mixing issues for some parameters, so we may need to further increase the number of iterations or adjust the priors; even so, it is good practice to inspect the final models visually before changing anything.
8.2 Two kinds of convergence
When working with BACE it is important to keep two distinct convergence questions separate, because they are checked in different ways and the tools that flag problems in one are silent on the other.
Within-chain MCMC convergence of each per-variable MCMCglmm fit. This is the classical question: has the MCMC sampler explored its target posterior and mixed well? The right tools are the ones you already know — trace plots, effective sample size, autocorrelation, Geweke — applied to an individual MCMCglmm object. bace_imp() prints a short summary of effective sample sizes and autocorrelation for the last-iteration models when verbose = TRUE, and you can always call plot() or coda::effectiveSize() on bace_impute$models_last_run[["y"]] or on any of the per-imputation fits stored in bace_final1$final_results$all_models[[m]][["y"]]. Failures here are fixed by increasing nitt, burnin, or thin, tightening priors, or (less often) simplifying the model.
Convergence of the chained-equations loop. This is the multiple-imputation question: has the sequence of imputed datasets stabilised across chained iterations, or is it still drifting? This is not about MCMC mixing within a single fit — it is about whether the outer loop over variables has reached a stationary distribution over imputed values. The right tool for this question is assess_convergence(), which tracks summary statistics of the imputed cells across the chained iterations (controlled by runs). Failures here are fixed by increasing runs or, occasionally, by revisiting the specification of the imputation models.
bace() assesses only the second kind of convergence automatically (and retries with more runs if it fails). The first kind is your responsibility to check before trusting pooled inference — see Interpreting the pooled output above.
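A minimal sketch of the first kind of check, applied to the last-run model for y. The object name bace_impute2 follows the single-formula example used elsewhere in this vignette; MCMCglmm stores location-effect draws in $Sol and variance components in $VCV.

```r
# Within-chain diagnostics for one per-variable MCMCglmm fit
m_y <- bace_impute2$models_last_run[["y"]]

plot(m_y)                     # trace and density plots
coda::effectiveSize(m_y$Sol)  # ESS, location effects
coda::effectiveSize(m_y$VCV)  # ESS, variance components
coda::geweke.diag(m_y$Sol)    # Geweke convergence z-scores
```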
8.3 Assessing chained-equations convergence
We need to check whether the number of runs we've done has led to convergence of the imputation process. We can do this with assess_convergence(), which tracks how the imputed values stabilise across runs:
Code
# Assess whether the number of runs was enough by looking at the stability
# of imputed points across runs. Adjust runs if needed
converge <- assess_convergence(bace_impute2, method = "summary")
plot(converge)
Code
# Now that we know that, we can run BACE longer
# Fit BACE with a single formula
bace_impute2_2 <- bace_imp(
  fixformula     = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo  = tree,
  data   = data,
  runs   = 15,
  nitt   = 100000,
  burnin = 10000,
  thin   = 10
)
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 15
Fixed effects: 9
Variance components: 6
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 5626.9
Variance components: 2014.6
Convergence Assessment:
Acceptable : 4 parameters (26.7%)
Check : 1 parameters (6.7%)
Fixed : 1 parameters (6.7%)
Good : 9 parameters (60.0%)
Overall: ACCEPTABLE - Most parameters converged adequately
Convergence by Response Variable:
----------------------------------------
[~] y : 3/ 5 parameters good
[OK] x1 : 4/ 5 parameters good
[~] x2 : 2/ 4 parameters good (1 fixed)
Check convergence: species
========================================
Code
# Check again for convergence
converge <- assess_convergence(bace_impute2_2, method = "summary")
plot(converge)
8.4 Worked example: bace() with species random effects
Now that you are aware of the different components of the BACE workflow, it is useful to see them all combined in a second worked example — this time using the replicated-species dataset from the previous section so that both phylogenetic and non-phylogenetic species effects are estimated, and letting bace() handle the orchestration end-to-end.
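The output below was produced by a call of roughly the following form. This is a sketch rather than the original source: tree2 and data2 are the replicated-species tree and dataset from the previous section, and the argument values are inferred from the printed output (15 initial runs, skip_conv = TRUE, 10 final imputations, 1000 pooled samples per imputation; 9000 original samples per imputation implies nitt = 100000, burnin = 10000, thin = 10).

```r
# Sketch of the end-to-end call; settings inferred from the output below
bace_final <- bace(
  fixformula     = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo       = tree2,
  data        = data2,
  species     = TRUE,
  runs        = 15,
  nitt        = 100000,
  burnin      = 10000,
  thin        = 10,
  skip_conv   = TRUE,
  sample_size = 1000
)
```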
=======================================================
BACE: Bayesian Analysis with Chained Equations
=======================================================
Step 1: Initial imputation for convergence assessment
-------------------------------------------------------
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2
========================================
BACE MCMC Diagnostics Summary
========================================
Number of models: 3
Total parameters: 18
Fixed effects: 9
Variance components: 9
Fixed units (threshold/categorical): 1
Mean Effective Sample Size:
Fixed effects: 5219.6
Variance components: 2453.6
Convergence Assessment:
Acceptable : 3 parameters (16.7%)
Check : 5 parameters (27.8%)
Fixed : 1 parameters (5.6%)
Good : 9 parameters (50.0%)
Overall: CHECK - Many parameters show convergence issues
Consider increasing nitt, adjusting priors, or checking model specification
Convergence by Response Variable:
----------------------------------------
[OK] y : 5/ 6 parameters good
[~] x1 : 4/ 6 parameters good
[CH] x2 : 0/ 5 parameters good (1 fixed)
Check convergence: (Intercept), x1, y, species, species2
========================================
Step 2: Convergence assessment
-------------------------------------------------------
Skipping convergence retry (skip_conv = TRUE)
Proceeding with final imputation...
-------------------------------------------------------
=== Running 10 final imputation iterations ===
Run 1/10 - Imputed variable: y
Run 1/10 - Imputed variable: x1
Run 1/10 - Imputed variable: x2
Run 2/10 - Imputed variable: y
Run 2/10 - Imputed variable: x1
Run 2/10 - Imputed variable: x2
Run 3/10 - Imputed variable: y
Run 3/10 - Imputed variable: x1
Run 3/10 - Imputed variable: x2
Run 4/10 - Imputed variable: y
Run 4/10 - Imputed variable: x1
Run 4/10 - Imputed variable: x2
Run 5/10 - Imputed variable: y
Run 5/10 - Imputed variable: x1
Run 5/10 - Imputed variable: x2
Run 6/10 - Imputed variable: y
Run 6/10 - Imputed variable: x1
Run 6/10 - Imputed variable: x2
Run 7/10 - Imputed variable: y
Run 7/10 - Imputed variable: x1
Run 7/10 - Imputed variable: x2
Run 8/10 - Imputed variable: y
Run 8/10 - Imputed variable: x1
Run 8/10 - Imputed variable: x2
Run 9/10 - Imputed variable: y
Run 9/10 - Imputed variable: x1
Run 9/10 - Imputed variable: x2
Run 10/10 - Imputed variable: y
Run 10/10 - Imputed variable: x1
Run 10/10 - Imputed variable: x2
=== Final imputation complete ===
Saved 10 imputed datasets with corresponding models
Pooling posteriors with sampling: 1000 samples per imputation
=======================================================
BACE Analysis Complete!
=======================================================
Convergence achieved: FALSE
Number of attempts: 1
Final imputations: 10
Access results via:
- $pooled_models: Pooled posterior distributions
- $imputed_datasets: List of 10 imputed datasets
- $convergence: Convergence diagnostics
- $final_results: Full final imputation results
Code
# Extract the pooled model for y using the accessor function
y_model <- get_pooled_model(bace_final, variable = "y")
summary(y_model)
+----------------------------------------------------------------+
| BACE Pooled MCMCglmm summary (stacked per-imputation draws) |
+----------------------------------------------------------------+
Stacked from 10 imputations
Total posterior draws: 10000
(= 1000 sampled per imputation x 10 imputations)
Original samples per imputation: 9000
Posterior summaries below are computed from stacked draws that
approximate p(theta | Y_obs) (Monte Carlo integration over imputed
values). Validity requires each per-imputation chain to have
converged on its own; check a subset via
plot(final$all_models[[1]]$<var>) and coda::effectiveSize().
----------------------------------------------------------------
Iterations = 1:10000
Thinning interval = 1
Sample size = 10000
DIC: 232.7726
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 0.4434 0.0002206 1.706 10000
~species2
post.mean l-95% CI u-95% CI eff.samp
species2 0.4313 0.0002421 0.9809 10000
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.502 0.3573 0.6716 10000
Location effects: y ~ x1 + x2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) -0.3650 -1.1838 0.4832 10000 0.2892
x1 -0.1755 -0.4871 0.1445 10000 0.2752
x2.L 0.7058 0.2688 1.1635 10000 0.0036 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
plot(y_model)
Code
# We can now extract posteriors in the same way we would use a normal MCMCglmm
# model, but now we have draws that integrate over the imputation distribution.
# For example, we can look at the marginal posterior of the x1 coefficient:
hist(y_model$Sol[, "x1"],
     main = "Posterior distribution for x1",
     xlab = "x1 coefficient")
9 Citation and further reading
If you use BACE in a publication, please cite both the package (see citation("BACE")) and the underlying MCMCglmm engine (Hadfield and Nakagawa 2010). Phylogenetic comparative methods and phylogenetic mixed models are reviewed in Cornwell and Nakagawa (2017), de Villemereuil, Nakagawa, and Garamszegi (2014) and Halliwell, Holland, and Yates (2025). The broader motivation for imputation in comparative biology — data gaps, trait databases, and the cost of complete-case analysis — is discussed in Conde et al. (2019) and Pottier et al. (2024). For a related Rubin's-rules-based approach that handles phylogenetic uncertainty (rather than missing trait data), see Nakagawa and de Villemereuil (2018).
The Bayesian combiner used by pool_posteriors() — concatenating the per-imputation chains to approximate the marginal posterior — is discussed in Zhou and Reiter (2010) and in Chapter 18 of Gelman et al. (2013).
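Conceptually the combiner is just row-binding the per-imputation chains. A rough sketch for the location effects of one variable, using the accessor path mentioned earlier (note that pool_posteriors() additionally subsamples each chain to sample_size draws, which this sketch omits):

```r
# Stack per-imputation location-effect draws for y; summaries of the
# stacked matrix integrate over the imputation distribution
sol_list   <- lapply(bace_final$final_results$all_models,
                     function(m) m[["y"]]$Sol)
pooled_sol <- do.call(rbind, sol_list)

colMeans(pooled_sol)                              # pooled posterior means
apply(pooled_sol, 2, quantile, c(0.025, 0.975))   # pooled 95% intervals
```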
Bug reports, feature requests and worked examples are very welcome at the BACE issue tracker.
Conde, Dalia A., Johanna Staerk, Fernando Colchero, Rita Da Silva, Hugh P. Possingham, Annette Baudisch, and James W. Vaupel. 2019. “Data Gaps and Opportunities for Comparative and Conservation Biology.” PNAS 116 (10): 9658–64. https://doi.org/10.1073/pnas.1816367116.
de Villemereuil, Pierre, Shinichi Nakagawa, and L. Z. Garamszegi. 2014. “Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology.” In Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology, 287–303. https://doi.org/10.1007/978-3-662-43550-2_11.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC.
Hadfield, J. D., and S. Nakagawa. 2010. “General Quantitative Genetic Methods for Comparative Biology: Phylogenies, Taxonomies and Multi-Trait Models for Continuous and Categorical Characters.” Journal of Evolutionary Biology 23 (3): 494–508. https://doi.org/10.1111/j.1420-9101.2009.01915.x.
Halliwell, Ben, Barbara R. Holland, and Luke A. Yates. 2025. “Multi-Response Phylogenetic Mixed Models: Concepts and Application.” Biological Reviews of the Cambridge Philosophical Society 100 (3): 1294–1316. https://doi.org/10.1111/brv.70001.
Nakagawa, Shinichi, and P. de Villemereuil. 2018. “A General Method for Simultaneously Accounting for Phylogenetic and Species Sampling Uncertainty via Rubin’s Rules in Comparative Analysis.” Systematic Biology 68 (4): 632–41. https://doi.org/10.1093/sysbio/syy089.
Pottier, Patrice, Daniel W. A. Noble, Frank Seebacher, Nicholas C. Wu, Samantha Burke, Malgorzata Lagisz, Lisa E. Schwanz, Szymon M. Drobniak, and Shinichi Nakagawa. 2024. “New Horizons for Comparative Studies and Meta-Analyses.” Trends in Ecology & Evolution 39 (5): 435–45. https://doi.org/10.1016/j.tree.2023.12.004.
Zhou, Xiang, and Jerome P. Reiter. 2010. “A Note on Bayesian Inference After Multiple Imputation.” The American Statistician 64 (2): 159–63. https://doi.org/10.1198/tast.2010.09109.
Source Code
---title: "BACE: Bayesian Phylogenetic and Correlated Trait Imputation"date: "`r Sys.Date()`"author: "Daniel Noble, Szymek Drobniak, Shinichi Nakagawa"bibliography: ../bib/BACE.bibformat: html: toc: true toc-location: left toc-depth: 3 toc-title: "**Table of Contents**" output-file: "index.html" theme: cosmo embed-resources: true code-fold: show code-tools: true number-sections: true fontsize: "12" code-overflow: wrapcrossref: fig-title: Figure # (default is "Figure") tbl-title: Table # (default is "Table") title-delim: — # (default is ":") fig-prefix: Fig. # (default is "Figure") tbl-prefix: Tab. # (default is "Table")editor_options: chunk_output_type: console---# IntroductionWe provide a brief introduction to the *BACE* package for Bayesian phylogenetic and correlated trait imputation. The *BACE* package allows users to impute missing data in comparative datasets using Bayesian methods that account for phylogenetic relationships among species.# When to use Phylo-BACEGeneral-purpose multiple-imputation tools such as **mice**, **Amelia** and **missForest** assume that rows are exchangeable: the missing value for row *i* is imputed using the marginal distribution of the column and the observed covariates of row *i*, with no structural relationship between rows. For comparative biology datasets this assumption is wrong. Closely related species tend to share trait values because they share evolutionary history, and throwing that information away means we impute a missing mouse trait as though the nearest source of information were the grand mean across 500 mammals rather than the handful of other mouse-like rodents in the tree.**Phylo-BACE** keeps that information by fitting every per-variable imputation model as a Bayesian phylogenetic mixed model via `MCMCglmm`, with a phylogenetic random effect that borrows information between related taxa. 
Imputation is done via chained equations — one conditional model per variable with missing data — and final inference uses multiple independent imputations whose posterior draws are combined (see *Interpreting the pooled output* below).Phylo-BACE is the right tool when:- Your rows are species (or populations nested within species) and you have a phylogenetic tree that covers them.- Your traits are likely to show **phylogenetic signal** — i.e. closely related species have correlated values. If you are unsure, fit a single-variable phylogenetic model to a complete trait first (e.g. `phytools::phylosig()` or an `MCMCglmm` fit with a `phylo` random effect) and check whether the phylogenetic variance component is non-trivial.- You want **posterior inference** on downstream models rather than point imputations followed by ad-hoc variance adjustments.- Your missing-data pattern spans **mixed variable types** (continuous, binary, categorical, ordinal, count).Phylo-BACE is probably the wrong tool when:- Rows are independent observational units with no phylogenetic structure — generic MI tools are faster and simpler.- The trait of interest has no detectable phylogenetic signal. The phylogenetic random effect is then noise that slows down MCMC without helping.- Missingness is rare and plausibly MCAR — a complete-case analysis may be defensible and much cheaper.- You need sub-second imputations in a production pipeline. 
Each chained-equations iteration runs an MCMC fit, so runtimes are measured in minutes, not milliseconds.For a broader discussion of the data-gap problem and imputation in comparative biology we recommend @daliaa.conde2019 and @pottier2024; for Rubin's-rules-based approaches to related problems see @nakagawa2018.# InstallationYou can install the *BACE* package from GitHub using the following command:```{r}#| label: install_bace#| echo: true#| message: false#| warning: false#| eval: true# Install from github and load#install.packages("pacman") pacman::p_load(devtools)devtools::install_github("daniel1noble/BACE") pacman::p_load(BACE, ape, phytools, MCMCglmm, dplyr, magrittr)```If you encounter any issues during installation, please ensure that you have the required dependencies installed.# DocumentationThe main functions in the *BACE* package are `bace()` and `bace_imp()`, which performs Bayesian imputation using chained equations. You can access the documentation for these functions using the following commands:```{r}#| label: bace_imp_doc#| echo: true?bace```As you will see, the function allows you to specify fixed effects, random effects, phylogenetic tree, and other parameters for the imputation process. At the bare minimum, you need to provide a formula for the fixed effects, a formula for the random effects (including phylogeny), a phylogenetic tree, and the dataset with missing values. At present, *BACE* only works with intercept-only phylogenetic and species-level random effects, but we are working on adding support for other types of random effects in the future. For most comparative datasets, this will be sufficient for modeling the phylogenetic and species-level variation in the data, but not always.# The BACE workflow at a glance`bace()` is the top-level wrapper. Internally it glues together four steps that you can also run individually when you need finer control:1. **Initial chained imputation** — `bace_imp()`. 
For each variable with missing data, fit a `MCMCglmm` phylogenetic mixed model, impute the missing cells, move on to the next variable, and repeat for `runs` iterations. At each iteration the working dataset is updated on the raw scale so that later draws condition on the most recent imputations for every other variable. The first iteration warm-starts predictor NAs with marginal means (for continuous variables) or empirical draws (for categorical variables); subsequent iterations use the chained draws.2. **Convergence assessment of the chained loop** — `assess_convergence()`. Check whether the *sequence of imputed datasets* has stabilised. This is not within-chain MCMC convergence — see *Two kinds of convergence* below.3. **Final imputation runs** — `bace_final_imp()`. Starting from the converged dataset, launch `n_final` *independent* MCMCglmm fits per variable. Each run produces one complete imputed dataset and one per-variable posterior. Predictions are drawn from the posterior predictive distribution (not a point estimate), so the returned imputed datasets differ from run to run.4. **Posterior pooling** — `pool_posteriors()`. Concatenate the per-imputation MCMC chains into a single set of posterior draws per parameter. These stacked draws approximate the marginal posterior of the analysis model integrating over the imputation distribution.`bace()` runs all four steps and additionally retries step 1 with more chained iterations if step 2 fails (up to `max_attempts` times). The rest of this vignette walks through the workflow first with `bace()` and then shows how to run each step individually.# Example Usage## Simulating DataWe'll first simulate a dataset with missing values and a phylogenetic tree. We can do this using the `sim_bace()` function.```{r}#| label: simulate_data#| echo: true# Set seed for reproducibilityset.seed(123)# Simulate data. 
Note that threshold4 indicates an ordinal variable with 4 levels, and multinomial3 indicates a categorical variable with 3 levels. You can replace the numbers sim_data <-sim_bace(response_type ="poisson",predictor_types =c("gaussian", "binary", "threshold4", "multinomial3"),phylo_signal =c(0.55, 0.75, 0.8, 0.7, 0.7),missingness =c(0.2, 0.3, 0.4, 0.2, 0.5),n_cases =100,n_species =100)```We can then view the simulated data and phylogenetic tree because the `sim_bace()` function returns a list containing both objects. It also has a lot of other useful information about the simulation users can explore.```{r}#| label: view_simulated_data#| echo: true# Object contains both the simulated data and phylogeny data <- sim_data$data tree <- sim_data$tree# View first few rows of the datahead(data)```Here, we can see that the dataset contains lots of missing values across different variable types. Of course, normally, we would need to exclude these cases to use a complete dataset. However, with *BACE* we simply feed in the entire dataset, specify the models we want to use for imputation (or the real model of interest), and impute the missing data. ## Using *BACE* for ImputationNow that we have the data and the phylogenetic tree we can do the imputation is to use the `bace()` or `bace_imp()` function. :::::: {.callout-caution}It's very important to ensure that all variables are correctly classified in the dataset (e.g., factors for categorical variables, numeric for continuous variables, etc.) before running the imputation. *BACE* will fit different models for different variable types, and so, if they have not been correctly classified you will be modeling the variable incorrectly. If you want to see what *BACE* has classified each variable as, you can inspect the `types` element of any fitted `bace_imp()` or `bace()` object. 
For example, after running the simple workflow below you can do:```{r}#| label: check_variable_types#| echo: true#| eval: false# After running bace() (see next section), inspect the inferred variable typesbace_final1$initial_results$types```Importantly, users also need to think carefully about the distributions of the variables themselves and transform them if necessary prior to running `bace_imp` (e.g., log-transforming skewed continuous variables).:::There are two ways you can use `bace/bace_imp`. The first is to specify a list of all the formulas for each variable you would like to fit. It is important here that, if you provide a list, that any variable with missing data is found as a response variable. This must be the case because *BACE* will not automatically build models for other variables if you provide a list of formulas (but see below). For now, we'll focus on the simplified `bace` workflow because it does the full process for users, but later we will show how different components of the workflow can be run separately for more control over the imputation process.## Simple BACE WorkflowThe `bace()` function is the main function to conduct a *BACE* analysis running each of the core steps: 1) initial imputations to reach convergence in imputed data; 2) convergence assessment to ensure that imputed data has reached a stable state; 3) imputing missing data 10 times using the final model, re-fitting and estimating parameters and then pooling posteriors across the final runs. If convergence is not achieved, it will automatically restart the initial steps to allow chains to converge before running the final set of models. You can also specify the number of runs, MCMC parameters, and whether to plot the results. 
The final output will include pooled models that account for imputation uncertainty.```{r}#| label: simple_bace_workflow1#| echo: true#| eval: true#| warning: false#| message: false# Run the full bace workflow:bace_final1 <-bace(fixformula ="y ~ x1 + x2",ran_phylo_form ="~ 1 |species",phylo = tree,data = data,species =FALSE,runs =15, nitt =100000, burnin =10000, thin =10, plot =TRUE, skip_conv =FALSE, sample_size =1000)```As we can see `bace` provides a comprehensive overview of the imputation process including each of the core steps. This includes the initial set of runs to allow data to converge, followed by convergence assessment checks both across runs and within parameters of models. Note that, if convergence is not achieved, the function will automatically restart the initial steps to allow chains to converge before running the final set of models. However, users should be aware that sometimes convergence fail yet convergence is likely achieved. If you think this is the case than you can do initial runs and skip the convergence assessment (see below). Note, it is also not always the case that parameter checks signal problems with the models. Users will need to explore final models carefully before adjusting MCMC parameters or the number of runs. More on this below.It's worth highlighting a few key arguments within `bace`. - First, the `skip_conv` argument allows users to skip the convergence checks and just run the imputation for the number of runs specified. This is not recommended, but it can be useful if you have a large dataset and want to run the imputation for a long time without checking convergence. - Second, the `sample_size` argument allows users to specify the number of posterior samples to pool across final runs. 
This can be useful if you have a large dataset and want to reduce the size of the final `bace` object.
- Third, the `species` argument allows users to specify whether to include a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species).

Once the imputation is complete, you can access the final pooled models and evaluate each of the fitted models as you would any `MCMCglmm` model. This is important for checking model fit and convergence, as well as for downstream inference. For example, you can check the summary of the model for the response variable `y` as follows:

```{r}
#| label: simple_bace_workflow2
#| echo: true
#| eval: true
#| warning: false
# Extract the pooled model for the response variable using get_pooled_model()
y_model <- get_pooled_model(bace_final1, variable = "y")
summary(y_model)
plot(y_model)

# We can now extract posteriors in the same way we would for a normal MCMCglmm
# model, but now we have pooled results that account for imputation uncertainty.
# For example, extract the posterior samples for the x1 fixed effect:
hist(y_model$Sol[, "x1"], main = "Posterior distribution for x1", xlab = "x1 coefficient")

# You can also extract all pooled models at once
all_models <- get_pooled_model(bace_final1)
names(all_models)
```

Here, we see convergence wasn't achieved with this particular dataset after 3 attempts, and the plots of the `MCMCglmm` models show that some of the variance estimates are not quite converged yet. It's best to adjust the `bace()` parameters to allow for longer runtimes and better mixing of the MCMC chains. For now, however, we will ignore the convergence issues for the sake of demonstrating the functionality.

Another important feature is that users can also access all the final datasets with imputed values for each run.
This can be useful for checking the imputed values (i.e., cross-validation) and exploring the imputation process in more detail. You can access the imputed datasets using the `get_imputed_data()` function, which returns them as either a list or a single stacked data frame:

```{r}
#| label: access_imputed_datasets
#| echo: true
#| eval: true
#| warning: false
# Access the imputed datasets as a list (one data frame per imputation run)
imputed_datasets <- get_imputed_data(bace_final1, format = "list")
length(imputed_datasets)
head(imputed_datasets[[1]])

# Or stack all imputed datasets into one data frame with an .imputation column
imputed_df <- get_imputed_data(bace_final1, format = "data.frame")
table(imputed_df$.imputation)
```

## Interpreting the pooled output

It is worth being precise about what the pooled posterior returned by `bace()` actually is, because the terminology in the multiple-imputation literature can trip people up.

`pool_posteriors()` **concatenates** the MCMC draws from the `n_final` per-imputation fits into a single set of draws per parameter. Each per-imputation chain targets the *conditional* posterior $p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}^{(m)})$, where $Y_{\text{mis}}^{(m)}$ is the $m$-th complete imputed dataset. Stacking these chains is a Monte Carlo approximation to the *marginal* posterior

$$p(\theta \mid Y_{\text{obs}}) = \int p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}})\, p(Y_{\text{mis}} \mid Y_{\text{obs}})\, dY_{\text{mis}},$$

which integrates over uncertainty in the imputed values as well as over parameter uncertainty. Posterior summaries (means, quantiles, HPD intervals) computed from the stacked draws therefore reflect both sources of uncertainty simultaneously. This is the standard Bayesian-MI combiner; see @zhou2010 and Chapter 18 of @gelman2013.

Two practical notes follow from this:

1. 
**This is not Rubin's rules.** Rubin's rules combine *scalar* point estimates and variances from each imputation using a specific within/between-imputation variance decomposition, and they are the right tool when the per-imputation analysis produces frequentist estimates rather than full posterior draws. When you have posterior draws, stacking is both simpler and more faithful. You do not need, and should not apply, Rubin's formulas on top of the `bace()` output.
2. **Validity depends on per-imputation MCMC convergence.** Because each per-imputation chain is supposed to target its own conditional posterior, the stacked draws inherit any bias from chains that have not converged. Before trusting pooled summaries, spot-check a handful of the per-imputation fits with standard diagnostics. For example:

```{r}
#| label: check_per_imputation_mcmc
#| echo: true
#| eval: false
# Inspect MCMC diagnostics on the first per-imputation fit for 'y'
first_fit <- bace_final1$final_results$all_models[[1]]$y
plot(first_fit)
coda::effectiveSize(first_fit$Sol)
coda::autocorr.diag(first_fit$Sol)
```

If these look poor, increase `nitt` and/or `burnin` and re-run `bace()` rather than simply trusting the pooled summary.

# Advanced: running the workflow step by step

For most users `bace()` is the right entry point: it runs the full workflow end-to-end and retries if the chained loop has not converged. But there are cases where you want finer control: you may want to skip convergence assessment, inspect intermediate objects, run more or fewer final imputations than the default, or substitute your own pooling step. Here we show how to call `bace_imp()`, `bace_final_imp()` and `pool_posteriors()` directly.

## Controlling the imputation models with a list of formulas

The first thing you might want to control is which model is fit for each variable with missing data. `bace()` and `bace_imp()` both accept a *list* of formulas, one per variable with missing data.
This is useful when you want different predictors for different variables (for example, when a variable should only be imputed from a subset of the others, or when you want to include an interaction in one model but not another). Note that if you pass a list, every variable with missing data must appear as a response somewhere in the list; *BACE* will not auto-generate models for variables you have left out.

```{r}
#| label: bace_imputation
#| echo: true
#| eval: false
# First check that all variables are correctly classified. If using sim_bace()
# this is done automatically, but it is always worth checking.
str(data)

# Perform BACE imputation. Here we specify models for each variable in the dataset.
bace_impute <- bace_imp(
  fixformula = list(
    "y ~ x1 + x2",
    "x1 ~ x2",
    "x2 ~ x4",
    "x3 ~ x1 + x2",
    "x4 ~ x1 + x2"
  ),
  ran_phylo_form = "~ 1 | species",
  phylo = tree,
  data = data,
  runs = 5, verbose = FALSE
)
```

What *BACE* does here is fit an `MCMCglmm` model for each of the response variables specified in the formula list. At the first iteration, it fills in missing values for any predictors using either their mean (for continuous variables) or randomly sampled values (for categorical variables). Models are then fit using this complete predictor set, and predictions are made for the missing values of each response variable. In the next iteration, model predictions are used to fill in missing values in the predictors instead. This cycle is repeated for the number of iterations specified by the `runs` argument, with the aim that the predicted values converge. All the models will include the phylogenetic random effect specified in the `ran_phylo_form` argument.

The second way to use *BACE* is to provide just a single model with your main variables, for example because you have an *a priori* model in mind. In that case, you can simply feed in a single formula, as below.
Here, *BACE* will automatically build models for the other variables based on their types and the variables present in the dataset. It will impute missing values in the same way as described above, but using all other variables as predictors. You can also specify interactions if you wish. We can run *BACE* in this way as follows (this time with `verbose = TRUE` to see more output during the imputation process):

```{r}
#| label: bace_imputation_simple
#| echo: true
# Fit BACE with a single formula
bace_impute2 <- bace_imp(
  fixformula = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo = tree,
  data = data,
  runs = 5, nitt = 20000, burnin = 5000, thin = 10, verbose = TRUE
)
```

As you can see from the printout above, *BACE* provides a summary of the imputation process, including the number of models fitted, the parameters estimated, and effective sample sizes, which indicate how well the MCMC chains for each parameter are mixing. This is important to check to ensure that the imputation is performing well. Note that these checks do not always signal genuine problems with the models. Users will need to explore the final models carefully before adjusting MCMC parameters or the number of runs. More on this below.

## Evaluating Final Models

*BACE* returns a list of models, one for each variable that was imputed. You can access these models and evaluate them as you would any `MCMCglmm` model. This is important for checking model fit and convergence, as well as for downstream inference. For example, you can check the summary of the model for the response variable `y` as follows:

```{r}
#| label: evaluate_models
#| echo: true
# Access the model for the response variable 'y' fit on the LAST chained iteration.
# bace_imp() stores these as $models_last_run (one MCMCglmm object per
# response variable with a formula).
model_y <- bace_impute2$models_last_run[["y"]]

# View summary of the model
summary(model_y)

# Plot trace and density plots for model parameters
plot(model_y)
```

We can see here that the phylogenetic variance estimate (i.e., the `species` random effect) is mixing poorly. We should go back and re-run the imputation with more iterations, or possibly adjust our prior, to ensure better mixing for this parameter.

# Phylogenetic and Non-Phylogenetic Random Effects

BACE allows users to include both phylogenetic and non-phylogenetic random effects in the imputation models. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species). In this case, users can specify a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is done by setting the `species` argument of `bace_imp()` (which defaults to `FALSE`) to `TRUE`.

```{r}
#| label: non_phylo_random_effects
#| echo: true
#| eval: true
#| warning: false
#| message: false
# Simulate data with multiple replicates per species
set.seed(123)
sim_data2 <- sim_bace(
  response_type = "poisson",
  predictor_types = c("gaussian", "binary", "threshold4", "multinomial3"),
  phylo_signal = c(0.55, 0.75, 0.8, 0.7, 0.7),
  missingness = c(0.2, 0.3, 0.4, 0.2, 0.5),
  rr = TRUE,
  n_cases = 100,
  n_species = 20,
  rr_form = list(species = c("x1"))
)

# Object contains both the simulated data and phylogeny
data2 <- sim_data2$data
tree2 <- sim_data2$tree

# View first few rows of the data
head(data2)

# Confirm that there are multiple replicates per species
dim(data2)
length(unique(data2$species))
```

We now have a dataset with 100 cases but only 20 species, meaning that there are multiple replicates per species.
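The replication check itself is just base R. As a self-contained toy sketch (using freshly simulated species labels rather than the `data2` object, so it runs anywhere):

```r
# Toy replication check: 100 rows drawn from 20 species labels,
# mirroring the structure of the simulated data2
set.seed(123)
sp <- sample(paste0("sp", 1:20), 100, replace = TRUE)
tab <- table(sp)
length(tab)   # number of distinct species labels observed
any(tab > 1)  # TRUE: at least one species is replicated, so a
              # species-identity effect can be separated from phylogeny
```

With 100 rows and only 20 possible labels, replication is guaranteed, which is exactly the condition that `species = TRUE` relies on.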
We can run *BACE* on this dataset while including a non-phylogenetic random effect for species identity as follows:

```{r}
#| label: bace_non_phylo_random_effects
#| echo: true
#| eval: true
bace_impute3 <- bace_imp(
  fixformula = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo = tree2,
  data = data2,
  runs = 5, nitt = 20000, burnin = 5000, thin = 10, species = TRUE
)

# We can view the final model for y to check that both random effects are included
model_y2 <- bace_impute3$models_last_run[["y"]]
summary(model_y2)
```

Note that in this case we get a message indicating that both phylogenetic and non-phylogenetic random effects are being included in the models, because the within-species replication means the two variance components can be decomposed. If you try to fit a dataset without multiple replicates per species while setting `species = TRUE`, you will get an error, because the two variance components are then not separately estimable.

```{r}
#| label: bace_non_phylo_random_effects_error
#| echo: true
#| eval: true
#| error: true
#| warning: false
# Using the earlier dataset with one observation per species triggers the error
bace_impute4 <- bace_imp(
  fixformula = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo = tree,
  data = data,
  runs = 5, nitt = 20000, burnin = 5000, thin = 10, species = TRUE
)
```

## Adjusting MCMC Parameters For Convergence

As we can see from the model summary above, some parameters may have low effective sample sizes (ESS), which indicates poor mixing of the MCMC chains. This can be a sign that the imputation is not performing well and that more iterations are needed to achieve convergence. Let's have a look:

```{r}
#| label: check_ess
#| echo: true
#| eval: true
model_x1 <- bace_impute3$models_last_run[["x1"]]
summary(model_x1)

model_x2 <- bace_impute3$models_last_run[["x2"]]
summary(model_x2)
```

As we can see, the x1 and x2 models show poor mixing and low ESS for some parameters, which suggests the imputation process has not yet converged. Depending on model complexity, data size and so on, some models will simply need longer runtimes.
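To build intuition for why autocorrelated chains yield low ESS, here is a self-contained toy example using only base R and the coda package (an AR(1) series standing in for an MCMC trace; this is not BACE output):

```r
library(coda)

# Simulate a strongly autocorrelated AR(1) chain of nominal length 10000
set.seed(1)
n <- 10000
x <- numeric(n)
for (i in 2:n) x[i] <- 0.95 * x[i - 1] + rnorm(1)

# The effective sample size is far below the nominal chain length
effectiveSize(mcmc(x))

# Thinning stores fewer draws, but each retained draw is less correlated
# with its neighbours
effectiveSize(mcmc(x[seq(1, n, by = 10)]))
```

The same logic motivates raising `nitt` (and possibly `thin`) when a BACE model's ESS is low: the chain must be long enough that, after accounting for autocorrelation, enough effectively independent draws remain.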
To address this, we can increase the number of iterations (`nitt`), increase the burn-in period (`burnin`), and/or adjust the thinning interval (`thin`) to improve mixing and convergence of the MCMC chains. Users can adjust these MCMC parameters for each model by providing a list of values whose order corresponds to the order of the formulas provided in the `fixformula` argument. For example, if you have five formulas in your `fixformula` list, you can provide a list of five values for each MCMC parameter. Here we supply model-specific values:

```{r}
#| label: mcmc_parameters
#| echo: true
#| eval: true
#| warning: false
#| message: false
# Define MCMC parameters for each model
nitt_list <- list(60000, 60000, 105000)
thin_list <- list(10, 10, 10)

# Run BACE with model-specific MCMC parameters
bace_impute5 <- bace_imp(
  fixformula = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo = tree2,
  data = data2,
  runs = 5, nitt = nitt_list, thin = thin_list, burnin = 5000, species = TRUE
)
```

As you can imagine, longer chains mean that `bace_imp()` takes longer to run. The diagnostics still suggest possible mixing issues for some parameters, so we may need to increase the number of iterations further or adjust the priors to improve mixing; either way, it is still worth inspecting the final models visually.

## Two kinds of convergence

When working with *BACE* it is important to keep two distinct convergence questions separate, because they are checked in different ways, and the tools that flag problems in one are silent on the other.

1. **Within-chain MCMC convergence** of each per-variable `MCMCglmm` fit. This is the classical question: has the MCMC sampler explored its target posterior and mixed well? The right tools are the ones you already know (trace plots, effective sample size, autocorrelation, Geweke diagnostics), applied to an *individual* `MCMCglmm` object.
`bace_imp()` prints a short summary of effective sample sizes and autocorrelation for the last-iteration models when `verbose = TRUE`, and you can always call `plot()` or `coda::effectiveSize()` on `bace_impute$models_last_run[["y"]]` or on any of the per-imputation fits stored in `bace_final1$final_results$all_models[[m]][["y"]]`. Failures here are fixed by increasing `nitt`, `burnin`, or `thin`, tightening priors, or (less often) simplifying the model.

2. **Convergence of the chained-equations loop.** This is the multiple-imputation question: has the *sequence of imputed datasets* stabilised across chained iterations, or is it still drifting? This is not about MCMC mixing within a single fit; it is about whether the outer loop over variables has reached a stationary distribution over imputed values. The right tool for this question is `assess_convergence()`, which tracks summary statistics of the imputed cells across the `runs` chained iterations. Failures here are fixed by increasing `runs` or, occasionally, by revisiting the specification of the imputation models.

`bace()` assesses only the second kind of convergence automatically (and retries with more `runs` if it fails). The first kind is your responsibility to check before trusting pooled inference; see *Interpreting the pooled output* above.

## Assessing chained-equations convergence

We need to check whether the number of runs we've done has led to convergence of the imputation process. We can do this by examining the stability of the imputed values across runs with `assess_convergence()`. For example, we can check the output of the earlier single-formula run as follows:

```{r}
#| label: check_convergence
#| echo: true
#| eval: true
#| warning: false
#| message: false
# Assess whether the number of runs was enough by looking at the stability of imputed points across runs.
# Adjust runs if needed
converge <- assess_convergence(bace_impute2, method = "summary")
plot(converge)

# The diagnostics suggest we should run BACE for longer
# Fit BACE with a single formula
bace_impute2_2 <- bace_imp(
  fixformula = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo = tree,
  data = data,
  runs = 15, nitt = 100000, burnin = 10000, thin = 10
)

# Check again for convergence
converge <- assess_convergence(bace_impute2_2, method = "summary")
plot(converge)
```

## Worked example: bace() with species random effects

Now that you are aware of the different components of the *BACE* workflow, it is useful to see them all combined in a second worked example, this time using the replicated-species dataset from the previous section so that both phylogenetic and non-phylogenetic species effects are estimated, and letting `bace()` handle the orchestration end-to-end.

```{r}
#| label: simple_bace_workflow
#| echo: true
#| eval: true
#| warning: false
#| message: false
# Run the full bace workflow:
bace_final <- bace(
  fixformula = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo = tree2,
  data = data2,
  species = TRUE,
  runs = 15, nitt = 100000, burnin = 10000, thin = 10,
  plot = TRUE, skip_conv = TRUE, sample_size = 1000
)

# Extract the pooled model for y using the accessor function
y_model <- get_pooled_model(bace_final, variable = "y")
summary(y_model)
plot(y_model)

# We can now extract posteriors in the same way we would for a normal MCMCglmm
# model, but now we have draws that integrate over the imputation distribution.
# For example, we can look at the marginal posterior of the x1 coefficient:
hist(y_model$Sol[, "x1"], main = "Posterior distribution for x1", xlab = "x1 coefficient")
```

# Citation and further reading

If you use *BACE* in a publication, please cite both the package (see `citation("BACE")`) and the underlying `MCMCglmm` engine [@hadfield2010]. Phylogenetic comparative methods and phylogenetic mixed models are reviewed in @cornwell2017, @devillemereuil2014a and @halliwell2025.
The broader motivation for imputation in comparative biology (data gaps, trait databases, and the cost of complete-case analysis) is discussed in @daliaa.conde2019 and @pottier2024. For a related Rubin's-rules-based approach that handles *phylogenetic* uncertainty (rather than missing trait data), see @nakagawa2018.

The Bayesian combiner used by `pool_posteriors()` (concatenating the per-imputation chains to approximate the marginal posterior) is discussed in @zhou2010 and in Chapter 18 of @gelman2013.

Bug reports, feature requests and worked examples are very welcome at the [BACE issue tracker](https://github.com/daniel1noble/BACE/issues).

# References

::: {#refs}
:::