BACE: Bayesian Phylogenetic and Correlated Trait Imputation

Author

Daniel Noble, Szymek Drobniak, Shinichi Nakagawa

Published

April 13, 2026

1 Introduction

We provide a brief introduction to the BACE package for Bayesian phylogenetic and correlated trait imputation. The BACE package allows users to impute missing data in comparative datasets using Bayesian methods that account for phylogenetic relationships among species.

2 When to use Phylo-BACE

General-purpose multiple-imputation tools such as mice, Amelia and missForest assume that rows are exchangeable: the missing value for row i is imputed using the marginal distribution of the column and the observed covariates of row i, with no structural relationship between rows. For comparative biology datasets this assumption is wrong. Closely related species tend to share trait values because they share evolutionary history, and throwing that information away means we impute a missing mouse trait as though the nearest source of information were the grand mean across 500 mammals rather than the handful of other mouse-like rodents in the tree.

Phylo-BACE keeps that information by fitting every per-variable imputation model as a Bayesian phylogenetic mixed model via MCMCglmm, with a phylogenetic random effect that borrows information between related taxa. Imputation is done via chained equations — one conditional model per variable with missing data — and final inference uses multiple independent imputations whose posterior draws are combined (see Interpreting the pooled output below).

Phylo-BACE is the right tool when:

Your rows are species (or populations nested within species) and you have a phylogenetic tree that covers them.
Your traits are likely to show phylogenetic signal — i.e. closely related species have correlated values. If you are unsure, fit a single-variable phylogenetic model to a complete trait first (e.g. phytools::phylosig() or an MCMCglmm fit with a phylo random effect) and check whether the phylogenetic variance component is non-trivial.
You want posterior inference on downstream models rather than point imputations followed by ad-hoc variance adjustments.
Your missing-data pattern spans mixed variable types (continuous, binary, categorical, ordinal, count).

Phylo-BACE is probably the wrong tool when:

Rows are independent observational units with no phylogenetic structure — generic MI tools are faster and simpler.
The trait of interest has no detectable phylogenetic signal. The phylogenetic random effect is then noise that slows down MCMC without helping.
Missingness is rare and plausibly MCAR — a complete-case analysis may be defensible and much cheaper.
You need sub-second imputations in a production pipeline. Each chained-equations iteration runs an MCMC fit, so runtimes are measured in minutes, not milliseconds.

For a broader discussion of the data-gap problem and imputation in comparative biology we recommend Dalia A. Conde et al. (2019) and Pottier et al. (2024); for Rubin’s-rules-based approaches to related problems see Nakagawa and de Villemereuil (2018).

3 Installation

You can install the BACE package from GitHub using the following command:

Code

# Install from github and load
#install.packages("pacman")
  pacman::p_load(devtools)
devtools::install_github("daniel1noble/BACE")

cli   (3.6.5 -> 3.6.6) [CRAN]
vctrs (0.7.2 -> 0.7.3) [CRAN]


The downloaded binary packages are in
    /var/folders/48/07fr3p6n4nv8gq_s3kl6vj2c0000gn/T//RtmpJYJUfd/downloaded_packages

── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/private/var/folders/48/07fr3p6n4nv8gq_s3kl6vj2c0000gn/T/RtmpJYJUfd/remotes4640d85796c/daniel1noble-BACE-a707dec/DESCRIPTION’ ... OK
* preparing ‘BACE’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Removed empty directory ‘BACE/.vscode’
Removed empty directory ‘BACE/dev’
Removed empty directory ‘BACE/ms’
Removed empty directory ‘BACE/vignettes’
* building ‘BACE_0.0.0.9000.tar.gz’

Code

  pacman::p_load(BACE, ape, phytools, MCMCglmm, dplyr, magrittr)

If you encounter any issues during installation, please ensure that you have the required dependencies installed.

4 Documentation

The main functions in the BACE package are bace() and bace_imp(), which performs Bayesian imputation using chained equations. You can access the documentation for these functions using the following commands:

Code

?bace

As you will see, the function allows you to specify fixed effects, random effects, phylogenetic tree, and other parameters for the imputation process. At the bare minimum, you need to provide a formula for the fixed effects, a formula for the random effects (including phylogeny), a phylogenetic tree, and the dataset with missing values.

At present, BACE only works with intercept-only phylogenetic and species-level random effects, but we are working on adding support for other types of random effects in the future. For most comparative datasets, this will be sufficient for modeling the phylogenetic and species-level variation in the data, but not always.

5 The BACE workflow at a glance

bace() is the top-level wrapper. Internally it glues together four steps that you can also run individually when you need finer control:

Initial chained imputation — bace_imp(). For each variable with missing data, fit a MCMCglmm phylogenetic mixed model, impute the missing cells, move on to the next variable, and repeat for runs iterations. At each iteration the working dataset is updated on the raw scale so that later draws condition on the most recent imputations for every other variable. The first iteration warm-starts predictor NAs with marginal means (for continuous variables) or empirical draws (for categorical variables); subsequent iterations use the chained draws.
Convergence assessment of the chained loop — assess_convergence(). Check whether the sequence of imputed datasets has stabilised. This is not within-chain MCMC convergence — see Two kinds of convergence below.
Final imputation runs — bace_final_imp(). Starting from the converged dataset, launch n_final independent MCMCglmm fits per variable. Each run produces one complete imputed dataset and one per-variable posterior. Predictions are drawn from the posterior predictive distribution (not a point estimate), so the returned imputed datasets differ from run to run.
Posterior pooling — pool_posteriors(). Concatenate the per-imputation MCMC chains into a single set of posterior draws per parameter. These stacked draws approximate the marginal posterior of the analysis model integrating over the imputation distribution.

bace() runs all four steps and additionally retries step 1 with more chained iterations if step 2 fails (up to max_attempts times). The rest of this vignette walks through the workflow first with bace() and then shows how to run each step individually.

6 Example Usage

6.1 Simulating Data

We’ll first simulate a dataset with missing values and a phylogenetic tree. We can do this using the sim_bace() function.

Code

# Set seed for reproducibility
set.seed(123)

# Simulate data. Note that threshold4 indicates an ordinal variable with 4 levels, and multinomial3 indicates a categorical variable with 3 levels. You can replace the numbers 
sim_data <- sim_bace(response_type = "poisson",
                   predictor_types = c("gaussian", "binary", "threshold4", "multinomial3"),
                      phylo_signal = c(0.55, 0.75, 0.8, 0.7, 0.7),
                       missingness = c(0.2, 0.3, 0.4, 0.2, 0.5),
                           n_cases = 100,
                         n_species = 100)

Generated default beta_matrix with pre-set sparsity

Generated default beta_resp: -0.043, 0.457, -0.047, 0.178

We can then view the simulated data and phylogenetic tree because the sim_bace() function returns a list containing both objects. It also has a lot of other useful information about the simulation users can explore.

Code

# Object contains both the simulated data and phylogeny
    data <- sim_data$data
    tree <- sim_data$tree

# View first few rows of the data
    head(data)

  species  y        x1   x2   x3   x4
1   pJUhd  1        NA <NA>    2    A
2   h62z2  0        NA <NA> <NA>    B
3   69Haj  0 -3.045276 <NA>    2    B
4   gv77g 10        NA <NA>    3 <NA>
5   Arhl9 13 -1.785486 <NA>    4 <NA>
6   wLS5r  2        NA    1    2    C

Here, we can see that the dataset contains lots of missing values across different variable types. Of course, normally, we would need to exclude these cases to use a complete dataset. However, with BACE we simply feed in the entire dataset, specify the models we want to use for imputation (or the real model of interest), and impute the missing data.

6.2 Using BACE for Imputation

Now that we have the data and the phylogenetic tree we can do the imputation is to use the bace() or bace_imp() function.

Caution

It’s very important to ensure that all variables are correctly classified in the dataset (e.g., factors for categorical variables, numeric for continuous variables, etc.) before running the imputation. BACE will fit different models for different variable types, and so, if they have not been correctly classified you will be modeling the variable incorrectly.

If you want to see what BACE has classified each variable as, you can inspect the types element of any fitted bace_imp() or bace() object. For example, after running the simple workflow below you can do:

Code

# After running bace() (see next section), inspect the inferred variable types
bace_final1$initial_results$types

Importantly, users also need to think carefully about the distributions of the variables themselves and transform them if necessary prior to running bace_imp (e.g., log-transforming skewed continuous variables).

There are two ways you can use bace/bace_imp. The first is to specify a list of all the formulas for each variable you would like to fit. It is important here that, if you provide a list, that any variable with missing data is found as a response variable. This must be the case because BACE will not automatically build models for other variables if you provide a list of formulas (but see below).

For now, we’ll focus on the simplified bace workflow because it does the full process for users, but later we will show how different components of the workflow can be run separately for more control over the imputation process.

6.3 Simple BACE Workflow

The bace() function is the main function to conduct a BACE analysis running each of the core steps: 1) initial imputations to reach convergence in imputed data; 2) convergence assessment to ensure that imputed data has reached a stable state; 3) imputing missing data 10 times using the final model, re-fitting and estimating parameters and then pooling posteriors across the final runs. If convergence is not achieved, it will automatically restart the initial steps to allow chains to converge before running the final set of models. You can also specify the number of runs, MCMC parameters, and whether to plot the results. The final output will include pooled models that account for imputation uncertainty.

Code

# Run the full bace workflow:
bace_final1 <- bace(fixformula = "y ~ x1 + x2",
                     ran_phylo_form = "~ 1 |species",
                              phylo = tree,
                               data = data,
                               species = FALSE,
                               runs = 15, nitt = 100000, burnin = 10000, thin = 10, plot = TRUE, skip_conv = FALSE, sample_size = 1000)


=======================================================
BACE: Bayesian Analysis with Chained Equations
=======================================================

Step 1: Initial imputation for convergence assessment
-------------------------------------------------------
Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2

========================================
BACE MCMC Diagnostics Summary
========================================

Number of models: 3 
Total parameters: 15 
  Fixed effects: 9 
  Variance components: 6 
  Fixed units (threshold/categorical): 1 

Mean Effective Sample Size:
  Fixed effects:      5680.4
  Variance components: 2032.9

Convergence Assessment:
  Acceptable  :   3 parameters (20.0%)
  Check       :   3 parameters (20.0%)
  Fixed       :   1 parameters (6.7%)
  Good        :   8 parameters (53.3%)

Overall: CHECK - Many parameters show convergence issues
  Consider increasing nitt, adjusting priors, or checking model specification

Convergence by Response Variable:
----------------------------------------
[~] y              :  3/ 5 parameters good
    Check convergence: species
[OK] x1             :  4/ 5 parameters good
[CH] x2             :  1/ 4 parameters good (1 fixed)
    Check convergence: x1, species
========================================



Step 2: Convergence assessment
-------------------------------------------------------



Step 3: Final imputation runs for posterior pooling
-------------------------------------------------------

=== Running 10 final imputation iterations ===

Run 1/10 - Imputed variable: y
Run 1/10 - Imputed variable: x1
Run 1/10 - Imputed variable: x2
Run 2/10 - Imputed variable: y
Run 2/10 - Imputed variable: x1
Run 2/10 - Imputed variable: x2
Run 3/10 - Imputed variable: y
Run 3/10 - Imputed variable: x1
Run 3/10 - Imputed variable: x2
Run 4/10 - Imputed variable: y
Run 4/10 - Imputed variable: x1
Run 4/10 - Imputed variable: x2
Run 5/10 - Imputed variable: y
Run 5/10 - Imputed variable: x1
Run 5/10 - Imputed variable: x2
Run 6/10 - Imputed variable: y
Run 6/10 - Imputed variable: x1
Run 6/10 - Imputed variable: x2
Run 7/10 - Imputed variable: y
Run 7/10 - Imputed variable: x1
Run 7/10 - Imputed variable: x2
Run 8/10 - Imputed variable: y
Run 8/10 - Imputed variable: x1
Run 8/10 - Imputed variable: x2
Run 9/10 - Imputed variable: y
Run 9/10 - Imputed variable: x1
Run 9/10 - Imputed variable: x2
Run 10/10 - Imputed variable: y
Run 10/10 - Imputed variable: x1
Run 10/10 - Imputed variable: x2

=== Final imputation complete ===
Saved 10 imputed datasets with corresponding models


Step 4: Pooling posteriors across imputations
-------------------------------------------------------
Using posterior sampling: 1000 samples per imputation

Posterior pooling complete!
Pooled 3 variable model(s) across 10 imputations

=======================================================
BACE Analysis Complete!
=======================================================

Convergence achieved: TRUE 
Number of attempts: 1 
Final imputations: 10 

Access results via:
  - $pooled_models: Pooled posterior distributions
  - $imputed_datasets: List of 10 imputed datasets
  - $convergence: Convergence diagnostics
  - $final_results: Full final imputation results

As we can see bace provides a comprehensive overview of the imputation process including each of the core steps. This includes the initial set of runs to allow data to converge, followed by convergence assessment checks both across runs and within parameters of models. Note that, if convergence is not achieved, the function will automatically restart the initial steps to allow chains to converge before running the final set of models. However, users should be aware that sometimes convergence fail yet convergence is likely achieved. If you think this is the case than you can do initial runs and skip the convergence assessment (see below). Note, it is also not always the case that parameter checks signal problems with the models. Users will need to explore final models carefully before adjusting MCMC parameters or the number of runs. More on this below.

It’s worth highlighting a few key arguments within bace.

First, the skip_conv argument allows users to skip the convergence checks and just run the imputation for the number of runs specified. This is not recommended, but it can be useful if you have a large dataset and want to run the imputation for a long time without checking convergence.
Second, the sample_size argument allows users to specify the number of posterior samples to pool across final runs. This can be useful if you have a large dataset and want to reduce the size of the final bace object.
Third, the species argument allows users to specify whether to include a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species).

Once the imputation is complete, you can access the final pooled models and evaluate them as you would any MCMCglmm model fore each of the models fit. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable y as follows:

Code

# Extract the pooled model for the response variable using get_pooled_model()
y_model <- get_pooled_model(bace_final1, variable = "y")
summary(y_model)


+----------------------------------------------------------------+
|  BACE Pooled MCMCglmm summary (stacked per-imputation draws)    |
+----------------------------------------------------------------+

Stacked from 10 imputations
Total posterior draws: 10000 
  (= 1000 sampled per imputation x 10 imputations)
  Original samples per imputation: 9000 

Posterior summaries below are computed from stacked draws that
approximate p(theta | Y_obs) (Monte Carlo integration over imputed
values). Validity requires each per-imputation chain to have
converged on its own; check a subset via
plot(final$all_models[[1]]$<var>) and coda::effectiveSize().
----------------------------------------------------------------


 Iterations = 1:10000
 Thinning interval  = 1
 Sample size  = 10000 

 DIC: 244.89 

 G-structure:  ~species

        post.mean  l-95% CI u-95% CI eff.samp
species    0.8586 0.0002765    1.766     9718

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units    0.5298   0.2608    0.905    10000

 Location effects: y ~ x1 + x2 

            post.mean l-95% CI u-95% CI eff.samp  pMCMC   
(Intercept)   0.02624 -0.79977  0.84295    10424 0.9424   
x1           -0.12587 -0.38693  0.13724     9540 0.3508   
x2.L          0.46222  0.18914  0.72654    10334 0.0012 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

plot(y_model)

Code

# We can now extract posteriors in the same way we would use a normal MCMCglmm model, but now we have the pooled results that account for imputation uncertainty. For example, we can extract the posterior samples for the fixed effects as follows:

hist(y_model$Sol[, "x1"], main = "Posterior distribution for x1", xlab = "x1 coefficient")

Code

# You can also extract all pooled models at once
all_models <- get_pooled_model(bace_final1)
names(all_models)

[1] "y"  "x1" "x2"

Here, we see convergence wasn’t achieved with this particular dataset after 3 attempts, and we can see in the plots of the MCMCglmm models that some of the variance estimates are not quite converged yet. It’s best to adjust bace parameters to allow for longer runtimes and better mixing of the MCMC chains. For now, however, we will ignore the convergence issues for the sake of showing the functionality.

Another important feature is that users can also access all the final datasets with imputed values for each run. This can be useful for checking the imputed values (i.e., cross-validation) and exploring the imputation process in more detail. You can access the imputed datasets using the get_imputed_data() function, which returns them as either a list or a single stacked data frame:

Code

# Access the imputed datasets as a list (one data frame per imputation run)
imputed_datasets <- get_imputed_data(bace_final1, format = "list")
length(imputed_datasets)

[1] 10

Code

head(imputed_datasets[[1]])

   y        x1 x2 species
1  1 -1.881373  0   pJUhd
2  0 -3.379423  1   h62z2
3  0 -3.045276  0   69Haj
4 10 -4.499355  1   gv77g
5 13 -1.785486  1   Arhl9
6  2 -0.760707  1   wLS5r

Code

# Or stack all imputed datasets into one data frame with an .imputation column
imputed_df <- get_imputed_data(bace_final1, format = "data.frame")
table(imputed_df$.imputation)


  1   2   3   4   5   6   7   8   9  10 
100 100 100 100 100 100 100 100 100 100

6.4 Interpreting the pooled output

It is worth being precise about what the pooled posterior returned by bace() actually is, because the terminology in the multiple-imputation literature can trip people up.

pool_posteriors() concatenates the MCMC draws from the n_final per-imputation fits into a single set of draws per parameter. Each per-imputation chain targets the conditional posterior $p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}^{(m)})$, where $Y_{\text{mis}}^{(m)}$ is the m-th complete imputed dataset. Stacking these chains is a Monte Carlo approximation to the marginal posterior \[ p(\theta \mid Y_{\text{obs}}) = \int p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}})\, p(Y_{\text{mis}} \mid Y_{\text{obs}})\, dY_{\text{mis}}, \] which integrates over uncertainty in the imputed values as well as over parameter uncertainty. Posterior summaries — means, quantiles, HPD intervals — computed from the stacked draws therefore reflect both sources of uncertainty simultaneously. This is the standard Bayesian-MI combiner; see Zhou and Reiter (2010) and Chapter 18 of Gelman et al. (2013).

Two practical notes follow from this:

This is not Rubin’s rules. Rubin’s rules combine scalar point estimates and variances from each imputation using a specific within/between-imputation variance decomposition, and are the right tool when the per-imputation analysis produces frequentist estimates rather than full posterior draws. When you have posterior draws, stacking is both simpler and more faithful. You do not need, and should not apply, Rubin’s formulas on top of the bace() output.
Validity depends on per-imputation MCMC convergence. Because each per-imputation chain is supposed to target its own conditional posterior, the stacked draws inherit any bias from chains that have not converged. Before trusting pooled summaries, spot-check a handful of the per-imputation fits with standard diagnostics. For example:

Code

# Inspect MCMC diagnostics on the first per-imputation fit for 'y'
first_fit <- bace_final1$final_results$all_models[[1]]$y
plot(first_fit)
coda::effectiveSize(first_fit$Sol)
coda::autocorr.diag(first_fit$Sol)

If these look poor, increase nitt and/or burnin and re-run bace() rather than simply trusting the pooled banner.

7 Advanced: running the workflow step by step

For most users bace() is the right entry point — it runs the four-step workflow end-to-end and retries if the chained loop has not converged. But there are cases where you want finer control: you may want to skip convergence assessment, inspect intermediate objects, run more or fewer final imputations than the default, or substitute your own pooling step. Here we show how to call bace_imp(), bace_final_imp() and pool_posteriors() directly.

7.1 Controlling the imputation models with a list of formulas

The first thing you might want to control is which model is fit for each variable with missing data. bace() and bace_imp() both accept a list of formulas, one per variable with missing data. This is useful when you want different predictors for different variables (for example, when a variable should only be imputed from a subset of the others, or when you want to include an interaction in one model but not another).

Note that if you pass a list, every variable with missing data must appear as a response somewhere in the list — BACE will not auto-generate models for variables you have left out.

Code

# First check that all variables are correctly classified. If using sim_bace() this is done automatically, but always worth checking.
  str(data)

# Perform BACE imputation. Here we specify models for each variable in the dataset.
bace_impute <- bace_imp(fixformula = list("y ~ x1 + x2",
                                          "x1 ~ x2",
                                          "x2 ~ x4",
                                          "x3 ~ x1 + x2",
                                          "x4 ~ x1 + x2"),
                    ran_phylo_form = "~ 1 |species",
                             phylo = tree,
                              data = data,
                              runs = 5, verbose = FALSE)

What BACE will do here is fit a MCMCglmm model for each of the response variables specified in the formula list. At the first iteration, it will fill in missing values for any predictors using either their mean (for continuous variables) or randomly sampled values (for categorical variables). Models will then be fit using this complete predictor set and predictions will be made for missing values of the response variable.

In the next run, model predictions will be used to ‘fill’ in missing values in any predictors instead. This cycle will continue until the predicted values converge. This process is repeated for the number of iterations specified by the runs argument. All the models will include the phylogenetic random effect specified in the ran_phylo_form argument.

The second way to use BACE is to just provide a model with your main variables. This may be because you have an a priori model in mind. As such, you can also simply feeding in a single formula as below. Here, BACE will automatically build models for the other variables based on their types and the variables present in the dataset. It will impute missing values in the same way as described above, but using all other variables as predictors. You can also specify interactions if you wish. We can run BACE in this way as follows (this time using verbose = TRUE to see more output during the imputation process):

Code

# Fit BACE with a single formula
bace_impute2 <- bace_imp(fixformula = "y ~ x1 + x2",
                     ran_phylo_form = "~ 1 |species",
                              phylo = tree,
                               data = data,
                               runs = 5, nitt = 20000, burnin = 5000, thin = 10)

Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2

========================================
BACE MCMC Diagnostics Summary
========================================

Number of models: 3 
Total parameters: 15 
  Fixed effects: 9 
  Variance components: 6 
  Fixed units (threshold/categorical): 1 

Mean Effective Sample Size:
  Fixed effects:      1217.4
  Variance components: 310.1

Convergence Assessment:
  Acceptable  :   2 parameters (13.3%)
  Check       :   5 parameters (33.3%)
  Fixed       :   1 parameters (6.7%)
  Good        :   7 parameters (46.7%)

Overall: CHECK - Many parameters show convergence issues
  Consider increasing nitt, adjusting priors, or checking model specification

Convergence by Response Variable:
----------------------------------------
[CH] y              :  2/ 5 parameters good
    Check convergence: x2.L, species, units
[~] x1             :  3/ 5 parameters good
    Check convergence: y
[~] x2             :  2/ 4 parameters good (1 fixed)
    Check convergence: species
========================================

As you can see from the printout above, BACE provides a summary of the imputation process, including the number of models fitted, parameters estimated, and effective sample sizes which evaluate how well the MCMC chains for parameters are mixing. This is important to check to ensure that the imputation is performing well. Note, it is not always the case that these checks signal problems with the models. Users will need to explore final models carefully before adjusting MCMC parameters or the number of runs. More on this below.

7.2 Evaluating Final Models

BACE returns a list of models for each variable that was imputed. You can access these models and evaluate them as you would any MCMCglmm model. This is important for checking model fit and convergence along with downstream inference. For example, you can check the summary of the model for the response variable y as follows:

Code

# Access the model for the response variable 'y' fit on the LAST chained
# iteration. bace_imp() stores these as $models_last_run (one MCMCglmm
# object per response variable with a formula).
model_y <- bace_impute2$models_last_run[["y"]]

# View summary of the model
summary(model_y)


 Iterations = 5001:19991
 Thinning interval  = 10
 Sample size  = 1500 

 DIC: 211.5929 

 G-structure:  ~species

        post.mean l-95% CI u-95% CI eff.samp
species    0.5158 0.000266    1.928    53.94

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units    0.7228   0.2394    1.148    44.87

 Location effects: y ~ x1 + x2 

            post.mean  l-95% CI  u-95% CI eff.samp pMCMC  
(Intercept)  0.007307 -0.711009  0.730705   1378.7 0.979  
x1          -0.023955 -0.301169  0.244285   1500.0 0.852  
x2.L         0.423303  0.094137  0.744851    729.4 0.012 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

# Plot trace and density plots for model parameters
plot(model_y)

We can see here that the phylogenetic variance estimate (i.e., species random effect) is mixing poorly. We should go back and re-run the imputation with more iterations or possibly adjust our prior to ensure better mixing for this parameter.

8 Phylogenetic and Non-Phylogenetic Random Effects

BACE allows users to include both phylogenetic and non-phylogenetic random effects in the imputation models. This is important if multiple replicates of species are included in the dataset (e.g., multiple populations of the same species). In this case, users can specify a non-phylogenetic random effect for species identity in addition to the phylogenetic random effect. This is done by adjusting the species = FALSE argument in bacse_imp, setting it to TRUE.

Code

# Simulate data with multiple replicates per species
set.seed(123)
sim_data2 <- sim_bace(response_type = "poisson",
                   predictor_types = c("gaussian", "binary", "threshold4", "multinomial3"),
                      phylo_signal = c(0.55, 0.75, 0.8, 0.7, 0.7),
                       missingness = c(0.2, 0.3, 0.4, 0.2, 0.5),
                                rr = TRUE,
                           n_cases = 100,
                         n_species = 20,
                           rr_form = list(species = c("x1")))

# Object contains both the simulated data and phylogeny
    data2 <- sim_data2$data
    tree2 <- sim_data2$tree

# View first few rows of the data
    head(data2)

  species  y         x1   x2   x3   x4
1   P9vI3  6 -0.7731097    1    4    C
2   eJBm5 NA         NA    1 <NA> <NA>
3   P9vI3  1         NA <NA>    4    C
4   JE8PP  1 -2.9886121    0 <NA> <NA>
5   eJBm5  1         NA    1    3 <NA>
6   sA8eL 12         NA <NA>    2 <NA>

Code

# We can also make a table of the number of cases per species to confirm that there are multiple replicates per species
    dim(data2)

[1] 100   6

Code

    length(unique(data2$species))

[1] 20

We now have a dataset with 100 cases but only 20 species, meaning that there are multiple replicates per species. We can run BACE on this dataset while including a non-phylogenetic random effect for species identity as follows:

Code

bace_impute3 <- bace_imp(fixformula = "y ~ x1 + x2",
                     ran_phylo_form = "~ 1 |species",
                              phylo = tree2,
                               data = data2,
                               runs = 5, nitt = 20000, burnin = 5000, thin = 10, species = TRUE)

Species effect decomposition enabled: 20 out of 20 species (100%) have replicated observations.

Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2

========================================
BACE MCMC Diagnostics Summary
========================================

Number of models: 3 
Total parameters: 18 
  Fixed effects: 9 
  Variance components: 9 
  Fixed units (threshold/categorical): 1 

Mean Effective Sample Size:
  Fixed effects:      931.6
  Variance components: 409.8

Convergence Assessment:
  Check       :   9 parameters (50.0%)
  Fixed       :   1 parameters (5.6%)
  Good        :   8 parameters (44.4%)

Overall: CHECK - Many parameters show convergence issues
  Consider increasing nitt, adjusting priors, or checking model specification

Convergence by Response Variable:
----------------------------------------
[CH] y              :  4/ 6 parameters good
    Check convergence: species, species2
[CH] x1             :  4/ 6 parameters good
    Check convergence: species, species2
[CH] x2             :  0/ 5 parameters good (1 fixed)
    Check convergence: (Intercept), x1, y, species, species2
========================================

Code

# We can view the final model for y to check that both random effects are included
model_y2 <- bace_impute3$models_last_run[["y"]]
summary(model_y2)


 Iterations = 5001:19991
 Thinning interval  = 10
 Sample size  = 1500 

 DIC: 201.5099 

 G-structure:  ~species

        post.mean  l-95% CI u-95% CI eff.samp
species    0.4993 0.0002282    1.931    132.8

               ~species2

         post.mean  l-95% CI u-95% CI eff.samp
species2    0.3634 0.0003098   0.8881    187.1

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units    0.5901   0.3869   0.8028     1280

 Location effects: y ~ x1 + x2 

            post.mean l-95% CI u-95% CI eff.samp  pMCMC  
(Intercept)  -0.32426 -1.23494  0.48719     1500 0.3907  
x1           -0.17011 -0.48431  0.18143     1500 0.3000  
x2.L          0.65040  0.06644  1.26101     1361 0.0307 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that in this case we get a message indicating that both phylogenetic and non-phylogenetic random effects are being included in the models because we probably can decompose the variances. If you try and fit a dataset without multiple replicates per species while setting species = TRUE you will get an error because the model cannot be fit given the two parameters are not estimable.

Code

bace_impute4 <- bace_imp(fixformula = "y ~ x1 + x2",
                     ran_phylo_form = "~ 1 |species",
                              phylo = tree2,
                               data = data2,
                               runs = 5, nitt = 20000, burnin = 5000, thin = 10, species = TRUE)

Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2

========================================
BACE MCMC Diagnostics Summary
========================================

Number of models: 3 
Total parameters: 18 
  Fixed effects: 9 
  Variance components: 9 
  Fixed units (threshold/categorical): 1 

Mean Effective Sample Size:
  Fixed effects:      998.5
  Variance components: 471.1

Convergence Assessment:
  Acceptable  :   1 parameters (5.6%)
  Check       :   6 parameters (33.3%)
  Fixed       :   1 parameters (5.6%)
  Good        :  10 parameters (55.6%)

Overall: CHECK - Many parameters show convergence issues
  Consider increasing nitt, adjusting priors, or checking model specification

Convergence by Response Variable:
----------------------------------------
[OK] y              :  5/ 6 parameters good
[CH] x1             :  4/ 6 parameters good
    Check convergence: species, species2
[CH] x2             :  1/ 5 parameters good (1 fixed)
    Check convergence: x1, y, species, species2
========================================

8.1 Adjusting MCMC Parameters For Convergence

As we can see from the model summary above, some parameters may have low effective sample sizes (ESS) which indicates poor mixing of the MCMC chains. This can be a sign that the imputation is not performing well and that more iterations are needed to achieve convergence. Lets have a look:

Code

model_x1 <- bace_impute3$models_last_run[["x1"]]
summary(model_x1)


 Iterations = 5001:19991
 Thinning interval  = 10
 Sample size  = 1500 

 DIC: 90.27461 

 G-structure:  ~species

        post.mean  l-95% CI u-95% CI eff.samp
species     1.261 0.0002354     4.61    71.75

               ~species2

         post.mean  l-95% CI u-95% CI eff.samp
species2    0.8811 0.0003004     2.08     85.8

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units    0.1614   0.1001   0.2291     1500

 Location effects: x1 ~ y + x2 

            post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept)  -0.06272 -1.57428  1.22995    758.3 0.987
y            -0.01926 -0.19347  0.13887   1500.0 0.824
x2.L         -0.12949 -0.47902  0.20606   1500.0 0.471

Code

model_x2 <- bace_impute3$models_last_run[["x2"]]
summary(model_x2)


 Iterations = 5001:19991
 Thinning interval  = 10
 Sample size  = 1500 

 DIC: 12.67915 

 G-structure:  ~species

        post.mean l-95% CI u-95% CI eff.samp
species      1477   0.1075     3824    7.834

               ~species2

         post.mean  l-95% CI u-95% CI eff.samp
species2     78.54 0.0001735    527.2    12.68

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units         1        1        1        0

 Location effects: x2 ~ x1 + y 

            post.mean l-95% CI u-95% CI eff.samp  pMCMC    
(Intercept)    11.088  -32.021   54.202  252.345  0.508    
x1            -12.725  -28.264    2.008    9.018  0.125    
y              12.658    3.980   21.712    4.382 <7e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see x1 and x2 models have poor mixing and low ESS for some parameters. Obviously, depending on the model complexity and data size etc, some models will need longer runtimes. This is likely because the imputation process has not yet converged. To address this, we can increase the number of iterations (i.e., nitt), increase the burn-in period (i.e., burnin), and/or adjust the thinning interval (i.e., thin) to improve mixing and convergence of the MCMC chains.

Users can adjust the MCMC parameters (i.e., nitt, thin, and burnin) for each model by providing a list of values for each parameter that corresponds to the order of the formulas provided in the fixformula argument. For example, if you have 5 formulas in your fixformula list, you can provide a list of 5 values for each MCMC parameter as follows:

Code

# Define MCMC parameters for each model
  nitt_list <- list(60000, 60000, 105000)
  thin_list <- list(10, 10, 10)

# Run BACE with model-specific MCMC parameters
bace_impute5 <- bace_imp(fixformula = "y ~ x1 + x2",
                    ran_phylo_form = "~ 1 |species",
                             phylo = tree2,
                              data = data2,
                              runs = 5, nitt = nitt_list, thin = thin_list, burnin = 5000, species = TRUE)

Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2

========================================
BACE MCMC Diagnostics Summary
========================================

Number of models: 3 
Total parameters: 18 
  Fixed effects: 9 
  Variance components: 9 
  Fixed units (threshold/categorical): 1 

Mean Effective Sample Size:
  Fixed effects:      3065.7
  Variance components: 1528.8

Convergence Assessment:
  Acceptable  :   4 parameters (22.2%)
  Check       :   5 parameters (27.8%)
  Fixed       :   1 parameters (5.6%)
  Good        :   8 parameters (44.4%)

Overall: CHECK - Many parameters show convergence issues
  Consider increasing nitt, adjusting priors, or checking model specification

Convergence by Response Variable:
----------------------------------------
[~] y              :  4/ 6 parameters good
    Check convergence: (Intercept)
[~] x1             :  4/ 6 parameters good
[CH] x2             :  0/ 5 parameters good (1 fixed)
    Check convergence: (Intercept), x1, y, species
========================================

As you can imagine, longer runtimes will mean that bace_imp will need to take longer. The diagnostics still suggest possible issues with mixing for some parameters, so we may need to further increase the number of iterations or adjust the priors to improve mixing, however, it’s still good to look at the final models to check the model visually.

8.2 Two kinds of convergence

When working with BACE it is important to keep two distinct convergence questions separate, because they are checked in different ways and the tools that flag problems in one are silent on the other.

Within-chain MCMC convergence of each per-variable MCMCglmm fit. This is the classical question: has the MCMC sampler explored its target posterior and mixed well? The right tools are the ones you already know — trace plots, effective sample size, autocorrelation, Geweke — applied to an individual MCMCglmm object. bace_imp() prints a short summary of effective sample sizes and autocorrelation for the last-iteration models when verbose = TRUE, and you can always call plot() or coda::effectiveSize() on bace_impute$models_last_run[["y"]] or on any of the per-imputation fits stored in bace_final1$final_results$all_models[[m]][["y"]]. Failures here are fixed by increasing nitt, burnin, or thin, tightening priors, or (less often) simplifying the model.
Convergence of the chained-equations loop. This is the multiple-imputation question: has the sequence of imputed datasets stabilised across chained iterations, or is it still drifting? This is not about MCMC mixing within a single fit — it is about whether the outer loop over variables has reached a stationary distribution over imputed values. The right tool for this question is assess_convergence(), which tracks summary statistics of the imputed cells across the runs chained iterations. Failures here are fixed by increasing runs or, occasionally, by revisiting the specification of the imputation models.

bace() assesses only the second kind of convergence automatically (and retries with more runs if it fails). The first kind is your responsibility to check before trusting pooled inference — see Interpreting the pooled output above.

8.3 Assessing chained-equations convergence

We need to check whether the numbers of runs we’ve done has lead to convergence of the imputation process. We can do this by looking at the trace and density plots for the parameters in the models. For example, we can check the model for y as follows:

Code

# Assess whether number of runs was enough by looking at the stability of imputed points across runs. Adjust runs if needed
converge <- assess_convergence(bace_impute2, method = "summary")
plot(converge)

Code

# Now that we know that we can run BACE longer

# Fit BACE with a single formula
bace_impute2_2 <- bace_imp(fixformula = "y ~ x1 + x2",
                     ran_phylo_form = "~ 1 |species",
                              phylo = tree,
                               data = data,
                               runs = 15, nitt = 100000, burnin = 10000, thin = 10)

Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2

========================================
BACE MCMC Diagnostics Summary
========================================

Number of models: 3 
Total parameters: 15 
  Fixed effects: 9 
  Variance components: 6 
  Fixed units (threshold/categorical): 1 

Mean Effective Sample Size:
  Fixed effects:      5626.9
  Variance components: 2014.6

Convergence Assessment:
  Acceptable  :   4 parameters (26.7%)
  Check       :   1 parameters (6.7%)
  Fixed       :   1 parameters (6.7%)
  Good        :   9 parameters (60.0%)

Overall: ACCEPTABLE - Most parameters converged adequately

Convergence by Response Variable:
----------------------------------------
[~] y              :  3/ 5 parameters good
[OK] x1             :  4/ 5 parameters good
[~] x2             :  2/ 4 parameters good (1 fixed)
    Check convergence: species
========================================

Code

# Check again for convergence
converge <- assess_convergence(bace_impute2_2, method = "summary")
plot(converge)

8.4 Worked example: bace() with species random effects

Now that you are aware of the different components of the BACE workflow, it is useful to see them all combined in a second worked example — this time using the replicated-species dataset from the previous section so that both phylogenetic and non-phylogenetic species effects are estimated, and letting bace() handle the orchestration end-to-end.

Code

# Run the full bace workflow:
bace_final <- bace(fixformula = "y ~ x1 + x2",
                     ran_phylo_form = "~ 1 |species",
                              phylo = tree2,
                               data = data2,
                               species = TRUE,
                               runs = 15, nitt = 100000, burnin = 10000, thin = 10, plot = TRUE, skip_conv = TRUE, sample_size = 1000)


=======================================================
BACE: Bayesian Analysis with Chained Equations
=======================================================

Step 1: Initial imputation for convergence assessment
-------------------------------------------------------

Run 1, Imputed variable: y
Run 1, Imputed variable: x1
Run 1, Imputed variable: x2
Run 2, Imputed variable: y
Run 2, Imputed variable: x1
Run 2, Imputed variable: x2
Run 3, Imputed variable: y
Run 3, Imputed variable: x1
Run 3, Imputed variable: x2
Run 4, Imputed variable: y
Run 4, Imputed variable: x1
Run 4, Imputed variable: x2
Run 5, Imputed variable: y
Run 5, Imputed variable: x1
Run 5, Imputed variable: x2
Run 6, Imputed variable: y
Run 6, Imputed variable: x1
Run 6, Imputed variable: x2
Run 7, Imputed variable: y
Run 7, Imputed variable: x1
Run 7, Imputed variable: x2
Run 8, Imputed variable: y
Run 8, Imputed variable: x1
Run 8, Imputed variable: x2
Run 9, Imputed variable: y
Run 9, Imputed variable: x1
Run 9, Imputed variable: x2
Run 10, Imputed variable: y
Run 10, Imputed variable: x1
Run 10, Imputed variable: x2
Run 11, Imputed variable: y
Run 11, Imputed variable: x1
Run 11, Imputed variable: x2
Run 12, Imputed variable: y
Run 12, Imputed variable: x1
Run 12, Imputed variable: x2
Run 13, Imputed variable: y
Run 13, Imputed variable: x1
Run 13, Imputed variable: x2
Run 14, Imputed variable: y
Run 14, Imputed variable: x1
Run 14, Imputed variable: x2
Run 15, Imputed variable: y
Run 15, Imputed variable: x1
Run 15, Imputed variable: x2

========================================
BACE MCMC Diagnostics Summary
========================================

Number of models: 3 
Total parameters: 18 
  Fixed effects: 9 
  Variance components: 9 
  Fixed units (threshold/categorical): 1 

Mean Effective Sample Size:
  Fixed effects:      5219.6
  Variance components: 2453.6

Convergence Assessment:
  Acceptable  :   3 parameters (16.7%)
  Check       :   5 parameters (27.8%)
  Fixed       :   1 parameters (5.6%)
  Good        :   9 parameters (50.0%)

Overall: CHECK - Many parameters show convergence issues
  Consider increasing nitt, adjusting priors, or checking model specification

Convergence by Response Variable:
----------------------------------------
[OK] y              :  5/ 6 parameters good
[~] x1             :  4/ 6 parameters good
[CH] x2             :  0/ 5 parameters good (1 fixed)
    Check convergence: (Intercept), x1, y, species, species2
========================================



Step 2: Convergence assessment
-------------------------------------------------------



Skipping convergence retry (skip_conv = TRUE)
Proceeding with final imputation...
-------------------------------------------------------

=== Running 10 final imputation iterations ===

Run 1/10 - Imputed variable: y
Run 1/10 - Imputed variable: x1
Run 1/10 - Imputed variable: x2
Run 2/10 - Imputed variable: y
Run 2/10 - Imputed variable: x1
Run 2/10 - Imputed variable: x2
Run 3/10 - Imputed variable: y
Run 3/10 - Imputed variable: x1
Run 3/10 - Imputed variable: x2
Run 4/10 - Imputed variable: y
Run 4/10 - Imputed variable: x1
Run 4/10 - Imputed variable: x2
Run 5/10 - Imputed variable: y
Run 5/10 - Imputed variable: x1
Run 5/10 - Imputed variable: x2
Run 6/10 - Imputed variable: y
Run 6/10 - Imputed variable: x1
Run 6/10 - Imputed variable: x2
Run 7/10 - Imputed variable: y
Run 7/10 - Imputed variable: x1
Run 7/10 - Imputed variable: x2
Run 8/10 - Imputed variable: y
Run 8/10 - Imputed variable: x1
Run 8/10 - Imputed variable: x2
Run 9/10 - Imputed variable: y
Run 9/10 - Imputed variable: x1
Run 9/10 - Imputed variable: x2
Run 10/10 - Imputed variable: y
Run 10/10 - Imputed variable: x1
Run 10/10 - Imputed variable: x2

=== Final imputation complete ===
Saved 10 imputed datasets with corresponding models

Pooling posteriors with sampling: 1000 samples per imputation

=======================================================
BACE Analysis Complete!
=======================================================

Convergence achieved: FALSE 
Number of attempts: 1 
Final imputations: 10 

Access results via:
  - $pooled_models: Pooled posterior distributions
  - $imputed_datasets: List of 10 imputed datasets
  - $convergence: Convergence diagnostics
  - $final_results: Full final imputation results

Code

# Extract the pooled model for y using the accessor function
y_model <- get_pooled_model(bace_final, variable = "y")
summary(y_model)


+----------------------------------------------------------------+
|  BACE Pooled MCMCglmm summary (stacked per-imputation draws)    |
+----------------------------------------------------------------+

Stacked from 10 imputations
Total posterior draws: 10000 
  (= 1000 sampled per imputation x 10 imputations)
  Original samples per imputation: 9000 

Posterior summaries below are computed from stacked draws that
approximate p(theta | Y_obs) (Monte Carlo integration over imputed
values). Validity requires each per-imputation chain to have
converged on its own; check a subset via
plot(final$all_models[[1]]$<var>) and coda::effectiveSize().
----------------------------------------------------------------


 Iterations = 1:10000
 Thinning interval  = 1
 Sample size  = 10000 

 DIC: 232.7726 

 G-structure:  ~species

        post.mean  l-95% CI u-95% CI eff.samp
species    0.4434 0.0002206    1.706    10000

               ~species2

         post.mean  l-95% CI u-95% CI eff.samp
species2    0.4313 0.0002421   0.9809    10000

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units     0.502   0.3573   0.6716    10000

 Location effects: y ~ x1 + x2 

            post.mean l-95% CI u-95% CI eff.samp  pMCMC   
(Intercept)   -0.3650  -1.1838   0.4832    10000 0.2892   
x1            -0.1755  -0.4871   0.1445    10000 0.2752   
x2.L           0.7058   0.2688   1.1635    10000 0.0036 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

plot(y_model)

Code

# We can now extract posteriors in the same way we would use a normal MCMCglmm model,
# but now we have draws that integrate over the imputation distribution. For example,
# we can look at the marginal posterior of the x1 coefficient:

hist(y_model$Sol[, "x1"], main = "Posterior distribution for x1", xlab = "x1 coefficient")

9 Citation and further reading

If you use BACE in a publication, please cite both the package (see citation("BACE")) and the underlying MCMCglmm engine (Hadfield and Nakagawa 2010). Phylogenetic comparative methods and phylogenetic mixed models are reviewed in Cornwell and Nakagawa (2017), de Villemereuil, Nakagawa, and Garamszegi (2014) and Halliwell, Holland, and Yates (2025). The broader motivation for imputation in comparative biology — data gaps, trait databases, and the cost of complete-case analysis — is discussed in Dalia A. Conde et al. (2019) and Pottier et al. (2024). For a related Rubin’s-rules-based approach that handles phylogenetic uncertainty (rather than missing trait data), see Nakagawa and de Villemereuil (2018).

The Bayesian combiner used by pool_posteriors() — concatenating the per-imputation chains to approximate the marginal posterior — is discussed in Zhou and Reiter (2010) and in Chapter 18 of Gelman et al. (2013).

Bug reports, feature requests and worked examples are very welcome at the BACE issue tracker.

10 References

Cornwell, Will, and Shinichi Nakagawa. 2017. “Phylogenetic Comparative Methods.” Current Biology: CB 27 (9): R333–36. https://doi.org/10.1016/j.cub.2017.03.049.

Dalia A. Conde, Johanna Staerk, Fernando Colchero, Rita Da Silva, Hugh P. Possingham, Annette Baudisch, and James W. Vaupel. 2019. “Data Gaps and Opportunities for Comparative and Conservation Biology.” Pnas 116 (10): 9658–64. https://doi.org/10.1073/pnas.1816367116.

de Villemereuil, Pierre, Shinichi Nakagawa, and L Z Garamszegi. 2014. “Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology.” Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology, 287–303. https://doi.org/10.1007/978-3-662-43550-2_11.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC.

Hadfield, J D, and S Nakagawa. 2010. “General Quantitative Genetic Methods for Comparative Biology: Phylogenies, Taxonomies and Multi-Trait Models for Continuous and Categorical Characters.” Journal of Evolutionary Biology 23 (3): 494–508. https://doi.org/10.1111/j.1420-9101.2009.01915.x.

Halliwell, Ben, Barbara R Holland, and Luke A Yates. 2025. “Multi-Response Phylogenetic Mixed Models: Concepts and Application.” Biological Reviews of the Cambridge Philosophical Society 100 (3): 1294–1316. https://doi.org/10.1111/brv.70001.

Nakagawa, Shinichi, and P de Villemereuil. 2018. “A General Method for Simultaneously Accounting for Phylogenetic and Species Sampling Uncertainty via Rubin’s Rules in Comparative Analysis.” Systematic Biology 68 (4): 632–41. https://doi.org/10.1093/sysbio/syy089.

Pottier, Patrice, Daniel W A Noble, Frank Seebacher, Nicholas C Wu, Samantha Burke, Malgorzata Lagisz, Lisa E Schwanz, Szymon M Drobniak, and Shinichi Nakagawa. 2024. “New Horizons for Comparative Studies and Meta-Analyses.” Trends in Ecology & Evolution 39 (5): 435–45. https://doi.org/10.1016/j.tree.2023.12.004.

Zhou, Xiang, and Jerome P. Reiter. 2010. “A Note on Bayesian Inference After Multiple Imputation.” The American Statistician 64 (2): 159–63. https://doi.org/10.1198/tast.2010.09109.