VBayesMM

Variational Bayesian Neural Network to Prioritize Important Relationships of High-Dimensional Microbiome Multiomics Data

Novel Methodology

Performance Metrics
Production Ready
  • 34% SMAPE score: lowest validation error achieved
  • 2,257 metabolite abundances: highest number of metabolites predicted
  • 57,702 microbial taxa: maximum feature dimensionality
  • O(n log n) complexity: optimized ELBO computation

Method Performance

| Method | Accuracy (SMAPE) | Feature Selection | Uncertainty Quantification | Computational Speed | Interpretability |
|---|---|---|---|---|---|
| VBayesMM | 34.7% - 55.1% (best across all datasets) | Probabilistic (spike-and-slab prior) | Bayesian posterior (full uncertainty quantification) | 1.5h - 120h+ (moderate; scales with complexity) | Selection probabilities (highly interpretable) |
| MiMeNet | 38.8% - 60.1% (second-best performance) | L2 regularization (standard MLP approach) | Point estimates (no uncertainty) | 1.2h - 99h (fastest among neural methods) | Limited context (hyperparameter dependent) |
| MMvec | 47.6% - 76.5% (third-place performance) | Equal treatment (no feature selection) | MAP estimates (no uncertainty) | 1.5h - 120h+ (similar to VBayesMM) | Black box (limited interpretability) |
| sPLS | 67.5% - 93.1% (poorest performance) | L1/L2 regularization (manual tuning required) | No uncertainty (deterministic output) | Slowest (poor scalability) | Linear assumptions (limited to linear relationships) |

CORE INNOVATION

Spike-and-Slab Priors for Variational Bayesian Neural Networks

Problem Formulation

Given a paired microbiome-metabolome dataset D = [X, Y] with K samples, N taxonomic units, and M metabolites, VBayesMM learns the conditional probability of observing metabolite j given microbe i:

$$p(y_j | x_i) = \frac{\exp(\mathbf{u}_i^T \mathbf{v}_j + b_j)}{\sum_{j'=1}^M \exp(\mathbf{u}_i^T \mathbf{v}_{j'} + b_{j'})}$$

Where $$\mathbf{u}_i, \mathbf{v}_j \in \mathbb{R}^L$$ are latent embeddings for microbes and metabolites respectively.
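
As a rough illustration of the formula above, the following NumPy sketch evaluates the softmax over metabolites for every microbe at once. The function name `conditional_probs` and the toy sizes are assumptions made for this example and are not part of the VBayesMM package.

```python
import numpy as np

def conditional_probs(U, V, b):
    """Return an (N, M) matrix whose row i holds p(y_j | x_i) for all metabolites j."""
    logits = U @ V.T + b                                  # entry (i, j) is u_i^T v_j + b_j
    logits = logits - logits.max(axis=1, keepdims=True)   # shift rows for numerical stability
    expo = np.exp(logits)
    return expo / expo.sum(axis=1, keepdims=True)         # normalise over metabolites

rng = np.random.default_rng(0)
N, M, L = 5, 4, 3                 # toy sizes: taxa, metabolites, latent dimensions
U = rng.normal(size=(N, L))       # microbe embeddings u_i
V = rng.normal(size=(M, L))       # metabolite embeddings v_j
b = rng.normal(size=M)            # metabolite biases b_j
P = conditional_probs(U, V, b)
```

Each row of `P` is a probability distribution over the M metabolites, which is what the denominator of the formula enforces.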

Spike-and-Slab Prior

The core innovation lies in modeling each taxonomic unit with a mixture distribution that probabilistically determines feature relevance:

$$p(\mathbf{u}_i | \gamma_i) = \gamma_i \mathcal{N}(\mathbf{u}_i | \mathbf{0}, \sigma^2_{\text{slab}} \mathbf{I}) + (1 - \gamma_i) \delta_0(\mathbf{u}_i)$$

Where:

  • $$\gamma_i \in \{0,1\}$$ acts as a binary selector
  • $$\delta_0$$ represents the Dirac spike at zero for irrelevant features
  • $$\sigma^2_{\text{slab}}$$ controls the variance of important features
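
To make the mixture concrete, here is a minimal sketch that draws embeddings from a spike-and-slab prior. The prior inclusion probability `pi` is an assumption introduced for the example (the text above does not specify a prior on the selectors), and `sample_spike_and_slab` is a hypothetical helper, not a VBayesMM function.

```python
import numpy as np

def sample_spike_and_slab(n_taxa, latent_dim, pi=0.2, sigma_slab=1.0, seed=0):
    """Draw u_i from gamma_i * N(0, sigma_slab^2 I) + (1 - gamma_i) * delta_0."""
    rng = np.random.default_rng(seed)
    gamma = rng.random(n_taxa) < pi                                  # binary selectors (pi is assumed)
    slab = rng.normal(0.0, sigma_slab, size=(n_taxa, latent_dim))   # slab draws for every taxon
    U = np.where(gamma[:, None], slab, 0.0)                          # spike: unselected rows are exactly 0
    return U, gamma

U, gamma = sample_spike_and_slab(n_taxa=100, latent_dim=3)
```

Rows of `U` with `gamma` equal to `False` are exactly zero, so those taxa contribute nothing to the inner product with the metabolite embeddings.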

Variational Inference Optimization

We employ mean-field variational inference to approximate the intractable posterior distribution:

$$q(\Theta) = q(\mathbf{U}) \cdot q(\gamma) \cdot q(\mathbf{V})$$

where

$$q(\mathbf{U}) = \prod_{i=1}^N \left[ \gamma_i \mathcal{N}(\mathbf{u}_i | \alpha_{U_i}, \beta_{U_i}^2) + (1 - \gamma_i) \delta_0(\mathbf{u}_i) \right]$$

$$q(\gamma) = \prod_{i=1}^N \prod_{l=1}^L \text{Bernoulli}(\gamma_{il} | \xi_{il})$$

$$q(\mathbf{V}) = \prod_{j=1}^M \mathcal{N}(\mathbf{v}_j | \alpha_{V_j}, \beta_{V_j}^2)$$

The variational parameters $$\Theta = \{\alpha_U, \beta_U, \xi, \alpha_V, \beta_V\}$$ are optimized by gradient-based maximization of the Evidence Lower Bound (ELBO).
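
One way to read the factorization is as a recipe for drawing a single sample from q. The sketch below assumes element-wise selectors and diagonal Gaussians, as in the Bernoulli and Gaussian factors above; `sample_q` and the toy parameter values are hypothetical and only illustrate the structure.

```python
import numpy as np

def sample_q(alpha_U, beta_U, xi, alpha_V, beta_V, rng):
    """Draw one (U, gamma, V) sample from the mean-field variational family."""
    gamma = rng.random(xi.shape) < xi                                     # gamma_il ~ Bernoulli(xi_il)
    U = gamma * (alpha_U + beta_U * rng.standard_normal(alpha_U.shape))   # slab draw, zeroed by the spike
    V = alpha_V + beta_V * rng.standard_normal(alpha_V.shape)             # v_j ~ N(alpha_V, beta_V^2)
    return U, gamma, V

rng = np.random.default_rng(0)
N, M, L = 6, 4, 3                                  # toy sizes
alpha_U = rng.normal(size=(N, L))                  # variational means for U
beta_U = np.full((N, L), 0.1)                      # variational standard deviations for U
xi = rng.uniform(size=(N, L))                      # inclusion probabilities
alpha_V = rng.normal(size=(M, L))                  # variational means for V
beta_V = np.full((M, L), 0.1)                      # variational standard deviations for V
U, gamma, V = sample_q(alpha_U, beta_U, xi, alpha_V, beta_V, rng)
```

Writing the Gaussian draws as mean plus scaled noise keeps the sample a differentiable function of the means and standard deviations, which is convenient for gradient-based optimization.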

Evidence Lower Bound (ELBO)

The optimization objective combines data likelihood with Bayesian regularization:

$$\mathcal{L}[q(\Theta)] = \mathbb{E}_{q(\Theta)}[\log p(\mathcal{D} | \Theta)] - \text{KL}[q(\Theta) \| p(\Theta)]$$

The first term ensures data fidelity while the KL divergence term provides automatic complexity control through Bayesian regularization.
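
The two building blocks of the KL term have closed forms for Gaussian and Bernoulli factors. The sketch below shows only those pieces, under the assumption of a zero-mean Gaussian prior with standard deviation `sigma_prior` and a Bernoulli prior with inclusion probability `pi` (both hypothetical names); the full spike-and-slab KL used by VBayesMM may be assembled differently.

```python
import numpy as np

def kl_gaussian(alpha, beta, sigma_prior=1.0):
    """KL[N(alpha, beta^2) || N(0, sigma_prior^2)], summed over all entries."""
    var_ratio = (beta / sigma_prior) ** 2
    return 0.5 * np.sum(var_ratio + (alpha / sigma_prior) ** 2 - 1.0 - np.log(var_ratio))

def kl_bernoulli(xi, pi=0.5, eps=1e-12):
    """KL[Bernoulli(xi) || Bernoulli(pi)], summed over all entries."""
    xi = np.clip(xi, eps, 1.0 - eps)              # avoid log(0)
    return np.sum(xi * np.log(xi / pi) + (1.0 - xi) * np.log((1.0 - xi) / (1.0 - pi)))

def elbo(expected_log_lik, kl_terms):
    """ELBO = E_q[log p(D | Theta)] - sum of KL terms."""
    return expected_log_lik - sum(kl_terms)

# Toy usage: the KL terms vanish when q equals the prior and grow as q moves away from it.
kl_v = kl_gaussian(np.ones(2), np.ones(2))        # means shifted by 1, unit variance
kl_g = kl_bernoulli(np.full(3, 0.5))              # identical to the prior
value = elbo(-10.0, [kl_v, kl_g])
```

This makes the regularization role of the KL term visible: any departure of q from the prior lowers the ELBO unless it is paid for by a higher expected log-likelihood.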

Probabilistic Feature Selection

Unlike traditional L1/L2 regularization, our spike-and-slab approach provides principled Bayesian feature selection with interpretable selection probabilities.

Feature i is selected if P(γᵢ = 1 | data) = E[γᵢ | data] > τ_threshold
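
A minimal sketch of this thresholding rule, assuming the posterior inclusion probabilities are stored as an N × L array and averaged over the latent dimensions to obtain one score per taxon. Both the averaging step and the helper name `select_features` are assumptions made for illustration.

```python
import numpy as np

def select_features(xi, tau=0.5):
    """Return indices of taxa whose mean inclusion probability exceeds tau, plus all scores."""
    scores = xi.mean(axis=1)                    # one score per taxon (assumed aggregation)
    selected = np.flatnonzero(scores > tau)     # strict threshold, as in the rule above
    return selected, scores

xi = np.array([[0.9, 0.8],
               [0.1, 0.2],
               [0.6, 0.7],
               [0.5, 0.5]])
selected, scores = select_features(xi, tau=0.5)
```

Raising `tau` gives a shorter, more conservative list of taxa; lowering it trades specificity for recall.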

Uncertainty Quantification

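The Bayesian neural network architecture provides full posterior distributions over parameters, enabling robust uncertainty estimation. By the law of total variance, the predictive variance splits into a parameter-uncertainty term and an expected noise term:

σ²(prediction) = Var_θ[E[y|x,θ]] + E_θ[Var[y|x,θ]]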
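
The variance decomposition can be estimated by Monte Carlo from posterior samples. The sketch below assumes that, for each sampled parameter setting, the model returns a predictive mean and a predictive variance; the helper `predictive_variance` is hypothetical.

```python
import numpy as np

def predictive_variance(means, variances):
    """Combine per-sample predictive means and variances (sample axis first)."""
    epistemic = means.var(axis=0)         # spread of the means across parameter samples
    aleatoric = variances.mean(axis=0)    # average within-sample variance
    return epistemic + aleatoric, epistemic, aleatoric

means = np.array([1.0, 2.0, 3.0])         # predictive means from three parameter samples
variances = np.array([0.5, 0.5, 0.5])     # matching predictive variances
total, epistemic, aleatoric = predictive_variance(means, variances)
```

The first term shrinks as the posterior over parameters concentrates, while the second reflects noise that more training data does not remove.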

Scalable Implementation

Variational inference replaces costly MCMC sampling of the intractable posterior with efficient gradient-based optimization, with a stated O(n log n) complexity.

∇_θ ELBO = E_q[∇_θ log p(D|θ)] - ∇_θ KL[q||p]

Conditional Probability Modeling

Models microbe-metabolite relationships through conditional probabilities using multinomial likelihood with neural network embeddings.

p(metabolite_j | microbe_i) = softmax(uᵢᵀvⱼ + bⱼ)

Model Architecture

A variational Bayesian inference pipeline with spike-and-slab priors in the encoder network for sparse feature selection:

  • Microbiome Input: 16S/WGS → Sparse Matrix
  • Encoder Network: Spike-and-Slab + ELBO
  • Latent Space: Low-Dimensional Encoding
  • Decoder Network: Metabolome Prediction + ELBO

# Quick Start - TensorFlow Implementation
import vbayesmm as vb
import tensorflow as tf

# Load your multiomics data
microbes, metabolites = vb.load_data("data/")

# Initialize VBayesMM model
model = vb.VBayesMM(
    input_dim=microbes.shape[1],
    output_dim=metabolites.shape[1],
    latent_dim=50,
    framework="tensorflow"
)

# Train with ELBO optimization
history = model.fit(
    microbes, metabolites,
    epochs=1000,
    batch_size=32,
    validation_split=0.2
)

# Extract important species
important_species = model.get_selected_features(threshold=0.5)
print(f"Selected {len(important_species)} important microbial species")

Framework Support

Choose your preferred deep learning framework with identical functionality

TensorFlow

v2.x Compatible

  • Eager execution support
  • TensorBoard integration
  • tf.data pipeline optimization
  • Distributed training ready
  • SavedModel export format

PyTorch

v1.x Compatible

  • Dynamic computation graphs
  • Native Python debugging
  • TorchScript compilation
  • ONNX export support
  • Lightning integration

Applications & Datasets

Sleep Disorders

Obstructive Sleep Apnea (OSA)

Analysis of microbiome-metabolome relationships in mouse models of sleep apnea using 16S rRNA sequencing and LC-MS/MS metabolomics.

  • 16S rRNA gene sequencing
  • LC-MS/MS metabolomics
  • Mouse model validation

Metabolic Disorders

High-Fat Diet (HFD) Model

Investigation of diet-induced changes in microbiome-metabolome interactions in murine models of metabolic dysfunction.

  • Diet-induced obesity model
  • Metabolic pathway analysis
  • Longitudinal data support

Cancer Research

Gastric & Colorectal Cancer

Clinical application in cancer patients using whole-genome shotgun sequencing and CE-TOFMS metabolomics.

  • WGS microbiome profiling
  • CE-TOFMS metabolomics
  • Stage 0-4 cancer patients

Code Examples

Let us first load the sample data to walk through an example of VBayesMM. VBayesMM supports loading data in biom, tsv, and csv formats.

TensorFlow Implementation

Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import biom
from biom import load_table, Table
from scipy.stats import entropy, spearmanr
from scipy.sparse import coo_matrix
import tensorflow as tf
from VBayesMM import VBayesMM

Loading and preparing data in examples/

# Load the paired microbiome and metabolome tables
microbes = load_table("microbes.biom")
metabolites = load_table("metabolites.biom")
microbes_df = microbes.to_dataframe()
metabolites_df = metabolites.to_dataframe()
# Store the tables as sparse columns, since most entries are zero
microbes_df = microbes_df.astype(pd.SparseDtype("float64", fill_value=0))
metabolites_df = metabolites_df.astype(pd.SparseDtype("float64", fill_value=0))
# Keep only the row labels shared by both tables (rows are assumed to be samples)
microbes_df, metabolites_df = microbes_df.align(metabolites_df, axis=0, join='inner')
# Hold out num_test samples; replace=False guarantees num_test distinct samples
num_test = 20
sample_ids = set(np.random.choice(microbes_df.index, size=num_test, replace=False))
sample_ids = np.array([(x in sample_ids) for x in microbes_df.index])
train_microbes_df = microbes_df.loc[~sample_ids]
test_microbes_df = microbes_df.loc[sample_ids]
train_metabolites_df = metabolites_df.loc[~sample_ids]
test_metabolites_df = metabolites_df.loc[sample_ids]
# Convert the microbe tables to SciPy COO sparse matrices for the model
train_microbes_coo = coo_matrix(train_microbes_df.values)
test_microbes_coo = coo_matrix(test_microbes_df.values)

Creating, training, and testing a model

model = VBayesMM()
config = tf.compat.v1.ConfigProto()
# Build and train the model inside a TF1-style graph and session
with tf.Graph().as_default(), tf.compat.v1.Session(config=config) as session:
    model(session, train_microbes_coo, train_metabolites_df.values, test_microbes_coo, test_metabolites_df.values)
    ELBO, _, SMAPE = model.fit(epoch=5000)
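
For readers unfamiliar with the SMAPE metric returned by `fit`, the sketch below implements one common definition (mean absolute error scaled by the average magnitude of truth and prediction, in percent). This is an assumption for illustration; the exact definition used inside VBayesMM may differ, and `smape` here is a stand-alone helper, not the package's function.

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-12):
    """Symmetric mean absolute percentage error, in percent (one common definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0           # average magnitude
    ratio = np.abs(y_pred - y_true) / np.maximum(denom, eps)  # guard against 0/0
    return 100.0 * ratio.mean()

perfect = smape([1.0, 2.0], [1.0, 2.0])    # identical arrays
halved = smape([100.0], [50.0])            # prediction is half the true value
```

Under this definition the score is 0 for a perfect prediction and is bounded above by 200, so lower values indicate better agreement with the measured metabolite abundances.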

Results Visualization

[Figure: ELBO on the train data; SMAPE on the test data]

Visualizing the posterior distributions of model outputs

latent_microbiome_matrix = model.U
microbial_species_selection = model.U_mean_gamma
microbial_species_selection_mean = np.sort(np.mean(microbial_species_selection, axis=1))[::-1]
[Figure: latent microbiome matrix; microbial species selection]

PyTorch Implementation

Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import biom
from biom import load_table, Table
from scipy.stats import entropy, spearmanr
from scipy.sparse import coo_matrix
import torch
from VBayesMM import VBayesMM

Loading and preparing data

# Load the paired microbiome and metabolome tables
microbes = load_table("microbes.biom")
metabolites = load_table("metabolites.biom")
microbes_df = microbes.to_dataframe()
metabolites_df = metabolites.to_dataframe()
# Store the tables as sparse columns, since most entries are zero
microbes_df = microbes_df.astype(pd.SparseDtype("float64", fill_value=0))
metabolites_df = metabolites_df.astype(pd.SparseDtype("float64", fill_value=0))
# Keep only the row labels shared by both tables (rows are assumed to be samples)
microbes_df, metabolites_df = microbes_df.align(metabolites_df, axis=0, join='inner')
# Hold out num_test samples; replace=False guarantees num_test distinct samples
num_test = 20
sample_ids = set(np.random.choice(microbes_df.index, size=num_test, replace=False))
sample_ids = np.array([(x in sample_ids) for x in microbes_df.index])
train_microbes_df = microbes_df.loc[~sample_ids]
test_microbes_df = microbes_df.loc[sample_ids]
train_metabolites_df = metabolites_df.loc[~sample_ids]
test_metabolites_df = metabolites_df.loc[sample_ids]
# n training samples, d1 microbial features, d2 metabolite features
n, d1 = train_microbes_df.shape
n, d2 = train_metabolites_df.shape
# Convert microbes to SciPy COO sparse matrices and metabolites to float32 tensors
train_microbes_coo = coo_matrix(train_microbes_df.values)
test_microbes_coo = coo_matrix(test_microbes_df.values)
trainY_torch = torch.tensor(train_metabolites_df.to_numpy(), dtype=torch.float32)
testY_torch = torch.tensor(test_metabolites_df.to_numpy(), dtype=torch.float32)

Creating, training, and testing a model

model = VBayesMM(d1=d1, d2=d2, num_samples=n)
ELBO, _, SMAPE = model.fit(train_microbes_coo, trainY_torch, test_microbes_coo, testY_torch, epochs=5000)

Results Visualization

[Figure: ELBO on the train data; SMAPE on the test data]

Visualizing the posterior distributions of model outputs

latent_microbiome_matrix = np.array(model.qUmain_mean.weight.data.detach())
microbial_species_selection = np.array(model.qUmain_mean_gamma.detach())
microbial_species_selection_mean = np.sort(np.mean(microbial_species_selection, axis=1))[::-1]
[Figure: latent microbiome matrix; microbial species selection]

Source Code

All of the code is in the src/ folder. You can use it to reproduce the analyses in the paper:

  • tensorflow/VBayesMM.py: Python code for the VBayesMM method, for TensorFlow users.
  • pytorch/VBayesMM.py: Python code for the VBayesMM method, for PyTorch users.

If you have any problems, please contact me by email: dangthanhtung91@vn-bml.com