Model building with Liesel#
In this tutorial, we go into more depth regarding the model building functionality in Liesel.
Liesel is based on the concept of probabilistic graphical models (PGMs) to represent (primarily Bayesian) statistical models, so let us start with a very brief look at what PGMs are and how they are implemented in Liesel.
Probabilistic graphical models#
In a PGM, each variable is represented as a node. There are two basic types of nodes in Liesel: strong and weak nodes. A strong node is a node whose value is defined “outside” of the model, for example, if the node represents some observed data or a parameter (parameters are usually set by an inference algorithm such as an optimizer or sampler). In contrast, a weak node is a node whose value is defined “within” the model, that is, it is a deterministic function of some other nodes. An exp-transformation mapping a real-valued parameter to a positive number, for example, would be a weak node.
In addition, each node can have an optional probability distribution. The probability density or mass function of the distribution evaluated at the value of the node gives its log-probability. In a typical Bayesian regression model, the response node would have a normal distribution and the parameter nodes would have some prior distribution (for example, a normal-inverse-gamma prior). The following table shows the different node types and some examples of their use cases.
Strong node |
Weak node |
|
|---|---|---|
With distribution |
Response, parameter, … |
Copula, … |
Without distribution |
Covariate, hyperparameter, … |
Inverse link function, parameter transformation, … |
A PGM is a collection of connected nodes. Two nodes can be connected through a directed edge, meaning that the first node is an input for the value or the distribution of the second node. Nodes without an edge between them are assumed to be conditionally independent, allowing us to factorize the model log-probability as
So let us consider the same model and data from the linear regression tutorial, where we had the underlying model \(y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \;\sigma^2)\) with the true parameters \(\boldsymbol{\beta} = (\beta_0, \beta_1)' = (1, 2)'\) and \(\sigma = 1\).
import jax
import jax.numpy as jnp
import numpy as np
import matplotlib.pyplot as plt
# We use distributions and bijectors from tensorflow probability
import tensorflow_probability.substrates.jax.distributions as tfd
import tensorflow_probability.substrates.jax.bijectors as tfb
import liesel.goose as gs
import liesel.model as lsl
rng = np.random.default_rng(42)
# sample size and true parameters
n = 500
true_beta = np.array([1.0, 2.0])
true_sigma = 1.0
# data-generating process
x0 = rng.uniform(size=n)
X_mat = np.column_stack([np.ones(n), x0])
eps = rng.normal(scale=true_sigma, size=n)
y_vec = X_mat @ true_beta + eps
# plot the simulated data
plt.scatter(x0, y_vec)
plt.title("Simulated data from the linear regression model")
plt.xlabel("Covariate x")
plt.ylabel("Response y")
plt.show()

Building the model graph#
The graph of a Bayesian linear regression model is a tree, where the hyperparameters of the prior are the leaves and the response is the root. To build this tree in Liesel, we need to start from the leaves and work our way down to the root.
In the linear regression tutorial, we assumed the weakly informative
prior \(\beta_0, \beta_1 \sim \mathcal{N}(0, 100^2)\), so we start from
there and then work our way down. To do so, we need to define its
initial value and its node distribution using the Dist class.
beta_prior = lsl.Dist(tfd.Normal, loc=0.0, scale=100.0)
Note that you could also provide a Var instance for the loc
and scale argument - this fact allows you to build hierarchical
models. If you provide floats like we do here, Liesel will turn them
into Value nodes under the hood.
With this distribution object, we can now create the node for our
regression coefficient with the Var.new_param() constructor:
beta = lsl.Var.new_param(value=np.array([0.0, 0.0]), distribution=beta_prior, name="beta")
The second branch of the tree contains the residual standard deviation,
which we again build using the weakly informative prior
\(\sigma^2 \sim \text{InverseGamma}(a, b)\) with \(a = b = 0.01\) on the
squared standard deviation. This time, we supply the hyperparameters as
Var instances.
a = lsl.Var.new_param(0.01, name="a")
b = lsl.Var.new_param(0.01, name="b")
sigma_sq_prior = lsl.Dist(tfd.InverseGamma, concentration=a, scale=b)
sigma_sq = lsl.Var.new_param(value=10.0, distribution=sigma_sq_prior, name="sigma_sq")
The variable constructor Var.new_calc() always takes a function
as its first argument, and the nodes to be used as function inputs as
the following arguments. We use it to compute the square root of our
variance, since we again also need to work with the scale. This is the
first weak node that we are setting up - all previous nodes have been
strong.
sigma = lsl.Var.new_calc(jnp.sqrt, sigma_sq, name="sigma").update()
We do the same to compute the matrix-vector product after constructing the design matrix
X = lsl.Var.new_obs(X_mat, name="X")
mu = lsl.Var.new_calc(jnp.dot, X, beta, name="mu")
Finally, we can connect the branches of the tree in a response node. The node values are our observed response values, and as before, we must specify the response distribution.
y_dist = lsl.Dist(tfd.Normal, loc=mu, scale=sigma)
y = lsl.Var.new_obs(y_vec, distribution=y_dist, name="y")
Now, to construct a full-fledged Liesel model from our individual node objects, we only add the response node.
model = lsl.Model([y])
model
Model(24 nodes, 8 vars)
Since all other nodes are directly or indirectly connected to this node, the Model has collected those nodes automatically.
To visualize the statistical graph of a model we call the
plot_vars() function. Strong nodes are shown in blue, weak
nodes in red. Nodes with a probability distribution are highlighted with
a star. In the figure below, we can see the tree-like structure of the
graph and identify the two branches for the mean and the standard
deviation of the response.
lsl.plot_vars(model)

Node and model log-probabilities#
The log-probability of the model, which can be interpreted as the
unnormalized log-posterior in a Bayesian context, can be accessed with
the log_prob property.
model.log_prob
Array(-1179.6559, dtype=float32)
The individual nodes also have a log_prob property. In fact, because
of the conditional independence assumption of the model, the
log-probability of the model is given by the sum of the
log-probabilities of the nodes with probability distributions. We take
the sum for the .log_prob attributes of beta and y because, per
default, the attributes return the individual log-probability
contributions of each element in the values of the nodes. So for beta
we would get two log-probability values, and for y we would get 500.
beta.log_prob.sum() + sigma_sq.log_prob + y.log_prob.sum()
Array(-1179.6559, dtype=float32)
Nodes without a probability distribution return a log-probability of zero.
sigma.log_prob
0.0
The log-probability of a node depends on its value and its inputs. Thus, if we change the variance of the response from 10 to 1, the log-probability of the corresponding node, the log-probability of the response node, and the log-probability of the model change as well.
print(f"Old value of sigma_sq: {sigma_sq.value}")
Old value of sigma_sq: 10.0
print(f"Old log-prob of sigma_sq: {sigma_sq.log_prob}")
Old log-prob of sigma_sq: -6.972140312194824
print(f"Old log-prob of y: {y.log_prob.sum()}\n")
Old log-prob of y: -1161.635498046875
sigma_sq.value = 1.0
print(f"New value of sigma_sq: {sigma_sq.value}")
New value of sigma_sq: 1.0
print(f"New log-prob of sigma_sq: {sigma_sq.log_prob}")
New log-prob of sigma_sq: -4.655529975891113
print(f"New log-prob of y: {y.log_prob.sum()}\n")
New log-prob of y: -1724.6702880859375
print(f"New model log-prob: {model.log_prob}")
New model log-prob: -1740.3740234375
For most inference algorithms, we need the gradient of the model log-probability with respect to the parameters. Liesel uses the JAX library for numerical computing and machine learning to compute gradients using automatic differentiation.