Course Information
- Course: Probabilistic Artificial Intelligence (PAI)
- Semester: Fall 2025
- University: ETH Zurich
- Status: Completed
- 📄 Course Materials: PDF Notes
PAI Core Principle: A key aspect of intelligence is not just making decisions, but reasoning about the uncertainty behind those decisions and taking that uncertainty into account when acting. This is what PAI is about.
Welcome to my deep dive into Probabilistic Artificial Intelligence! This course is fundamentally changing how I think about machine learning and AI systems. It's not just about making machines smart, but about making them humble: systems that know what they don't know, and act cautiously when uncertainty is high.
Part I: Probabilistic Approaches to Machine Learning
The first part of the course covers probabilistic approaches to machine learning. A crucial distinction throughout is between two types of uncertainty:
- Epistemic uncertainty - due to lack of data (reducible)
- Aleatoric uncertainty - inherent noise in observations and outcomes (irreducible)
Then we discuss concrete approaches toward probabilistic inference, including some fascinating methods:
1. Bayesian Linear Regression
Think of it as linear regression with a twist: instead of just one line through the data, we put a probability distribution over all possible lines, updating our belief as we see more points. The result is not just a prediction, but a quantified uncertainty around that prediction.
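The "distribution over all possible lines" has a closed form. Here is a minimal NumPy sketch (the prior precision `alpha` and noise precision `beta` are illustrative values I chose, not anything specific to the course):

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=25.0):
    """Posterior N(mu, Sigma) over weights w for y = X w + noise,
    with prior w ~ N(0, alpha^-1 I) and noise variance beta^-1."""
    Sigma_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X
    Sigma = np.linalg.inv(Sigma_inv)
    mu = beta * Sigma @ X.T @ y
    return mu, Sigma

def blr_predict(x_star, mu, Sigma, beta=25.0):
    """Predictive mean and variance at a new input x_star."""
    mean = x_star @ mu
    # aleatoric (noise) term + epistemic (parameter) term
    var = 1.0 / beta + x_star @ Sigma @ x_star
    return mean, var

# Fit a line with intercept: features are [1, x]
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
X = np.stack([np.ones_like(x), x], axis=1)
y = 0.5 + 2.0 * x + rng.normal(0, 0.2, size=20)

mu, Sigma = blr_posterior(X, y)
mean, var = blr_predict(np.array([1.0, 0.0]), mu, Sigma)
```

Each new data point tightens `Sigma`, which is exactly the "updating our belief as we see more points" described above.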
2. Gaussian Process Models
A Gaussian process is like saying: "I don't know the exact function, but I'll assume smoothness and let the data tell me the shape." It's a powerful non-parametric model that gives both predictions and confidence intervals in a mathematically elegant way.
3. Bayesian Neural Networks
Neural networks, but probabilistic: instead of fixed weights, we learn distributions over weights. This lets the network say "I'm confident here" or "I'm uncertain there", which is critical for safe decision-making in the real world.
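One cheap way to see "distributions over weights" in action is Monte Carlo dropout: predict many times with random dropout masks and read off the spread. This toy sketch uses an untrained random network purely to show the mechanism (all sizes and the dropout rate are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
# A tiny 1-hidden-layer network with random (untrained) weights
W1 = rng.normal(size=(1, 50)); b1 = np.zeros(50)
W2 = rng.normal(size=(50, 1)) / np.sqrt(50); b2 = np.zeros(1)

def predict_mc(x, n_samples=200, p_drop=0.2):
    """Monte Carlo predictions under random dropout masks:
    each mask is a different sampled 'weight configuration'."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(50) > p_drop              # random subnetwork
        h = np.tanh(x @ W1 + b1) * mask / (1 - p_drop)
        preds.append((h @ W2 + b2).item())
    preds = np.array(preds)
    return preds.mean(), preds.std()                # prediction + uncertainty

mean, std = predict_mc(np.array([[0.3]]))
```

The standard deviation across samples is the network's way of saying "I'm uncertain here"; a proper Bayesian neural network replaces the ad-hoc masks with learned weight distributions.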
Part II: Uncertainty in Sequential Tasks
The second part is about uncertainty in sequential tasks. We consider active learning and optimization approaches that collect data by proposing experiments chosen to reduce epistemic uncertainty.
4. Bayesian Optimization
When experiments are expensive (think: running clinical trials or tuning giant ML models), Bayesian optimization helps choose the next experiment to run by balancing exploration and exploitation. It actively seeks the most informative data points to shrink uncertainty fastest.
How Do We Know Which Experiments Are Informative?
Great question! We measure informativeness through epistemic uncertainty — the uncertainty that comes from not having enough data. By asking "where am I most ignorant?", the algorithm proposes experiments that are maximally clarifying.
When we say an algorithm knows which experiments are informative, we're talking about information-theoretic criteria that measure how much an experiment would reduce epistemic uncertainty:
- Posterior variance: In Gaussian processes, uncertainty at a point is quantified by the posterior variance. The algorithm chooses new inputs where this variance is largest.
- Expected information gain: "If I ran this experiment, how much would it shrink my uncertainty about the model?" This is formalized as maximizing the expected reduction in entropy.
- Acquisition functions: Functions like Upper Confidence Bound (UCB) or Expected Improvement (EI) trade off exploration and exploitation:
  - UCB: "Try points with high predicted value plus high uncertainty."
  - EI: "Try points likely to improve upon the best outcome so far."
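Both acquisition functions above are a few lines of code once you have a posterior mean μ(x) and standard deviation σ(x). In this sketch the GP posterior is mocked with hand-picked values so the trade-off is visible (all numbers are made up for illustration):

```python
import math
import numpy as np

def ucb(mu, sigma, beta=2.0):
    """Upper Confidence Bound: predicted value plus an exploration bonus."""
    return mu + beta * sigma

def expected_improvement(mu, sigma, best):
    """EI: expected amount by which this point beats the incumbent `best`."""
    if sigma == 0.0:
        return 0.0
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # standard normal pdf
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))            # standard normal cdf
    return (mu - best) * cdf + sigma * pdf

# Mocked GP posterior over three candidate points
mus = np.array([0.1, 0.5, 0.4])
sigmas = np.array([0.9, 0.1, 0.5])
best_so_far = 0.45

next_ucb = int(np.argmax([ucb(m, s) for m, s in zip(mus, sigmas)]))
next_ei = int(np.argmax([expected_improvement(m, s, best_so_far)
                         for m, s in zip(mus, sigmas)]))
```

Note how both criteria can prefer the first candidate: its mean is lowest, but its large uncertainty makes it the most informative experiment.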
Reinforcement Learning & MDPs
Then we cover reinforcement learning (RL), a rich formalism for modeling agents that learn to act in uncertain environments.
5. Markov Decision Process (MDP)
The MDP is the mathematical backbone of reinforcement learning. It models the world as states, actions, and rewards, capturing the idea that decisions today affect both what you see tomorrow and what long-term payoff you'll get.
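Value iteration makes "decisions today affect long-term payoff" concrete: repeatedly back up the Bellman equation until the values converge. The 3-state toy MDP below is entirely made up for illustration:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[a, s, s'] = probability of moving from s to s' under action a
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],   # action 0
    [[0.2, 0.8, 0.0], [0.2, 0.0, 0.8], [0.2, 0.0, 0.8]],   # action 1
])
R = np.array([0.0, 0.0, 1.0])   # reward received in each state

V = np.zeros(n_states)
for _ in range(500):                  # Bellman backups until convergence
    Q = R + gamma * P @ V             # Q[a, s]: act now, then follow V
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=0)             # greedy policy w.r.t. the final values
```

State 2 is absorbing under action 0, so its value converges to 1/(1−γ) = 10, and the rewardless states inherit discounted value from their paths toward it.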
6. RL with Neural Network Approximations
Modern RL uses deep neural networks to approximate value functions or policies in huge state spaces. This gives us "deep RL" — the engine behind agents that can play Go, control robots, or learn strategies in complex, high-dimensional environments.
Model-Based RL and Safety
We close by discussing modern approaches in model-based RL, which use epistemic and aleatoric uncertainty to guide exploration, while also reasoning about safety.
The Big Picture: Probabilistic AI is not just about making machines smart, but about making them humble: systems that know what they don't know, and act cautiously when uncertainty is high. That's the difference between a reckless model and a trustworthy intelligent agent.
Why This Matters
In a world where AI systems are making increasingly important decisions — from medical diagnoses to autonomous vehicle control — the ability to quantify and reason about uncertainty isn't just mathematically elegant, it's ethically essential.
PAI gives us the tools to build AI systems that:
- Make better decisions under uncertainty
- Know when to ask for more data
- Fail safely when confidence is low
- Learn more efficiently by being strategic about what to explore
This course is reshaping how I think about intelligence itself — not just as the ability to be right, but as the wisdom to know when you might be wrong.
Completed Tasks & Projects
Below are the key tasks I've completed for the PAI exam, demonstrating practical applications of the theoretical concepts covered in the course.
Task 1: Bayesian Linear Regression with Model Selection
The Problem
Given a dataset with input-output pairs, we need to:
- Implement Bayesian Linear Regression to model the relationship between inputs and outputs
- Perform model selection by comparing different polynomial feature transformations
- Use the marginal likelihood (evidence) to select the best model complexity
- Quantify prediction uncertainty using the posterior predictive distribution
Our Approach
We tackled this problem using a fully Bayesian framework:
- Feature Engineering: Applied polynomial transformations of varying degrees (1 to 10) to capture non-linear relationships
- Bayesian Inference: Instead of point estimates, we computed full posterior distributions over model parameters using conjugate priors (Gaussian-Gamma)
- Model Evidence: Calculated the marginal likelihood for each polynomial degree, which naturally penalizes overfitting through the Bayesian Occam's Razor principle
- Posterior Predictive: Generated predictions with credible intervals that reflect both aleatoric (data noise) and epistemic (parameter) uncertainty
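The evidence-based model selection above can be sketched compactly. For a Gaussian weight prior and Gaussian noise, the marginal likelihood of the targets is itself Gaussian, so comparing polynomial degrees is one log-density evaluation each. This sketch uses a simpler fixed-precision prior (`alpha`, `beta` are my illustrative choices, not the course's Gaussian-Gamma setup):

```python
import numpy as np

def log_evidence(X, y, alpha=1.0, beta=100.0):
    """log p(y | X) for w ~ N(0, alpha^-1 I) and noise variance beta^-1:
    marginally, y ~ N(0, beta^-1 I + alpha^-1 X X^T)."""
    n = len(y)
    C = np.eye(n) / beta + X @ X.T / alpha
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, 40)   # true degree: 2

# Score each polynomial degree by its evidence; no validation split needed
scores = {d: log_evidence(np.vander(x, d + 1, increasing=True), y)
          for d in range(1, 8)}
best_degree = max(scores, key=scores.get)
```

Degree 1 is punished for underfitting through the data-fit term, while high degrees are punished through the determinant term: Bayesian Occam's Razor in two lines of linear algebra.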
Key Insights
- Marginal Likelihood as Model Selection: The evidence automatically trades off model fit and complexity, selecting simpler models when data doesn't justify complexity
- Uncertainty Quantification: Unlike traditional regression, Bayesian methods provide confidence intervals that widen in regions with sparse data
- Occam's Razor in Action: More complex models are penalized unless they significantly improve fit, preventing overfitting
- No Cross-Validation Needed: The marginal likelihood handles model selection in one pass, without requiring train/validation splits
Task 2: Gaussian Processes for Regression
The Problem
Implement Gaussian Process (GP) regression to model complex, non-linear functions with uncertainty quantification:
- Build a Gaussian Process model from scratch using kernel functions
- Handle noisy observations while maintaining smooth predictions
- Implement different kernel functions (RBF, Matérn, Periodic) to capture various function behaviors
- Optimize hyperparameters (length scale, variance, noise) using maximum likelihood estimation
- Provide uncertainty bounds that adapt to data density
Our Approach
We implemented a complete Gaussian Process regression framework:
- Kernel Design: Implemented multiple kernel functions including RBF (Radial Basis Function), Matérn, and Periodic kernels to model different smoothness and periodicity assumptions
- GP Posterior Computation: Derived and implemented the exact posterior distribution over functions given observed data, using the kernel trick to avoid explicit feature space computation
- Hyperparameter Optimization: Maximized the log marginal likelihood with respect to kernel hyperparameters using gradient-based optimization (e.g., L-BFGS)
- Numerical Stability: Applied Cholesky decomposition for efficient and numerically stable inversion of covariance matrices
- Predictive Distribution: Computed mean and variance of the posterior predictive distribution at test points, providing both point predictions and uncertainty estimates
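The pipeline above fits in a short sketch: an RBF kernel, a Cholesky factorization instead of an explicit matrix inverse, and the exact posterior mean and variance at test points. Hyperparameters are fixed by hand here rather than optimized by marginal likelihood:

```python
import numpy as np

def rbf(A, B, lengthscale=0.3, variance=1.0):
    """RBF kernel matrix between row-vector inputs A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, noise=1e-2):
    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # stable: never form inv(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = rbf(X, X_star)
    mean = K_s.T @ alpha                             # posterior mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf(X_star, X_star)) - np.sum(v**2, axis=0)
    return mean, var

X = np.linspace(-1, 1, 10)[:, None]
y = np.sin(3 * X).ravel()
X_star = np.array([[0.0], [5.0]])                    # near vs. far from data
mean, var = gp_posterior(X, y, X_star)
```

The variance at x = 5 reverts to the prior variance while the variance at x = 0 collapses toward the noise level: exactly the "honest uncertainty far from data" behavior discussed below.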
Key Insights
- Non-Parametric Flexibility: GPs define distributions over functions rather than parameters, allowing them to adapt to arbitrary function complexity without fixed model structure
- Kernel as Prior: The choice of kernel encodes our prior beliefs about function smoothness, periodicity, and correlation structure—critical for good predictions
- Uncertainty-Aware Predictions: Predictive variance naturally increases in regions far from training data, providing honest uncertainty quantification
- Computational Challenges: GP inference scales as O(n³) due to matrix inversion, requiring approximations (e.g., inducing points) for large datasets
- Automatic Relevance Determination: Hyperparameter optimization naturally performs feature selection by learning which input dimensions are most relevant
- Connection to Bayesian Linear Regression: A GP can be viewed as Bayesian linear regression with infinitely many basis functions, making it the non-parametric limit of the parametric model
Task 3: Safe Bayesian Optimization with Constraints
The Problem
Optimize a black-box objective function while respecting safety constraints—a critical challenge in real-world applications like drug design:
- Maximize bioavailability f(x) of a drug formulation
- Ensure surface area v(x) ≤ 4.0 to remain safe for patients
- Handle noisy observations from expensive experiments
- Never violate safety constraints during the optimization process (safe exploration)
- Balance exploration vs. exploitation within the safe set
The key challenge: we cannot afford to test unsafe configurations, so we must learn both the objective and constraint functions conservatively.
Our Approach
We implemented a dual-GP safe Bayesian optimization framework:
- Dual Gaussian Processes: Maintained two independent GPs—one modeling the objective f(x) and another modeling the constraint violation g(x) = v(x) - 4.0
- Conservative Safe Set: Defined a point as safe if μ_g(x) + β·σ_g(x) ≤ -ε, where β = 3.0 provides high-probability safety guarantees (analogous to UCB but for constraints)
- Kernel Engineering: Used Matérn kernels for f(x) with fixed noise (σ_f = 0.15) and a composite kernel (Linear + Matérn + RBF) for v(x) to capture both global trends and local variations
- Safe Expected Improvement: Maximized Expected Improvement (EI) acquisition function only over the safe set, ensuring all proposed experiments respect safety constraints
- Exploration Bonus: Added a small variance term to the acquisition to avoid premature convergence and encourage safe exploration
- Graceful Degradation: Implemented fallback mechanisms when the GP becomes overly pessimistic—re-sample near previously safe observations
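The selection logic at the heart of this loop can be isolated from the GP fitting. In this sketch the two GP posteriors are mocked as arrays over a candidate grid; β and the safe-set rule follow the description above, while the candidate values are invented for illustration:

```python
import math
import numpy as np

def safe_ei_select(mu_f, sigma_f, mu_g, sigma_g, best, beta=3.0, eps=1e-3):
    """Expected Improvement on f, restricted to the conservative safe set
    where the constraint g(x) <= 0 holds with high probability."""
    safe = mu_g + beta * sigma_g <= -eps            # pessimistic safety check
    if not safe.any():
        return None                                  # trigger a fallback step
    ei = np.zeros_like(mu_f)
    for i in np.flatnonzero(safe):
        z = (mu_f[i] - best) / sigma_f[i]
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
        ei[i] = (mu_f[i] - best) * cdf + sigma_f[i] * pdf
    return int(np.argmax(ei))

# Mocked posteriors over 4 candidates: index 3 has the best objective mean
# but its constraint estimate is unsafe, so it must never be proposed.
mu_f = np.array([0.2, 0.6, 0.5, 0.9])
sigma_f = np.array([0.3, 0.2, 0.4, 0.2])
mu_g = np.array([-1.0, -0.8, -0.9, 0.5])
sigma_g = np.array([0.1, 0.1, 0.2, 0.1])
choice = safe_ei_select(mu_f, sigma_f, mu_g, sigma_g, best=0.55)
```

The tempting candidate with the highest predicted objective is excluded outright, which is the "never leave the safe set" guarantee in miniature.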
Key Insights
- Safety Through Uncertainty: High uncertainty in constraint predictions forces conservative behavior—the algorithm won't risk unsafe regions until it has more data
- Exploration-Safety Tradeoff: Safe BO must balance three objectives: maximize reward, explore the space, and never violate constraints—a harder problem than unconstrained BO
- Dual Modeling: Separate GPs for objective and constraint allow independent uncertainty quantification, crucial since we need conservative constraint estimates but optimistic objective estimates
- Never Leave the Safe Set: Unlike penalty methods or barrier functions, safe BO provides hard guarantees: if the initial point is safe and the GP is well-calibrated, all queries remain safe with high probability
- Beta Parameter Tuning: The safety parameter β controls risk aversion—higher β means more conservative (larger safe set margin), lower β allows more aggressive exploration
- Real-World Applications: This framework is critical for domains where constraint violations are unacceptable: medical treatments, autonomous systems, chemical processes, robotics
- Sample Efficiency: Safe BO achieves near-optimal solutions in fewer evaluations than random search or grid search while maintaining safety, crucial when experiments are expensive or time-consuming
Task 4: Maximum a Posteriori Policy Optimization (MPO)
The Problem
Implement a state-of-the-art deep reinforcement learning algorithm for continuous control:
- Train an agent to solve a continuous-action cartpole control task using only raw observations
- Handle continuous action spaces with Gaussian stochastic policies
- Balance exploration and exploitation while learning from experience
- Ensure stable learning by constraining policy updates to avoid catastrophic forgetting
- Use off-policy learning with a replay buffer for sample efficiency
- Implement actor-critic architecture with twin Q-functions for value estimation
Our Approach
We implemented MPO, an EM-style algorithm that constrains policy updates using KL divergence:
- Twin Critics (Clipped Double Q-Learning): Maintained two independent Q-networks (Q₁, Q₂) and used their minimum for target computation to reduce overestimation bias
- Gaussian Stochastic Actor: Policy network outputs mean μ and log-std for each action dimension, creating a diagonal Gaussian distribution. Actions are sampled using the reparameterization trick and squashed through tanh for bounded control
- E-Step (Critic Update): Updated critics toward Bellman targets computed by sampling K actions from the target policy at next states and averaging their Q-values
- M-Step (Policy Update): Updated policy to maximize expected Q-value using importance-weighted maximum likelihood:
- Sample K actions from current policy at observed states
- Compute importance weights w_k ∝ exp(Q(s,a_k)/η) using softmax
- Maximize weighted log-likelihood: Σ_k w_k log π(a_k|s)
- Temperature Parameter η: Adaptively controlled via Lagrangian optimization so that the KL divergence between the reweighted action distribution and the old policy stays near ε_KL, preventing drastic policy changes
- Soft Target Updates: Applied Polyak averaging (τ=0.005) to slowly update target networks, stabilizing learning
- Replay Buffer: Stored 50k transitions for off-policy sampling, enabling data reuse and breaking temporal correlations
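The M-step reweighting is worth seeing in isolation: sample K actions, turn their Q-values into softmax weights with temperature η, and those weights then multiply the log-likelihood in the policy update. The Q-values below are mocked; in the real agent they come from the twin critics:

```python
import numpy as np

def mpo_weights(q_values, eta=1.0):
    """Importance weights w_k proportional to exp(Q(s, a_k) / eta),
    normalized per state (a softmax over the K sampled actions)."""
    z = q_values / eta
    z -= z.max(axis=-1, keepdims=True)        # subtract max for stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

# One state, K = 4 sampled actions with mocked Q-values
q = np.array([[1.0, 2.0, 0.5, 3.0]])
w_sharp = mpo_weights(q, eta=0.5)   # low temperature: near-greedy weights
w_flat = mpo_weights(q, eta=10.0)   # high temperature: near-uniform weights
# The policy loss would then be  -(w * log_pi(a_k | s)).sum()
```

η is precisely the knob the Lagrangian tunes: small η concentrates all weight on the best sampled action (aggressive updates), large η spreads it out (conservative updates).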
Key Insights
- EM Perspective on RL: MPO frames policy improvement as an EM algorithm—E-step evaluates actions, M-step improves policy toward high-value actions weighted by their advantage
- Trust Region via KL Constraint: Unlike PPO's clipping or TRPO's line search, MPO uses a temperature parameter to implicitly enforce trust regions, ensuring smooth policy evolution
- Sample Reweighting: Importance weights naturally implement advantage-weighted regression: actions with high Q-values get more influence on policy updates
- Continuous Control Challenges: Tanh squashing ensures actions stay within bounds while maintaining differentiability for gradient-based optimization
- Twin Q-Functions: Taking min(Q₁, Q₂) for target computation reduces overoptimistic Q-value estimates that plague single-critic methods, improving stability
- Off-Policy Efficiency: Replay buffer enables multiple gradient updates per environment step, dramatically improving sample efficiency compared to on-policy methods like PPO
- Reparameterization Trick: Sampling a ~ N(μ, σ) as a = μ + σ·ε (ε ~ N(0,1)) allows backpropagation through stochastic sampling, critical for policy gradient estimation
- Stabilization Techniques: Combination of target networks, twin critics, soft updates, and KL constraints creates a remarkably stable learning algorithm suitable for complex control tasks