From basic algebra to advanced optimization — every mathematical concept you need for modern machine learning, explained visually.
Essential algebraic concepts that serve as building blocks for all machine learning mathematics.
In ML, x represents input features, y is the target output, and θ (theta) denotes learnable parameters. Vectors are bold v, matrices are capital W.
Read as: "y is a function of x, parameterized by theta."
A function maps inputs to outputs. Neural networks are compositions of simple functions.
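As a minimal sketch of this idea (the function names here are illustrative, not from any library), a tiny "network" can be built by composing a linear map, a ReLU, and another linear map:

```python
# A toy "network" as a composition of simple functions:
# linear -> ReLU -> linear, evaluated on a scalar input.

def linear(a, b):
    # Returns the function f(x) = a*x + b
    return lambda x: a * x + b

def relu(x):
    return max(0.0, x)

f = linear(2.0, 1.0)    # f(x) = 2x + 1
h = linear(3.0, -4.0)   # h(x) = 3x - 4

def network(x):
    # Composition h(relu(f(x)))
    return h(relu(f(x)))

print(network(1.0))   # f(1)=3, relu(3)=3, h(3)=5.0
print(network(-2.0))  # f(-2)=-3, relu(-3)=0, h(0)=-4.0
```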
| Function | Formula | Use |
|---|---|---|
| Linear | $f(x) = ax + b$ | Regression |
| ReLU | $\max(0, x)$ | Hidden layers |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | Binary probs |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Recurrent nets |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | Multi-class |
Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.
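The activations in the table above can be written in a few lines of NumPy. This is a sketch for intuition, not an optimized implementation; note the standard max-subtraction trick in softmax, which changes nothing mathematically but avoids overflow:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to shifting all inputs by a constant.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([-1.0, 0.0, 2.0])
print(relu(x))           # [0. 0. 2.]
print(sigmoid(0.0))      # 0.5
print(softmax(x).sum())  # 1.0 -- softmax outputs are probabilities
```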
Exponentials ($e^x$) model growth and appear in softmax. Logarithms turn products into sums, essential for loss functions.
Cross-entropy uses logs to prevent numerical underflow when multiplying many small probabilities.
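A quick demonstration of why this matters: multiplying 50 probabilities of $10^{-8}$ underflows to zero in floating point, while summing their logarithms stays perfectly representable.

```python
import math

probs = [1e-8] * 50

# Naive product of many small probabilities: underflows to 0.0,
# since 1e-400 is below the smallest representable float.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Sum of log-probabilities: log turns the product into a sum.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -921, a perfectly ordinary float
```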
Understanding set notation helps when reading ML papers. Common sets include $\mathbb{R}$ (real numbers), $\mathbb{R}^n$ ($n$-dimensional real vectors), $\mathbb{Z}$ (integers), and $\mathbb{N}$ (natural numbers).
The language of data. Vectors, matrices, and tensors — how we represent and transform information in ML.
A vector is an ordered list of numbers. One data point = one vector. An image with 784 pixels becomes a vector in $\mathbb{R}^{784}$.
Dot product measures similarity: $\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i$
| Operation | Notation | ML Use |
|---|---|---|
| Addition | $\mathbf{u} + \mathbf{v}$ | Residual connections |
| Dot product | $\mathbf{u} \cdot \mathbf{v}$ | Attention, similarity |
| Hadamard | $\mathbf{u} \circ \mathbf{v}$ | Dropout masking |
| Outer product | $\mathbf{u} \mathbf{v}^T$ | Gradient computation |
| Norm | $\|\mathbf{v}\|_2$ | Regularization |
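Each operation in the table above maps directly to a NumPy one-liner. A quick sketch on two small vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)              # addition: [5. 7. 9.]
print(u @ v)              # dot product: 1*4 + 2*5 + 3*6 = 32.0
print(u * v)              # Hadamard (elementwise): [ 4. 10. 18.]
print(np.outer(u, v))     # outer product u v^T, a 3x3 matrix
print(np.linalg.norm(v))  # L2 norm: sqrt(16 + 25 + 36) = sqrt(77)
```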
A matrix is a 2D array. Your dataset is a matrix where rows are samples and columns are features.
Matrix multiplication is the fundamental operation of neural networks: $\mathbf{y} = W\mathbf{x} + \mathbf{b}$
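A single layer computing $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ looks like this (the numbers are made up for illustration; $W$ maps a 3-dimensional input to a 2-dimensional output):

```python
import numpy as np

W = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0,  0.0]])   # shape (2, 3): 3 inputs -> 2 outputs
x = np.array([3.0, 4.0, 5.0])
b = np.array([0.5, -0.5])

y = W @ x + b
print(y)  # [1*3 + 0*4 - 1*5 + 0.5,  2*3 + 1*4 + 0 - 0.5] = [-1.5, 9.5]
```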
Matrices transform space: depending on their entries, they rotate, scale, or shear the grid.
| Type | Property | Use |
|---|---|---|
| Identity (I) | $AI = A$ | Skip connections |
| Diagonal | Non-zero on diagonal only | Scaling, covariance |
| Symmetric | $A = A^T$ | Distance matrices |
| Orthogonal | $A^T A = I$ | Rotations, PCA |
| Positive Definite | $\mathbf{x}^T A \mathbf{x} > 0$ | Convex optimization |
Breaking matrices into simpler components:
Eigenvectors are directions that only stretch (not rotate) when transformed. The stretch amount is the eigenvalue.
Find the eigenvectors of the covariance matrix. The one with the largest eigenvalue points in the direction of maximum variance: your first principal component.
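That recipe is only a few lines of NumPy. Here is a sketch on synthetic data deliberately stretched along the $(1, 1)$ direction, so we know what the first principal component should be:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data: strong variation along (1, 1), small noise elsewhere.
t = rng.normal(size=200)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(200, 2))

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)   # 2x2 covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so the last column is the eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]
print(pc1)  # approximately +/-[0.707, 0.707], i.e. the (1, 1) direction
```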
Tensors generalize vectors and matrices to any number of dimensions.
The mathematics of change. Derivatives tell us how to adjust parameters to improve our models.
The limit describes what happens as we approach a value. The derivative is defined as a limit.
The derivative is the slope of a function at a point.
| Function | Derivative |
|---|---|
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln(x)$ | $\frac{1}{x}$ |
| $\sin(x)$ | $\cos(x)$ |
| $\sigma(x)$ (sigmoid) | $\sigma(x)(1-\sigma(x))$ |
The chain rule is the foundation of backpropagation.
For functions with multiple inputs, the partial derivative measures change with respect to one input, holding others constant.
The gradient is a vector of all partial derivatives. It points in the direction of steepest increase. To minimize loss, we go opposite.
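Stepping opposite the gradient is all that gradient descent does. A minimal sketch on $f(x) = (x - 3)^2$, whose gradient is $2(x - 3)$ and whose minimum is at $x = 3$:

```python
def grad(x):
    # Gradient of f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

x = 0.0
lr = 0.1  # learning rate (eta)
for _ in range(100):
    x = x - lr * grad(x)  # move opposite the gradient

print(x)  # converges to approximately 3.0
```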
Integration computes areas under curves. In probability, it gives us cumulative probabilities.
Jacobian: Matrix of first derivatives
Hessian: Matrix of second derivatives (curvature)
Positive definite Hessian → convex (good for optimization).
Applying the chain rule backward through the computational graph to compute gradients.
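To make this concrete, here is backpropagation done by hand for a tiny graph, $z = wx$, $a = \max(0, z)$, $L = (a - y)^2$: the forward pass stores intermediates, and the backward pass multiplies local derivatives right to left.

```python
x, y = 2.0, 1.0
w = 3.0

# Forward pass: compute and store every intermediate value.
z = w * x            # 6.0
a = max(0.0, z)      # 6.0 (ReLU)
L = (a - y) ** 2     # 25.0

# Backward pass: chain rule from the loss back to the parameter.
# dL/dw = dL/da * da/dz * dz/dw
dL_da = 2.0 * (a - y)           # 10.0
da_dz = 1.0 if z > 0 else 0.0   # ReLU passes gradient only where z > 0
dz_dw = x                       # 2.0
dL_dw = dL_da * da_dz * dz_dw   # 20.0

print(dL_dw)  # 20.0
```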
Quantifying uncertainty. Essential for classification, predictions, and understanding model confidence.
$P(A)$ is the probability of event A, from 0 (impossible) to 1 (certain).
A random variable assigns numbers to outcomes. Can be discrete or continuous.
| Distribution | Type | Key Use |
|---|---|---|
| Normal/Gaussian | Continuous | Noise, weight init (via Xavier/He) |
| Uniform | Continuous | Random sampling, initialization |
| Bernoulli | Discrete | Binary outcomes, dropout |
| Categorical | Discrete | Multi-class classification |
| Multinomial | Discrete | Word frequencies, counts |
| Beta | Continuous | Probabilities (0 to 1) |
| Dirichlet | Continuous | Distribution over distributions |
Covariance: $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$
Update beliefs based on evidence:
Find parameters that maximize the probability of observed data.
Equivalent to minimizing negative log-likelihood (cross-entropy).
Finding the best parameters. The algorithms that enable machines to learn from data.
| Variant | Update | Pros/Cons |
|---|---|---|
| Batch | Full dataset | Stable, slow |
| Stochastic (SGD) | One sample | Fast, noisy |
| Mini-batch | 32-512 samples | Best balance |
The step size $\eta$ (eta) is crucial. Too large: divergence. Too small: slow convergence.
Accumulate velocity in directions of consistent gradient.
Helps escape shallow local minima and accelerates through flat regions.
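The update itself is two lines. A sketch on a toy quadratic loss $f(x) = x^2$ (the loss is chosen only for illustration):

```python
def grad(x):
    # Gradient of f(x) = x^2
    return 2.0 * x

x, v = 5.0, 0.0
lr, beta = 0.1, 0.9   # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v + grad(x)  # accumulate velocity from past gradients
    x = x - lr * v          # step along the velocity, not the raw gradient

print(x)  # approaches the minimum at 0.0
```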
| Optimizer | Key Feature |
|---|---|
| AdaGrad | Per-parameter LR based on history |
| RMSprop | Moving average of squared gradients |
| Adam | Momentum + RMSprop (most popular) |
| AdamW | Adam with proper weight decay |
| LAMB | Layer-wise LR for large batches |
Add a penalty term to the loss: $\lambda \|\theta\|_1$ (sparse weights) or $\lambda \|\theta\|_2^2$ (small weights)
Randomly zero neurons during training. Prevents co-adaptation.
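In the common "inverted dropout" form, surviving activations are rescaled by $\frac{1}{1-p}$ so the expected value is unchanged between training and inference. A sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, p=0.5):
    # Zero each activation with probability p, then rescale survivors
    # so that E[output] equals the input (inverted dropout).
    mask = (rng.random(a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)   # Hadamard product with the random mask

a = np.ones(10000)
out = dropout(a, p=0.5)
print((out == 0).mean())  # roughly 0.5 of units are zeroed
print(out.mean())         # roughly 1.0, matching the input in expectation
```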
Stop training when validation loss stops improving.
Normalize layer inputs. Stabilizes training, allows higher LR.
Use curvature information (Hessian) for faster convergence.
Newton's method is expensive ($O(n^3)$). L-BFGS approximates Hessian efficiently.
Information theory, numerical methods, and specialized techniques for modern deep learning.
Cross-entropy is the standard loss for classification. Minimizing it = making predicted distribution match true distribution.
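For a one-hot true label, cross-entropy $H(p, q) = -\sum_i p_i \log q_i$ reduces to the negative log-probability assigned to the correct class. A sketch comparing a confident-correct and a confident-wrong prediction:

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    # Clip predictions away from 0 to avoid log(0).
    q = np.clip(q_pred, eps, 1.0)
    return -np.sum(p_true * np.log(q))

label = np.array([0.0, 1.0, 0.0])   # true class is index 1

good = np.array([0.1, 0.8, 0.1])    # high probability on the true class
bad = np.array([0.6, 0.2, 0.2])     # low probability on the true class

print(cross_entropy(label, good))   # -log(0.8), about 0.223
print(cross_entropy(label, bad))    # -log(0.2), about 1.609
```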
Lagrange multipliers for optimization with constraints. Used in SVMs and when parameters must satisfy conditions.
Automatic differentiation computes exact derivatives by traversing the computational graph.
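One way to see how this differs from numerical approximation is forward-mode autodiff with dual numbers, sketched below: every value carries a (value, derivative) pair, and each operation applies the exact chain rule, so there is no finite-difference error.

```python
import math

class Dual:
    """A (value, derivative) pair propagated through each operation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dexp(d):
    # Chain rule for exp: (e^u)' = e^u * u'
    e = math.exp(d.val)
    return Dual(e, e * d.dot)

# d/dx [x * exp(x)] at x = 1 is exp(1) + 1*exp(1) = 2e
x = Dual(1.0, 1.0)   # seed: dx/dx = 1
y = x * dexp(x)
print(y.dot)  # 2e, about 5.4366, exact to machine precision
```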