From basic algebra to advanced optimization — every mathematical concept you need for modern machine learning, explained visually.
Essential algebraic concepts that serve as building blocks for all machine learning mathematics.
In ML, x represents input features, y is the target output, and θ (theta) denotes learnable parameters. Vectors are bold v, matrices are capital W.
Read as: "y is a function of x, parameterized by theta."
A function maps inputs to outputs. Neural networks are compositions of simple functions.
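As a minimal sketch of this idea (the function names here are illustrative, not from any library), a tiny "network" can be built by composing a linear map, a ReLU, and another linear map:

```python
# A toy "network" as a composition of simple functions:
# linear -> ReLU -> linear, evaluated on a scalar input.

def linear(a, b):
    # Returns the function f(x) = a*x + b
    return lambda x: a * x + b

def relu(x):
    return max(0.0, x)

f = linear(2.0, 1.0)    # f(x) = 2x + 1
h = linear(3.0, -4.0)   # h(x) = 3x - 4

def network(x):
    # Composition h(relu(f(x)))
    return h(relu(f(x)))

print(network(1.0))   # f(1)=3, relu(3)=3, h(3)=5.0
print(network(-2.0))  # f(-2)=-3, relu(-3)=0, h(0)=-4.0
```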
| Function | Formula | Use |
|---|---|---|
| Linear | $f(x) = ax + b$ | Regression |
| ReLU | $\max(0, x)$ | Hidden layers |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | Binary probs |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Recurrent nets |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | Multi-class |
Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.
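The activations in the table above can be written in a few lines of NumPy. This is a sketch for intuition, not an optimized implementation; note the standard max-subtraction trick in softmax, which changes nothing mathematically but avoids overflow:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to shifting all inputs by a constant.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([-1.0, 0.0, 2.0])
print(relu(x))           # [0. 0. 2.]
print(sigmoid(0.0))      # 0.5
print(softmax(x).sum())  # 1.0 -- softmax outputs are probabilities
```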
Exponentials ($e^x$) model growth and appear in softmax. Logarithms turn products into sums, essential for loss functions.
Cross-entropy uses logs to prevent numerical underflow when multiplying many small probabilities.
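A quick demonstration of why this matters: multiplying 50 probabilities of $10^{-8}$ underflows to zero in floating point, while summing their logarithms stays perfectly representable.

```python
import math

probs = [1e-8] * 50

# Naive product of many small probabilities: underflows to 0.0,
# since 1e-400 is below the smallest representable float.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Sum of log-probabilities: log turns the product into a sum.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -921, a perfectly ordinary float
```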
Understanding set notation helps when reading ML papers. Common sets include $\mathbb{R}$ (real numbers), $\mathbb{R}^n$ ($n$-dimensional real vectors), $\mathbb{Z}$ (integers), and $\mathbb{N}$ (natural numbers).
The language of data. Vectors, matrices, and tensors — how we represent and transform information in ML.
A vector is an ordered list of numbers. One data point = one vector. An image with 784 pixels becomes a vector in $\mathbb{R}^{784}$.
Dot product measures similarity: $\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i$
| Operation | Notation | ML Use |
|---|---|---|
| Addition | $\mathbf{u} + \mathbf{v}$ | Residual connections |
| Dot product | $\mathbf{u} \cdot \mathbf{v}$ | Attention, similarity |
| Hadamard | $\mathbf{u} \circ \mathbf{v}$ | Dropout masking |
| Outer product | $\mathbf{u} \mathbf{v}^T$ | Gradient computation |
| Norm | $\|\mathbf{v}\|_2$ | Regularization |
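Each operation in the table above maps directly to a NumPy one-liner. A quick sketch on two small vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)              # addition: [5. 7. 9.]
print(u @ v)              # dot product: 1*4 + 2*5 + 3*6 = 32.0
print(u * v)              # Hadamard (elementwise): [ 4. 10. 18.]
print(np.outer(u, v))     # outer product u v^T, a 3x3 matrix
print(np.linalg.norm(v))  # L2 norm: sqrt(16 + 25 + 36) = sqrt(77)
```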
A matrix is a 2D array. Your dataset is a matrix where rows are samples and columns are features.
Matrix multiplication is the fundamental operation of neural networks: $\mathbf{y} = W\mathbf{x} + \mathbf{b}$
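A single layer computing $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ looks like this (the numbers are made up for illustration; $W$ maps a 3-dimensional input to a 2-dimensional output):

```python
import numpy as np

W = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0,  0.0]])   # shape (2, 3): 3 inputs -> 2 outputs
x = np.array([3.0, 4.0, 5.0])
b = np.array([0.5, -0.5])

y = W @ x + b
print(y)  # [1*3 + 0*4 - 1*5 + 0.5,  2*3 + 1*4 + 0 - 0.5] = [-1.5, 9.5]
```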
Matrices transform space: depending on their entries, they rotate, scale, or shear the grid.
| Type | Property | Use |
|---|---|---|
| Identity (I) | $AI = A$ | Skip connections |
| Diagonal | Non-zero on diagonal only | Scaling, covariance |
| Symmetric | $A = A^T$ | Distance matrices |
| Orthogonal | $A^T A = I$ | Rotations, PCA |
| Positive Definite | $\mathbf{x}^T A \mathbf{x} > 0$ | Convex optimization |
Breaking matrices into simpler components:
Eigenvectors are directions that only stretch (not rotate) when transformed. The stretch amount is the eigenvalue.
Find the eigenvectors of the covariance matrix. The one with the largest eigenvalue points in the direction of maximum variance: your first principal component.
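That recipe is only a few lines of NumPy. Here is a sketch on synthetic data deliberately stretched along the $(1, 1)$ direction, so we know what the first principal component should be:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data: strong variation along (1, 1), small noise elsewhere.
t = rng.normal(size=200)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(200, 2))

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)   # 2x2 covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so the last column is the eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]
print(pc1)  # approximately +/-[0.707, 0.707], i.e. the (1, 1) direction
```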
Tensors generalize vectors and matrices to any number of dimensions.
The mathematics of change. Derivatives tell us how to adjust parameters to improve our models.
The limit describes what happens as we approach a value. The derivative is defined as a limit.
The derivative is the slope of a function at a point.
| Function | Derivative |
|---|---|
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln(x)$ | $\frac{1}{x}$ |
| $\sin(x)$ | $\cos(x)$ |
| $\sigma(x)$ (sigmoid) | $\sigma(x)(1-\sigma(x))$ |
The chain rule is the foundation of backpropagation.
For functions with multiple inputs, the partial derivative measures change with respect to one input, holding others constant.
The gradient is a vector of all partial derivatives. It points in the direction of steepest increase. To minimize loss, we go opposite.
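Stepping opposite the gradient is all that gradient descent does. A minimal sketch on $f(x) = (x - 3)^2$, whose gradient is $2(x - 3)$ and whose minimum is at $x = 3$:

```python
def grad(x):
    # Gradient of f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

x = 0.0
lr = 0.1  # learning rate (eta)
for _ in range(100):
    x = x - lr * grad(x)  # move opposite the gradient

print(x)  # converges to approximately 3.0
```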
Integration computes areas under curves. In probability, it gives us cumulative probabilities.
Jacobian: Matrix of first derivatives
Hessian: Matrix of second derivatives (curvature)
Positive definite Hessian → convex (good for optimization).
Applying the chain rule backward through the computational graph to compute gradients.
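To make this concrete, here is backpropagation done by hand for a tiny graph, $z = wx$, $a = \max(0, z)$, $L = (a - y)^2$: the forward pass stores intermediates, and the backward pass multiplies local derivatives right to left.

```python
x, y = 2.0, 1.0
w = 3.0

# Forward pass: compute and store every intermediate value.
z = w * x            # 6.0
a = max(0.0, z)      # 6.0 (ReLU)
L = (a - y) ** 2     # 25.0

# Backward pass: chain rule from the loss back to the parameter.
# dL/dw = dL/da * da/dz * dz/dw
dL_da = 2.0 * (a - y)           # 10.0
da_dz = 1.0 if z > 0 else 0.0   # ReLU passes gradient only where z > 0
dz_dw = x                       # 2.0
dL_dw = dL_da * da_dz * dz_dw   # 20.0

print(dL_dw)  # 20.0
```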
Quantifying uncertainty. Essential for classification, predictions, and understanding model confidence.
$P(A)$ is the probability of event A, from 0 (impossible) to 1 (certain).
A random variable assigns numbers to outcomes. Can be discrete or continuous.
| Distribution | Type | Key Use |
|---|---|---|
| Normal/Gaussian | Continuous | Noise, weight init (via Xavier/He) |
| Uniform | Continuous | Random sampling, initialization |
| Bernoulli | Discrete | Binary outcomes, dropout |
| Categorical | Discrete | Multi-class classification |
| Multinomial | Discrete | Word frequencies, counts |
| Beta | Continuous | Probabilities (0 to 1) |
| Dirichlet | Continuous | Distribution over distributions |
Covariance: $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$
Update beliefs based on evidence:
Find parameters that maximize the probability of observed data.
Equivalent to minimizing negative log-likelihood (cross-entropy).
Finding the best parameters. The algorithms that enable machines to learn from data.
| Variant | Update | Pros/Cons |
|---|---|---|
| Batch | Full dataset | Stable, slow |
| Stochastic (SGD) | One sample | Fast, noisy |
| Mini-batch | 32-512 samples | Best balance |
The step size $\eta$ (eta) is crucial. Too large: divergence. Too small: slow convergence.
Accumulate velocity in directions of consistent gradient.
Helps escape shallow local minima and accelerates through flat regions.
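The update itself is two lines. A sketch on a toy quadratic loss $f(x) = x^2$ (the loss is chosen only for illustration):

```python
def grad(x):
    # Gradient of f(x) = x^2
    return 2.0 * x

x, v = 5.0, 0.0
lr, beta = 0.1, 0.9   # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v + grad(x)  # accumulate velocity from past gradients
    x = x - lr * v          # step along the velocity, not the raw gradient

print(x)  # approaches the minimum at 0.0
```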
| Optimizer | Key Feature |
|---|---|
| AdaGrad | Per-parameter LR based on history |
| RMSprop | Moving average of squared gradients |
| Adam | Momentum + RMSprop (most popular) |
| AdamW | Adam with proper weight decay |
| LAMB | Layer-wise LR for large batches |
Add a penalty term to the loss: $\lambda \|\theta\|_1$ (sparse weights) or $\lambda \|\theta\|_2^2$ (small weights)
Randomly zero neurons during training. Prevents co-adaptation.
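In the common "inverted dropout" form, surviving activations are rescaled by $\frac{1}{1-p}$ so the expected value is unchanged between training and inference. A sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, p=0.5):
    # Zero each activation with probability p, then rescale survivors
    # so that E[output] equals the input (inverted dropout).
    mask = (rng.random(a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)   # Hadamard product with the random mask

a = np.ones(10000)
out = dropout(a, p=0.5)
print((out == 0).mean())  # roughly 0.5 of units are zeroed
print(out.mean())         # roughly 1.0, matching the input in expectation
```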
Stop training when validation loss stops improving.
Normalize layer inputs. Stabilizes training, allows higher LR.
Use curvature information (Hessian) for faster convergence.
Newton's method is expensive ($O(n^3)$). L-BFGS approximates Hessian efficiently.
Information theory, numerical methods, and specialized techniques for modern deep learning.
Cross-entropy is the standard loss for classification. Minimizing it = making predicted distribution match true distribution.
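For a one-hot true label, cross-entropy $H(p, q) = -\sum_i p_i \log q_i$ reduces to the negative log-probability assigned to the correct class. A sketch comparing a confident-correct and a confident-wrong prediction:

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    # Clip predictions away from 0 to avoid log(0).
    q = np.clip(q_pred, eps, 1.0)
    return -np.sum(p_true * np.log(q))

label = np.array([0.0, 1.0, 0.0])   # true class is index 1

good = np.array([0.1, 0.8, 0.1])    # high probability on the true class
bad = np.array([0.6, 0.2, 0.2])     # low probability on the true class

print(cross_entropy(label, good))   # -log(0.8), about 0.223
print(cross_entropy(label, bad))    # -log(0.2), about 1.609
```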
Lagrange multipliers for optimization with constraints. Used in SVMs and when parameters must satisfy conditions.
Automatic differentiation computes exact derivatives by traversing the computational graph.
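One way to see how this differs from numerical approximation is forward-mode autodiff with dual numbers, sketched below: every value carries a (value, derivative) pair, and each operation applies the exact chain rule, so there is no finite-difference error.

```python
import math

class Dual:
    """A (value, derivative) pair propagated through each operation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dexp(d):
    # Chain rule for exp: (e^u)' = e^u * u'
    e = math.exp(d.val)
    return Dual(e, e * d.dot)

# d/dx [x * exp(x)] at x = 1 is exp(1) + 1*exp(1) = 2e
x = Dual(1.0, 1.0)   # seed: dx/dx = 1
y = x * dexp(x)
print(y.dot)  # 2e, about 5.4366, exact to machine precision
```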