Complete Visual Guide

Mathematics for Intelligence

From basic algebra to advanced optimization — every mathematical concept you need for modern machine learning, explained visually.


Foundations

Essential algebraic concepts that serve as building blocks for all machine learning mathematics.

01

Variables & Notation

Beginner

In ML, x represents input features, y is the target output, and θ (theta) denotes learnable parameters. Vectors are written in bold lowercase ($\mathbf{v}$); matrices are written as capital letters ($W$).

$$y = f(x; \theta)$$

Read as: "y is a function of x, parameterized by theta."

02

Functions

Beginner

A function maps inputs to outputs. Neural networks are compositions of simple functions.

| Function | Formula | Use |
| --- | --- | --- |
| Linear | $f(x) = ax + b$ | Regression |
| ReLU | $\max(0, x)$ | Hidden layers |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | Binary probabilities |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Recurrent nets |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | Multi-class |
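The functions in this table can be sketched in a few lines of NumPy (a minimal illustration; NumPy itself is an assumption, not part of the guide):

```python
import numpy as np

def relu(x):
    # max(0, x), elementwise
    return np.maximum(0, x)

def sigmoid(x):
    # 1 / (1 + e^{-x}), squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # subtract the max for numerical stability; output sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-1.0, 0.0, 1.0])
print(relu(x))           # [0. 0. 1.]
print(sigmoid(0.0))      # 0.5
print(softmax(x).sum())  # ≈ 1.0
```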
03

Activation Functions Visualized

Beginner

Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.

Interactive Plot
04

Exponentials & Logarithms

Intermediate

Exponentials ($e^x$) model growth and appear in softmax. Logarithms turn products into sums, essential for loss functions.

$$\log(ab) = \log(a) + \log(b)$$
$$\frac{d}{dx}e^x = e^x$$
Why this matters

Cross-entropy uses logs to prevent numerical underflow when multiplying many small probabilities.
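The underflow problem is easy to demonstrate with NumPy (an illustrative sketch; the probabilities are made up):

```python
import numpy as np

# 1000 small probabilities, e.g. per-token likelihoods
probs = np.full(1000, 0.01)

direct = np.prod(probs)            # 0.01**1000 underflows to 0.0 in float64
log_total = np.sum(np.log(probs))  # stays finite: 1000 * log(0.01)

print(direct)     # 0.0
print(log_total)  # ≈ -4605.17
```

Summing logs preserves the information the direct product destroys.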

05

Sets & Logic

Beginner

Understanding set notation helps read ML papers. Common sets:

  • $\mathbb{R}$ — real numbers (continuous values)
  • $\mathbb{R}^n$ — n-dimensional vectors
  • $\mathbb{R}^{m \times n}$ — matrices with m rows, n columns
  • $\{0, 1\}$ — binary set (classification labels)

Linear Algebra

The language of data. Vectors, matrices, and tensors — how we represent and transform information in ML.

01

Vectors

Beginner

A vector is an ordered list of numbers. One data point = one vector. An image with 784 pixels becomes a vector in $\mathbb{R}^{784}$.

$$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n$$

Dot product measures similarity: $\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i$
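A quick NumPy sketch of the dot product as similarity (cosine similarity, the normalized dot product, is an addition here for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product normalized by vector lengths: 1 = same direction, 0 = orthogonal
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

print(a @ b)                    # 28.0
print(cosine_similarity(a, b))  # ≈ 1.0
print(cosine_similarity(a, c))  # ≈ 0.0
```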

02

Vector Operations

Beginner
| Operation | Notation | ML Use |
| --- | --- | --- |
| Addition | $\mathbf{u} + \mathbf{v}$ | Residual connections |
| Dot product | $\mathbf{u} \cdot \mathbf{v}$ | Attention, similarity |
| Hadamard | $\mathbf{u} \circ \mathbf{v}$ | Dropout masking |
| Outer product | $\mathbf{u} \mathbf{v}^T$ | Gradient computation |
| Norm | $\|\mathbf{v}\|_2$ | Regularization |
03

Matrices

Beginner

A matrix is a 2D array. Your dataset is a matrix where rows are samples and columns are features.

$$A \in \mathbb{R}^{m \times n}$$

Matrix multiplication is the fundamental operation of neural networks: $\mathbf{y} = W\mathbf{x} + \mathbf{b}$
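That operation in NumPy, for a single dense layer (shapes are illustrative; the random weights stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense layer mapping 4 input features to 3 outputs
W = rng.standard_normal((3, 4))  # weights: (out_features, in_features)
b = np.zeros(3)                  # bias
x = rng.standard_normal(4)       # one input vector

y = W @ x + b                    # the core neural-network operation
print(y.shape)  # (3,)
```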

04

Linear Transformations

Beginner

Matrices transform space. Click to see how different matrices rotate, scale, or shear the grid.

Matrix Visualization
05

Special Matrices

Intermediate
| Type | Property | Use |
| --- | --- | --- |
| Identity ($I$) | $AI = A$ | Skip connections |
| Diagonal | Non-zero on diagonal only | Scaling, covariance |
| Symmetric | $A = A^T$ | Distance matrices |
| Orthogonal | $A^T A = I$ | Rotations, PCA |
| Positive definite | $\mathbf{x}^T A \mathbf{x} > 0$ | Convex optimization |
06

Matrix Decompositions

Advanced

Breaking matrices into simpler components:

  • LU: $A = LU$ for solving linear systems
  • Cholesky: $A = LL^T$ for positive definite matrices
  • Eigendecomposition: $A = Q\Lambda Q^{-1}$
  • SVD: $A = U\Sigma V^T$ — most important for ML
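A minimal SVD sketch with NumPy, showing the factorization and a rank-1 approximation (the random matrix is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# A = U Σ V^T; full_matrices=False gives the compact factorization
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # True

# Keep only the largest singular value: the best rank-1 approximation of A
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])
```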
07

Eigenvalues & Eigenvectors

Advanced

Eigenvectors are directions that only stretch (not rotate) when transformed. The stretch amount is the eigenvalue.

$$A\mathbf{v} = \lambda\mathbf{v}$$
PCA Application

Find the eigenvectors of the covariance matrix. The eigenvector with the largest eigenvalue is the direction of maximum variance: your first principal component.
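The PCA recipe above, sketched in NumPy on synthetic correlated data (the data-generating choices here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: second feature is roughly twice the first
x1 = rng.standard_normal(500)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.standard_normal(500)])

cov = np.cov(X, rowvar=False)           # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices

# eigh returns eigenvalues in ascending order; last column = first PC
pc1 = eigvecs[:, -1]
print(pc1)  # roughly proportional to [1, 2], up to sign
```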

08

Tensors

Intermediate

Tensors generalize vectors and matrices to any number of dimensions.

  • 0D: Scalar (number)
  • 1D: Vector (sequence)
  • 2D: Matrix (grid)
  • 3D: Tensor (e.g., RGB image: height × width × channels)
  • 4D: Batch of images (batch × height × width × channels)

Calculus

The mathematics of change. Derivatives tell us how to adjust parameters to improve our models.

01

Limits

Beginner

The limit describes what happens as we approach a value. The derivative is defined as a limit.

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
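The limit definition suggests a numerical check: pick a small $h$ and evaluate the quotient. This sketch uses the central difference, a slightly more accurate variant of the formula above:

```python
def numerical_derivative(f, x, h=1e-6):
    # central difference: averages the forward and backward quotients
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx x^2 = 2x, so the slope at x = 3 should be 6
print(numerical_derivative(lambda x: x**2, 3.0))  # ≈ 6.0
```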
02

Derivatives

Beginner

The derivative is the slope of a function at a point.

| Function | Derivative |
| --- | --- |
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln(x)$ | $\frac{1}{x}$ |
| $\sin(x)$ | $\cos(x)$ |
| $\sigma(x)$ (sigmoid) | $\sigma(x)(1-\sigma(x))$ |
03

Derivative Rules

Beginner
Sum: $(f + g)' = f' + g'$
Product: $(fg)' = f'g + fg'$
Quotient: $(\frac{f}{g})' = \frac{f'g - fg'}{g^2}$
Chain: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$

The chain rule is the foundation of backpropagation.

04

Partial Derivatives

Intermediate

For functions with multiple inputs, the partial derivative measures change with respect to one input, holding others constant.

$$\frac{\partial f}{\partial x_i}$$
05

Gradients & Gradient Descent

Intermediate

The gradient is a vector of all partial derivatives. It points in the direction of steepest increase. To minimize loss, we go opposite.

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla L$$
Gradient Descent Visualization
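The update rule in action, minimizing a simple quadratic loss (the loss function and learning rate are illustrative choices):

```python
# Minimize L(θ) = (θ - 4)^2 with gradient descent; ∇L = 2(θ - 4)
theta = 0.0
eta = 0.1   # learning rate

for _ in range(100):
    grad = 2 * (theta - 4)
    theta = theta - eta * grad  # step opposite the gradient

print(theta)  # ≈ 4.0, the minimum
```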
06

Integrals

Intermediate

Integration computes areas under curves. In probability, it gives us cumulative probabilities.

$$P(a \leq X \leq b) = \int_a^b p(x) dx$$
07

Multivariable Calculus

Advanced

Jacobian: Matrix of first derivatives
Hessian: Matrix of second derivatives (curvature)

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

A positive definite Hessian means the function is locally convex, which is good for optimization.

08

Backpropagation

Advanced

Applying the chain rule backward through the computational graph to compute gradients.

# Forward pass
z = W @ x + b           # linear layer
a = sigmoid(z)          # activation
loss = (a - y)**2       # squared error

# Backward pass (chain rule)
d_loss = 2 * (a - y)
da_dz = a * (1 - a)     # sigmoid derivative
dL_dz = d_loss * da_dz  # chain rule
dL_dW = dL_dz @ x.T     # gradient for the weights

Probability

Quantifying uncertainty. Essential for classification, predictions, and understanding model confidence.

01

Probability Basics

Beginner

$P(A)$ is the probability of event A, from 0 (impossible) to 1 (certain).

  • Joint: $P(A, B)$ — both occur
  • Marginal: $P(A) = \sum_B P(A, B)$
  • Conditional: $P(A|B) = \frac{P(A, B)}{P(B)}$
  • Independence: $P(A, B) = P(A)P(B)$
02

Random Variables

Beginner

A random variable assigns a number to each outcome of a random process. It can be discrete or continuous.

PMF (discrete): $P(X = x)$
PDF (continuous): $p(x)$ where $\int p(x)dx = 1$
CDF: $F(x) = P(X \leq x)$
03

Probability Distributions

Intermediate
Distribution Visualizer
| Distribution | Type | Key Use |
| --- | --- | --- |
| Normal/Gaussian | Continuous | Noise, weight init (via Xavier/He) |
| Uniform | Continuous | Random sampling, initialization |
| Bernoulli | Discrete | Binary outcomes, dropout |
| Categorical | Discrete | Multi-class classification |
| Multinomial | Discrete | Word frequencies, counts |
| Beta | Continuous | Probabilities (0 to 1) |
| Dirichlet | Continuous | Distribution over distributions |
04

Expectation & Variance

Intermediate
$$\mathbb{E}[X] = \sum_x x \cdot P(X=x)$$
$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

Covariance: $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$
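Both formulas, checked on a fair six-sided die (a standard worked example, added here for concreteness):

```python
import numpy as np

# Fair die: outcomes 1..6, each with probability 1/6
x = np.arange(1, 7)
p = np.full(6, 1 / 6)

mean = np.sum(x * p)              # E[X]
var = np.sum(x**2 * p) - mean**2  # E[X^2] - E[X]^2

print(mean)  # 3.5
print(var)   # ≈ 2.9167 (exactly 35/12)
```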

05

Bayes' Theorem

Advanced

Update beliefs based on evidence:

$$P(\theta|D) = \frac{P(D|\theta) \cdot P(\theta)}{P(D)}$$
  • Prior $P(\theta)$: Belief before data
  • Likelihood $P(D|\theta)$: Probability of data
  • Posterior $P(\theta|D)$: Belief after data
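A classic numeric example of the update (the test accuracies and the 1% prior are illustrative assumptions):

```python
# Rare condition: prior 1%; test is 95% sensitive with a 10% false-positive rate
prior = 0.01            # P(θ)
likelihood = 0.95       # P(D | θ): positive test given the condition
false_positive = 0.10   # P(D | ¬θ): positive test without the condition

# Evidence P(D): marginalize over both hypotheses
evidence = likelihood * prior + false_positive * (1 - prior)
posterior = likelihood * prior / evidence   # P(θ | D)

print(posterior)  # ≈ 0.0876: under 9% even after a positive test
```

The prior dominates when the condition is rare, which is exactly what Bayes' theorem quantifies.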
06

Maximum Likelihood

Advanced

Find parameters that maximize the probability of observed data.

$$\theta_{MLE} = \arg\max_\theta P(D|\theta)$$

Equivalent to minimizing negative log-likelihood (cross-entropy).

Optimization

Finding the best parameters. The algorithms that enable machines to learn from data.

01

Gradient Descent Variants

Beginner
| Variant | Update | Pros/Cons |
| --- | --- | --- |
| Batch | Full dataset | Stable, slow |
| Stochastic (SGD) | One sample | Fast, noisy |
| Mini-batch | 32-512 samples | Best balance |
02

Learning Rate

Beginner

The step size $\eta$ (eta) is crucial. Too large: divergence. Too small: slow convergence.

  • Decay: $\eta_t = \eta_0 / (1 + \gamma t)$
  • Step decay: Drop by factor every N epochs
  • Cosine annealing: Smooth decay
  • Warmup: Start small, increase
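Two of these schedules as plain functions (a sketch; the constants and the warmup-then-decay combination are illustrative, not a standard recipe):

```python
def inverse_decay(eta0, gamma, t):
    # η_t = η_0 / (1 + γ t)
    return eta0 / (1 + gamma * t)

def warmup_then_decay(eta_max, warmup_steps, t):
    # linear warmup to eta_max, then inverse decay of the remainder
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps
    return eta_max / (1 + 0.01 * (t - warmup_steps))

print(inverse_decay(0.1, 0.1, 0))   # 0.1
print(inverse_decay(0.1, 0.1, 90))  # 0.01
```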
03

Momentum

Intermediate

Accumulate velocity in directions of consistent gradient.

$$v_t = \gamma v_{t-1} + \eta \nabla L$$
$$\theta = \theta - v_t$$

Helps escape shallow local minima and accelerates through flat regions.
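The two update equations in code, on a simple quadratic (loss function and hyperparameters are illustrative choices):

```python
# Momentum on L(θ) = θ^2, whose gradient is 2θ
theta, v = 5.0, 0.0
eta, gamma = 0.05, 0.9   # learning rate, momentum coefficient

for _ in range(200):
    grad = 2 * theta
    v = gamma * v + eta * grad   # accumulate velocity
    theta = theta - v

print(theta)  # ≈ 0.0, the minimum
```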

04

Adaptive Optimizers

Advanced
| Optimizer | Key Feature |
| --- | --- |
| AdaGrad | Per-parameter LR based on history |
| RMSprop | Moving average of squared gradients |
| Adam | Momentum + RMSprop (most popular) |
| AdamW | Adam with proper weight decay |
| LAMB | Layer-wise LR for large batches |
05

Regularization Techniques

Intermediate

L1 / L2 Regularization

Add penalty to loss: $\lambda \|\theta\|_1$ (sparse) or $\lambda \|\theta\|_2^2$ (small weights)

Dropout

Randomly zero neurons during training. Prevents co-adaptation.

Early Stopping

Stop training when validation loss stops improving.

Batch Normalization

Normalize layer inputs. Stabilizes training, allows higher LR.

06

Second-Order Methods

Advanced

Use curvature information (Hessian) for faster convergence.

$$\theta_{new} = \theta_{old} - H^{-1}\nabla L$$

Newton's method is expensive ($O(n^3)$). L-BFGS approximates Hessian efficiently.
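For a quadratic loss the Hessian is constant and Newton's method lands on the minimum in one step, which this scalar sketch shows (the loss is an illustrative choice):

```python
# Newton's method on L(θ) = (θ - 3)^2: gradient 2(θ - 3), Hessian 2
theta = 0.0
theta = theta - (2 * (theta - 3)) / 2   # θ ← θ - H⁻¹∇L, scalar case

print(theta)  # 3.0: the exact minimum in a single step
```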

Advanced Topics

Information theory, numerical methods, and specialized techniques for modern deep learning.

01

Information Theory

Advanced
Entropy: $H(X) = -\sum p(x)\log p(x)$
KL Divergence: $D_{KL}(p\|q) = \sum p(x)\log\frac{p(x)}{q(x)}$
Cross-Entropy: $H(p,q) = H(p) + D_{KL}(p\|q)$

Cross-entropy is the standard loss for classification. Minimizing it makes the predicted distribution match the true distribution.
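All three quantities, and the identity relating them, in a few lines of NumPy (the example distributions are made up):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution

# Identity from above: H(p, q) = H(p) + D_KL(p || q)
print(np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))  # True
```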

02

Numerical Stability

Intermediate
  • Log-sum-exp trick: Prevent overflow in softmax
  • Xavier/He init: Keep variance stable through layers
  • Gradient clipping: Prevent exploding gradients
  • Mixed precision: FP16 training with FP32 master
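The log-sum-exp trick from the first bullet, as a standalone function (a minimal sketch; libraries such as SciPy ship a production version):

```python
import numpy as np

def log_sum_exp(x):
    # shift by the max so np.exp never overflows
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
# The naive np.log(np.sum(np.exp(x))) overflows to inf here
print(log_sum_exp(x))  # ≈ 1002.41
```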
03

Constrained Optimization

Advanced

Lagrange multipliers for optimization with constraints. Used in SVMs and when parameters must satisfy conditions.

$$\mathcal{L}(x, \lambda) = f(x) - \lambda g(x)$$
04

AutoDiff

Advanced

Automatic differentiation computes exact derivatives by traversing the computational graph.

  • Forward mode: One pass per input
  • Reverse mode: One pass per output (backprop)
  • Used in PyTorch, TensorFlow, JAX
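Forward mode can be sketched with dual numbers, which carry a value and its derivative through every operation (a toy illustration, not how the frameworks above implement it):

```python
class Dual:
    """Dual number (value, derivative) for forward-mode autodiff."""
    def __init__(self, val, grad):
        self.val, self.grad = val, grad

    def __add__(self, other):
        # sum rule carried alongside the value
        return Dual(self.val + other.val, self.grad + other.grad)

    def __mul__(self, other):
        # product rule carried alongside the value
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

# d/dx (x*x + x) at x = 3  →  2x + 1 = 7
x = Dual(3.0, 1.0)   # seed derivative dx/dx = 1
c = Dual(1.0, 0.0)   # constants carry zero derivative
y = x * x + x * c

print(y.val, y.grad)  # 12.0 7.0
```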