Python Data Analysis

Master Data
Manipulation

A visual, interactive guide to Pandas. From core structures to complex transformations, simplified for modern data workflows.

Official Docs ↗

Getting Started

Pandas bridges the gap between raw data and actionable insights. It provides two primary tools: Series (1D) and DataFrames (2D).

Series

A single column. Think of it as a labeled list or an array with an index.

[10, 20, 30, 40]
Index: [0, 1, 2, 3]
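A minimal sketch of the Series above (values are illustrative):

```python
import pandas as pd

# A Series pairs values with an index (default: 0, 1, 2, ...)
s = pd.Series([10, 20, 30, 40])

print(s.iloc[2])  # access a value by position
print(s.sum())    # vectorized aggregation over the values
```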

DataFrame

A table: conceptually, a dictionary of Series that share a common index. The standard structure for data science.

Columns: Name, Age, City
Rows: Indexed observations
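The "dictionary of Series" idea can be sketched directly (sample names and values are made up):

```python
import pandas as pd

# Each dict key becomes a column; each value list becomes that column's data
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age":  [30, 25],
    "City": ["NYC", "LA"],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels
```

Selecting a single column (e.g. `df["Age"]`) gives you back a Series.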

$ Installation

# Using pip

pip install pandas

# Standard Import

import pandas as pd

Core Structures

Visualizing a DataFrame

A DataFrame is constructed from rows and columns. Unlike a simple array, it has labeled axes (Index for rows, Column names for columns).

pd.DataFrame(data) builds the table: a row Index plus labeled Data columns.

Exploratory Analysis

.head()

Preview the first 5 rows (by default) for a quick glance at the data structure.

.info()

Get a concise summary: column types, non-null counts, and memory usage.

.describe()

Generate descriptive statistics (mean, std, min, quartiles) for numeric columns.

Example Workflow

# 1. Load Data
df = pd.read_csv('data.csv')

# 2. Quick Health Check
df.info()      # Check for missing values & types
df.describe()  # Statistical summary

# 3. Check for Duplicates
df.duplicated().sum()
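The duplicate check from step 3 can be tried on an in-memory frame (toy data, so no CSV is needed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# duplicated() marks rows that repeat an earlier row; sum() counts them
print(df.duplicated().sum())  # 1
```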

Selection & Filtering

Label vs. Position

  • .loc[] (label-based): select by row/column names, e.g. df.loc[row_label, col_label]
  • .iloc[] (integer-based): select by index position (0-based), e.g. df.iloc[0, 1]
  • [] (direct column access): quick column selection, e.g. df['col']
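A small sketch contrasting the three accessors; the string index makes the label/position distinction visible (sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [30, 25], "city": ["NYC", "LA"]},
    index=["alice", "bob"],
)

print(df.loc["bob", "city"])  # label-based lookup
print(df.iloc[1, 1])          # position-based lookup of the same cell
print(df["age"].tolist())     # direct column access returns a Series
```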
# Boolean Filtering
adults = df[df['age'] > 18]

# Multiple Conditions (use & and |)
result = df[(df['age'] > 25) & 
            (df['city'] == 'NYC')]

# The .query() method (cleaner syntax)
df.query("age > 25 and city == 'NYC'")
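The filtering patterns above, run end to end on a toy frame (column names are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 17, 40], "city": ["NYC", "NYC", "LA"]})

# Boolean mask: keeps rows where the condition is True
adults = df[df["age"] > 18]

# .query() expresses the same logic as a string expression
nyc = df.query("age > 25 and city == 'NYC'")

print(len(adults), len(nyc))
```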

Data Cleaning

Handling Missing Values

NaN

Real-world data is messy. Pandas represents missing data as NaN (Not a Number).

  • Detect: df.isnull().sum()
  • Remove: df.dropna()
  • Fill: df.fillna(0), or forward-fill with df.ffill() (the older fillna(method='ffill') form is deprecated)
Before    After fillna(0)
10        10
NaN       0
30        30
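The detect/remove/fill trio on the values from the table above:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, 30])

print(s.isnull().sum())     # detect: count missing values
print(s.dropna().tolist())  # remove: drop NaN entries
print(s.fillna(0).tolist()) # fill: replace NaN with a constant
print(s.ffill().tolist())   # fill: carry the previous value forward
```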

Type Conversion

.astype()

Ensure your data is computable. Strings that look like numbers can't be summed.

# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to DateTime
df['date'] = pd.to_datetime(df['date'])
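A runnable sketch of the coercion behavior (the 'price' column and its values are made up):

```python
import pandas as pd

df = pd.DataFrame({"price": ["19.99", "n/a", "5"]})

# errors='coerce' turns unparsable strings into NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].tolist())  # [19.99, nan, 5.0]
```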

Reshaping Data

Data often comes in "Wide" format (human-readable) but needs to be "Long" format (machine-readable) for analysis. Melt unpivots, Pivot reconstructs.

Wide Format

Name   Math  Sci
Alice  90    95
Bob    85    88

↓ Melt

Long Format

Name   Subject  Score
Alice  Math     90
Alice  Sci      95
Bob    Math     85
Bob    Sci      88
df.melt(id_vars=['Name'], value_vars=['Math', 'Sci'])
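The round trip can be sketched end to end: melt unpivots the wide table above, and pivot reconstructs it (var_name/value_name are optional labels for the new columns):

```python
import pandas as pd

wide = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Math": [90, 85],
    "Sci":  [95, 88],
})

# Unpivot: one row per (Name, Subject) pair
long = wide.melt(id_vars=["Name"], value_vars=["Math", "Sci"],
                 var_name="Subject", value_name="Score")

# Pivot reconstructs the wide shape from the long one
back = long.pivot(index="Name", columns="Subject", values="Score")
print(back)
```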

Aggregation

The Split-Apply-Combine Strategy

Grouping allows you to bucket data by category and compute summary statistics (mean, sum, count) for each bucket.

1. Split: group rows by a column (e.g., Department)
2. Apply: run a function on each group (e.g., mean)
3. Combine: return the results as a new structure
# Basic Grouping
df.groupby('Department')['Salary'].mean()

# Multiple Aggregations
df.groupby('Team').agg({
    'Salary': ['mean', 'max'],
    'Age': 'min'
})
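The basic grouping above, made runnable on a toy frame (department names and salaries are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Eng", "Eng", "Sales"],
    "Salary": [100, 120, 90],
})

# Split by Department, apply mean to Salary, combine into a Series
by_dept = df.groupby("Department")["Salary"].mean()
print(by_dept["Eng"])  # 110.0
```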

Merging

  • Inner: keep only matching keys
  • Left: keep all left rows
  • Right: keep all right rows
  • Outer: keep all rows
pd.merge(df1, df2, on='key', how='left')

Equivalent to SQL JOIN operations. pd.concat is used to stack DataFrames vertically or horizontally.
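A runnable sketch of both operations on toy frames (keys and column names are made up for the example):

```python
import pandas as pd

left  = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "c"], "y": [10, 30]})

# Left join: every row of `left` survives; unmatched keys get NaN
merged = pd.merge(left, right, on="key", how="left")
print(merged)

# concat stacks frames vertically (axis=0 by default)
stacked = pd.concat([left, left], ignore_index=True)
print(stacked.shape)  # (4, 2)
```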