Master Data Manipulation
A visual, interactive guide to Pandas. From core structures to complex transformations, simplified for modern data workflows.
Getting Started
Pandas bridges the gap between raw data and actionable insights. It provides two primary tools: Series (1D) and DataFrames (2D).
Series
A single column. Think of it as a labeled list or an array with an index.
Index: [0, 1, 2, 3]
DataFrame
A table. A dictionary of Series aligned together. The standard for data science.
Rows: Indexed observations
$ Installation
# Using pip
pip install pandas
# Standard Import
import pandas as pd
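A minimal sketch of the two core structures in action (the column names are illustrative):

```python
import pandas as pd

# A Series: one labeled column of data
s = pd.Series([10, 20, 30, 40])
print(s.index.tolist())   # [0, 1, 2, 3]

# A DataFrame: a dictionary of Series aligned on a shared index
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [30, 25],
})
print(df.shape)           # (2, 2) -> 2 rows, 2 columns
```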
Core Structures
Visualizing a DataFrame
A DataFrame is constructed from rows and columns. Unlike a simple array, it has labeled axes (Index for rows, Column names for columns).
Exploratory Analysis
.head()
Preview the first 5 rows to get a quick glance at the data structure.
.info()
Get concise summary: column types, non-null counts, and memory usage.
.describe()
Generate descriptive statistics (mean, std, min, quartiles) for numeric columns.
Example Workflow
# 1. Load Data
df = pd.read_csv('data.csv')
# 2. Quick Health Check
df.info() # Check for missing values & types
df.describe() # Statistical summary
# 3. Check for Duplicates
df.duplicated().sum()
Selection & Filtering
Label vs. Position
.loc[]
Label-based: select by row/column names.
df.loc[row_label, col_label]
.iloc[]
Integer-based: select by index position (0-based).
df.iloc[0, 1]
[]
Direct column access: quick column selection.
df['col']
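The three accessors side by side, on a tiny illustrative frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [30, 25], "city": ["NYC", "LA"]},
    index=["alice", "bob"],
)

# .loc: label-based (row "alice", column "age")
print(df.loc["alice", "age"])   # 30

# .iloc: position-based (row 0, column 1)
print(df.iloc[0, 1])            # NYC

# []: direct column selection returns a Series
print(df["city"].tolist())      # ['NYC', 'LA']
```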
# Boolean Filtering
adults = df[df['age'] > 18]
# Multiple Conditions (use & and |)
result = df[(df['age'] > 25) &
(df['city'] == 'NYC')]
# The .query() method (cleaner syntax)
df.query("age > 25 and city == 'NYC'")
Data Cleaning
Handling Missing Values
Real-world data is messy. Pandas represents missing data as NaN (Not a Number).
- Detect: df.isnull().sum()
- Remove: df.dropna()
- Fill: df.fillna(0) or df.ffill() (forward fill)
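The detect/remove/fill steps above, run on a small frame with two gaps:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"price": [10.0, np.nan, 12.0], "qty": [1.0, 2.0, np.nan]})

# Detect: count missing values per column
missing = df.isnull().sum()
print(missing.tolist())          # [1, 1]

# Remove: drop any row containing a NaN (only the first row survives)
print(len(df.dropna()))          # 1

# Fill: replace NaN with a constant; df.ffill() would carry the last
# valid value forward instead
filled = df.fillna(0)
print(filled["price"].tolist())  # [10.0, 0.0, 12.0]
```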
Type Conversion
.astype()
Ensure your data is computable. Strings that look like numbers can't be summed.
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Convert to DateTime
df['date'] = pd.to_datetime(df['date'])
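A runnable sketch of both conversions; `errors='coerce'` turns unparseable values into NaN instead of raising:

```python
import pandas as pd

df = pd.DataFrame({"price": ["19.99", "N/A", "5.00"]})

# "N/A" can't be parsed, so it becomes NaN rather than an error
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].isnull().sum())  # 1
print(df["price"].sum())           # NaN values are skipped in the sum

# Datetime conversion unlocks the .dt accessor
dates = pd.to_datetime(pd.Series(["2024-01-01", "2024-06-15"]))
print(dates.dt.year.tolist())      # [2024, 2024]
```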
Reshaping Data
Data often comes in "Wide" format (human-readable) but needs to be "Long" format (machine-readable) for analysis. Melt unpivots, Pivot reconstructs.
Wide Format
| Name | Math | Sci |
|---|---|---|
| Alice | 90 | 95 |
| Bob | 85 | 88 |
Long Format
| Name | Subject | Score |
|---|---|---|
| Alice | Math | 90 |
| Alice | Sci | 95 |
| Bob | Math | 85 |
| Bob | Sci | 88 |
df.melt(id_vars=['Name'], value_vars=['Math', 'Sci'])
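The wide-to-long round trip on the table above; `pivot` reconstructs the wide shape from the melted frame:

```python
import pandas as pd

wide = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Math": [90, 85],
    "Sci": [95, 88],
})

# Wide -> Long: one row per (Name, Subject) pair
long_df = wide.melt(id_vars=["Name"], value_vars=["Math", "Sci"],
                    var_name="Subject", value_name="Score")
print(len(long_df))               # 4

# Long -> Wide: pivot rebuilds one column per subject
back = long_df.pivot(index="Name", columns="Subject", values="Score")
print(back.loc["Alice", "Math"])  # 90
```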
Aggregation
The Split-Apply-Combine Strategy
Grouping allows you to bucket data by category and compute summary statistics (mean, sum, count) for each bucket.
# Basic Grouping
df.groupby('Department')['Salary'].mean()
# Multiple Aggregations
df.groupby('Team').agg({
'Salary': ['mean', 'max'],
'Age': 'min'
})
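The same multi-aggregation pattern with data attached (the Team/Salary/Age values are made up for illustration); note that per-column aggregation lists produce a MultiIndex on the result's columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["A", "A", "B"],
    "Salary": [100, 120, 90],
    "Age": [30, 40, 25],
})

# Split by Team, apply each aggregation per group, combine into one table
out = df.groupby("Team").agg({
    "Salary": ["mean", "max"],
    "Age": "min",
})
print(out.loc["A", ("Salary", "mean")])  # 110.0
print(out.loc["B", ("Age", "min")])      # 25
```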
Merging
pd.merge(df1, df2, on='key', how='left')
Equivalent to SQL JOIN operations. pd.concat is used to stack DataFrames vertically or horizontally.
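A left join and a vertical concat, sketched on two hypothetical frames:

```python
import pandas as pd

employees = pd.DataFrame({"key": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
salaries = pd.DataFrame({"key": [1, 2], "salary": [100, 120]})

# Left join: keep every employees row; unmatched keys get NaN
merged = pd.merge(employees, salaries, on="key", how="left")
print(len(merged))                      # 3
print(merged["salary"].isnull().sum())  # 1 (key 3 had no match)

# concat stacks frames vertically (axis=0) or side by side (axis=1)
stacked = pd.concat([employees, employees], ignore_index=True)
print(len(stacked))                     # 6
```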