Python Data Analysis

Master Data
Manipulation

A visual, interactive guide to Pandas. From core structures to complex transformations, simplified for modern data workflows.

Official Docs ↗

Getting Started

Pandas bridges the gap between raw data and actionable insights. It provides two primary tools: Series (1D) and DataFrames (2D).

Series

A single column. Think of it as a labeled list or an array with an index.

[10, 20, 30, 40]
Index: [0, 1, 2, 3]
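A minimal sketch of the Series above (values are illustrative):

```python
import pandas as pd

# A Series pairs values with an index (default: 0, 1, 2, ...)
s = pd.Series([10, 20, 30, 40])

print(s.iloc[2])  # access a value by position
print(s.sum())    # vectorized aggregation over the values
```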

DataFrame

A table: conceptually, a dictionary of Series that share a common index. The standard structure for data science.

Columns: Name, Age, City
Rows: Indexed observations
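The "dictionary of Series" idea can be sketched directly (sample names and values are made up):

```python
import pandas as pd

# Each dict key becomes a column; each value list becomes that column's data
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age":  [30, 25],
    "City": ["NYC", "LA"],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels
```

Selecting a single column (e.g. `df["Age"]`) gives you back a Series.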

$ Installation

# Using pip

pip install pandas

# Standard Import

import pandas as pd

Core Structures

Visualizing a DataFrame

A DataFrame is constructed from rows and columns. Unlike a simple array, it has labeled axes (Index for rows, Column names for columns).

pd.DataFrame(data) builds the table: a row Index plus labeled Data columns.

Exploratory Analysis

.head()

Preview the first 5 rows (by default) for a quick glance at the data structure.

.info()

Get a concise summary: column types, non-null counts, and memory usage.

.describe()

Generate descriptive statistics (mean, std, min, quartiles) for numeric columns.

Example Workflow

# 1. Load Data
df = pd.read_csv('data.csv')

# 2. Quick Health Check
df.info()      # Check for missing values & types
df.describe()  # Statistical summary

# 3. Check for Duplicates
df.duplicated().sum()
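The duplicate check from step 3 can be tried on an in-memory frame (toy data, so no CSV is needed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# duplicated() marks rows that repeat an earlier row; sum() counts them
print(df.duplicated().sum())  # 1
```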

Selection & Filtering

Label vs. Position

  • .loc[] (label-based): select by row/column names, e.g. df.loc[row_label, col_label]
  • .iloc[] (integer-based): select by index position (0-based), e.g. df.iloc[0, 1]
  • [] (direct column access): quick column selection, e.g. df['col']
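A small sketch contrasting the three accessors; the string index makes the label/position distinction visible (sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [30, 25], "city": ["NYC", "LA"]},
    index=["alice", "bob"],
)

print(df.loc["bob", "city"])  # label-based lookup
print(df.iloc[1, 1])          # position-based lookup of the same cell
print(df["age"].tolist())     # direct column access returns a Series
```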
# Boolean Filtering
adults = df[df['age'] > 18]

# Multiple Conditions (use & and |)
result = df[(df['age'] > 25) & 
            (df['city'] == 'NYC')]

# The .query() method (cleaner syntax)
df.query("age > 25 and city == 'NYC'")
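The filtering patterns above, run end to end on a toy frame (column names are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 17, 40], "city": ["NYC", "NYC", "LA"]})

# Boolean mask: keeps rows where the condition is True
adults = df[df["age"] > 18]

# .query() expresses the same logic as a string expression
nyc = df.query("age > 25 and city == 'NYC'")

print(len(adults), len(nyc))
```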

Data Cleaning

Handling Missing Values

NaN

Real-world data is messy. Pandas represents missing data as NaN (Not a Number).

  • Detect: df.isnull().sum()
  • Remove: df.dropna()
  • Fill: df.fillna(0), or forward-fill with df.ffill() (the older fillna(method='ffill') form is deprecated)
Before    After fillna(0)
10        10
NaN       0
30        30
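The detect/remove/fill trio on the values from the table above:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, 30])

print(s.isnull().sum())     # detect: count missing values
print(s.dropna().tolist())  # remove: drop NaN entries
print(s.fillna(0).tolist()) # fill: replace NaN with a constant
print(s.ffill().tolist())   # fill: carry the previous value forward
```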

Type Conversion

.astype()

Ensure your data is computable. Strings that look like numbers can't be summed.

# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to DateTime
df['date'] = pd.to_datetime(df['date'])
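A runnable sketch of the coercion behavior (the 'price' column and its values are made up):

```python
import pandas as pd

df = pd.DataFrame({"price": ["19.99", "n/a", "5"]})

# errors='coerce' turns unparsable strings into NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].tolist())  # [19.99, nan, 5.0]
```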

Reshaping Data

Data often comes in "Wide" format (human-readable) but needs to be "Long" format (machine-readable) for analysis. Melt unpivots, Pivot reconstructs.

Wide Format

Name   Math  Sci
Alice  90    95
Bob    85    88

↓ Melt

Long Format

Name   Subject  Score
Alice  Math     90
Alice  Sci      95
Bob    Math     85
Bob    Sci      88
df.melt(id_vars=['Name'], value_vars=['Math', 'Sci'])
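The round trip can be sketched end to end: melt unpivots the wide table above, and pivot reconstructs it (var_name/value_name are optional labels for the new columns):

```python
import pandas as pd

wide = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Math": [90, 85],
    "Sci":  [95, 88],
})

# Unpivot: one row per (Name, Subject) pair
long = wide.melt(id_vars=["Name"], value_vars=["Math", "Sci"],
                 var_name="Subject", value_name="Score")

# Pivot reconstructs the wide shape from the long one
back = long.pivot(index="Name", columns="Subject", values="Score")
print(back)
```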

Aggregation

The Split-Apply-Combine Strategy

Grouping allows you to bucket data by category and compute summary statistics (mean, sum, count) for each bucket.

1. Split: group rows by a column (e.g., Department)
2. Apply: run a function on each group (e.g., mean)
3. Combine: return the results as a new structure
# Basic Grouping
df.groupby('Department')['Salary'].mean()

# Multiple Aggregations
df.groupby('Team').agg({
    'Salary': ['mean', 'max'],
    'Age': 'min'
})
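The basic grouping above, made runnable on a toy frame (department names and salaries are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Eng", "Eng", "Sales"],
    "Salary": [100, 120, 90],
})

# Split by Department, apply mean to Salary, combine into a Series
by_dept = df.groupby("Department")["Salary"].mean()
print(by_dept["Eng"])  # 110.0
```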

Merging

  • Inner: keep only matching keys
  • Left: keep all left rows
  • Right: keep all right rows
  • Outer: keep all rows
pd.merge(df1, df2, on='key', how='left')

Equivalent to SQL JOIN operations. pd.concat is used to stack DataFrames vertically or horizontally.
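A runnable sketch of both operations on toy frames (keys and column names are made up for the example):

```python
import pandas as pd

left  = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "c"], "y": [10, 30]})

# Left join: every row of `left` survives; unmatched keys get NaN
merged = pd.merge(left, right, on="key", how="left")
print(merged)

# concat stacks frames vertically (axis=0 by default)
stacked = pd.concat([left, left], ignore_index=True)
print(stacked.shape)  # (4, 2)
```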