📊 Data Visualization Mastery

Master Matplotlib & Seaborn - From Basics to Advanced

🎯 Why Data Visualization Matters

Data visualization is the bridge between raw data and human understanding.

When done right, it helps us:

  • ✨ Reveal patterns, trends, and correlations in the data
  • 💬 Communicate insights clearly to stakeholders
  • ⚡ Speed up decision-making by simplifying complex datasets
  • 📖 Make data storytelling engaging and accessible to all
💡 John Tukey's Wisdom: "The greatest value of a picture is when it forces us to notice what we never expected to see."

Exploratory vs Explanatory Visualizations

Aspect   | Exploratory                      | Explanatory
Goal     | Find insights                    | Communicate insights
Audience | Analyst / Data Scientist         | Stakeholders / Public
Style    | Raw, fast, flexible              | Polished, focused, clean
Examples | Pair plots, correlation heatmaps | Bar charts in presentations

🎨 5 Basic Principles of Good Visualizations

  1. Clarity: Avoid clutter. Use labels, legends, and proper axis scales
  2. Context: What is being measured? Over what time frame? In what units?
  3. Focus: Highlight the key insight using colors and annotations
  4. Storytelling: Don't just show data — tell a story. Guide the viewer
  5. Accessibility: Use colorblind-safe palettes and enough contrast that every viewer can read the chart
Pro Tip: Always ask yourself: "What is the ONE thing I want the viewer to understand from this visual?"
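
The principles above can be sketched on a single chart. This is only an illustration: the monthly sales figures, labels, and highlight color are made-up assumptions, not data from a real source.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration only
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 190, 142, 150]  # thousand USD

fig, ax = plt.subplots(figsize=(7, 4))

# Focus: mute every bar except the one that carries the key insight
peak = sales.index(max(sales))
bar_colors = ["lightgray"] * len(sales)
bar_colors[peak] = "tab:orange"
ax.bar(months, sales, color=bar_colors)

# Context: say what is measured, in what units, over what period
ax.set_xlabel("Month (first half of the year)")
ax.set_ylabel("Sales (thousand USD)")

# Storytelling: the title states the takeaway, the annotation points at it
ax.set_title("April sales peaked well above the other months")
ax.annotate(
    "Peak month",
    xy=(peak, sales[peak]),
    xytext=(peak + 0.6, sales[peak] + 5),
    arrowprops={"arrowstyle": "->"},
)

# Clarity: remove the frame lines that carry no information
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Highlighting with orange on gray avoids relying on a red/green contrast, which is one simple way to address the accessibility point.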

📈 Introduction to Matplotlib

What is matplotlib.pyplot?

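matplotlib.pyplot is Matplotlib's plotting module: a collection of functions such as plot(), xlabel(), and show() that draw on the current figure. Think of it as a paintbrush for your data.

By convention it is imported as plt to save typing!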

import matplotlib.pyplot as plt

# Create a simple plot
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()  # Display the plot

🎮 Interacting with Plots

When a plot appears, you can:

  • 🔍 Zoom In/Out
  • ✋ Pan around
  • ⬅️ Use arrows to navigate history
  • 🏠 Reset to home view
  • 💾 Save as PNG using the disk icon
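
The disk icon has a scripted counterpart in savefig(). A minimal sketch, assuming you want a PNG in the system's temporary folder (the file name simple_plot.png is a made-up example):

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "simple_plot.png")

# dpi sets the resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(out_path, dpi=150, bbox_inches="tight")
```

Saving from code is handy when plots are generated in a script or on a server where no window is shown.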

📊 Real Example: Cricket Player Runs Over Time

years = [1990, 1992, 1994, 1996, 1998, 2000, 2003, 2005, 2007, 2010]
runs = [500, 700, 1100, 1500, 1800, 1200, 1700, 1300, 900, 1500]

plt.plot(years, runs)
plt.xlabel("Year")
plt.ylabel("Runs Scored")
plt.title("Sachin Tendulkar's Yearly Runs")
plt.show()

🎨 Customization Options

Format Strings

plt.plot(years, runs, 'ro--')  # red circles with dashed lines
plt.plot(years, runs, 'g^:')   # green triangles with a dotted line
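
A format string is shorthand for three separate settings: color, marker, and line style. The sketch below draws the same line twice, once with the short form and once with explicit keyword arguments, to show the correspondence (the shortened data lists are just for illustration):

```python
import matplotlib.pyplot as plt

years = [1990, 1992, 1994, 1996, 1998]
runs = [500, 700, 1100, 1500, 1800]

fig, ax = plt.subplots()

# Short form: 'r' = red, 'o' = circle marker, '--' = dashed line
(short_line,) = ax.plot(years, runs, "ro--")

# Long form: the same style spelled out as keyword arguments
(long_line,) = ax.plot(years, runs, color="r", marker="o", linestyle="--")
```

The keyword form is longer but easier to read back later, and it is the only option for colors that have no one-letter code, such as 'orange'.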

Color and Line Styles

plt.plot(years, runs, color='orange', linestyle='--', linewidth=3, label="Player 1")
plt.legend()
plt.grid(True)
plt.tight_layout()

🎭 Plot Styles

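# See all available styles
print(plt.style.available)

# Apply a style (a later call overrides overlapping settings from an earlier one)
plt.style.use("ggplot")
plt.style.use("seaborn-v0_8-bright")  # style name used by Matplotlib 3.6 and later

# XKCD comic style, applied only inside the with block
with plt.xkcd():
    plt.plot(years, runs)
    plt.title("Epic Battle!")
    plt.show()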
💡 Pro Tips:
  • Always start with simple plots
  • Add labels and legends early
  • Use plt.grid() and plt.tight_layout() for readability
  • Try different styles to find what works best

📊 Bar Charts

Bar charts are perfect for comparing quantities across categories. They're easy to read and powerful for visual analysis.

Basic Vertical Bar Chart

years = [1990, 1992, 1994, 1996, 1998, 2000]
runs = [500, 700, 1100, 1500, 1800, 1200]

plt.bar(years, runs, edgecolor='black')
plt.xlabel("Year")
plt.ylabel("Runs Scored")
plt.title("Yearly Performance")
plt.show()

🎯 Side-by-Side Comparison

import numpy as np

years = [1990, 1992, 1994, 1996, 1998]  # one label per group; must match the length of each data list
sachin = [500, 700, 1100, 1500, 1800]
kohli = [0, 500, 800, 1100, 1300]
sehwag = [0, 200, 900, 1400, 1600]

x = np.arange(len(years))  # numeric positions 0, 1, 2, ...
width = 0.25

plt.bar(x - width, sachin, width, label="Sachin")
plt.bar(x, sehwag, width, label="Sehwag")
plt.bar(x + width, kohli, width, label="Kohli")
plt.xticks(x, years)
plt.legend()
plt.show()
🔍 Why use xticks()?
The bars are placed at numeric x-positions (0, 1, 2, ...) so that each group can be shifted left or right by width. plt.xticks() then replaces those positions with the real category labels, such as years or names.
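
The offset arithmetic generalizes to any number of groups. The following is a sketch, not the only way to do it: the dictionary layout and the choice to fill 80% of each slot are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

years = [1990, 1992, 1994, 1996, 1998]
series = {
    "Sachin": [500, 700, 1100, 1500, 1800],
    "Sehwag": [0, 200, 900, 1400, 1600],
    "Kohli": [0, 500, 800, 1100, 1300],
}

x = np.arange(len(years))   # one numeric slot per category: 0, 1, 2, ...
width = 0.8 / len(series)   # the groups share 80% of each slot

fig, ax = plt.subplots()
for k, (name, values) in enumerate(series.items()):
    # Shift each series so the cluster of bars is centered on its slot
    offset = (k - (len(series) - 1) / 2) * width
    ax.bar(x + offset, values, width, label=name)

ax.set_xticks(x)            # ticks sit at the slot centers...
ax.set_xticklabels(years)   # ...and show the real category labels
ax.legend()
```

With three series the offsets work out to -width, 0, and +width, matching the hand-written version; adding a fourth series needs no further changes.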

↔️ Horizontal Bar Charts

players = ["Sachin", "Sehwag", "Kohli", "Yuvraj"]
total_runs = [5600, 4100, 2400, 3700]

plt.barh(players, total_runs, color="skyblue")
plt.xlabel("Total Runs in First 5 Years")
plt.title("Performance Comparison")
plt.show()

📝 Adding Value Labels

players = ["Sachin", "Sehwag", "Kohli"]
runs = [1500, 1200, 1800]

plt.bar(players, runs, color="skyblue")

# Add labels on top
for i in range(len(players)):
    plt.text(i, runs[i] + 50, str(runs[i]), ha='center')

plt.show()
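
As an alternative to the manual loop, Matplotlib 3.4 and later provide Axes.bar_label(), which reads each bar's height and positions the text itself. A minimal sketch, assuming a recent enough Matplotlib:

```python
import matplotlib.pyplot as plt

players = ["Sachin", "Sehwag", "Kohli"]
runs = [1500, 1200, 1800]

fig, ax = plt.subplots()
bars = ax.bar(players, runs, color="skyblue")

# padding is the gap, in points, between the bar top and the label
value_labels = ax.bar_label(bars, padding=3)
```

This avoids hard-coding a vertical offset such as + 50, which would need retuning whenever the y-axis scale changes.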

📋 Quick Reference

Feature      | Use
plt.bar()    | Vertical bars for categorical comparison
plt.barh()   | Horizontal bars (great for long labels)
width=       | Control thickness/spacing of bars
edgecolor=   | Add borders to bars
plt.xticks() | Replace index numbers with real labels

🥧 Pie Charts

Pie charts show part-to-whole relationships. They're visually appealing but best used with fewer categories (3-6 slices ideal).

Basic Pie Chart

labels = ["Sachin", "Sehwag", "Kohli", "Yuvraj"]
runs = [18000, 8000, 12000, 9500]

plt.pie(runs, labels=labels, autopct='%1.1f%%')
plt.title("Career Runs Distribution")
plt.show()

🎨 Customization Options

colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
explode = [0.1, 0, 0, 0]  # Pull out first slice

plt.pie(
    runs,
    labels=labels,
    colors=colors,
    explode=explode,
    autopct='%1.1f%%',
    shadow=True,
    startangle=140,
    wedgeprops={'edgecolor': 'black'}
)
plt.show()

📊 Key Parameters

Parameter  | Description
labels     | Label each slice
colors     | Customize slice colors
explode    | Pull out slices for emphasis
autopct    | Show percentage text ('%1.1f%%')
shadow     | Add 3D-like depth
startangle | Rotate pie chart
⚠️ When to Avoid Pie Charts:
  • Too many categories (>6 slices)
  • When precise comparison is needed
  • When values are similar in size
  • Better alternatives: Bar charts or horizontal bar charts
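
To illustrate the bar-chart alternative, here is the career-runs pie redrawn as a sorted horizontal bar chart. This is a sketch of one reasonable approach: the percentage conversion mirrors what autopct displays, and sorting is an added design choice rather than a requirement.

```python
import matplotlib.pyplot as plt

labels = ["Sachin", "Sehwag", "Kohli", "Yuvraj"]
runs = [18000, 8000, 12000, 9500]

# Convert raw values to percentage shares, like the numbers autopct prints
total = sum(runs)
shares = [100 * r / total for r in runs]

# Sort ascending so the largest bar ends up at the top of the barh chart
pairs = sorted(zip(shares, labels))
sorted_shares = [s for s, _ in pairs]
sorted_labels = [name for _, name in pairs]

fig, ax = plt.subplots()
ax.barh(sorted_labels, sorted_shares, color="skyblue")
ax.set_xlabel("Share of combined runs (%)")
ax.set_title("Career Runs Distribution")
fig.tight_layout()
```

Bars share a common baseline, so similar values such as Sehwag's and Yuvraj's are easier to rank by length than by slice angle.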

📚 Stack Plots

Stack plots show how multiple quantities change over time, stacked on top of each other. Perfect for tracking composition over time!

Use Cases

  • ⏱️ Time spent on different activities over days
  • 👥 Distribution of tasks by team members
  • 📈 Website traffic sources over time
  • 💰 Budget allocation across departments

Stack Plot Example

days = [1, 2, 3, 4, 5, 6, 7]
studying = [3, 4, 3, 5, 4, 3, 4]
playing = [2, 2, 1, 1, 2, 3, 2]
watching_tv = [2, 1, 2, 2, 1, 1, 1]
sleeping = [5, 5, 6, 5, 6, 5, 5]

labels = ['Studying', 'Playing', 'Watching TV', 'Sleeping']
colors = ['skyblue', 'lightgreen', 'gold', 'lightcoral']

plt.stackplot(days, studying, playing, watching_tv, sleeping,
              labels=labels, colors=colors, alpha=0.8)
plt.legend(loc='upper left')
plt.title('Weekly Activity Tracker')
plt.xlabel('Day')
plt.ylabel('Hours')
plt.show()
💡 Stack Plot vs Pie Chart:
• Use pie charts for a snapshot in time
• Use stack plots to see how data changes over time
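
The two ideas can be combined: normalizing each day to 100% gives, in effect, one pie-style snapshot per day laid out along a time axis. A sketch using the activity data from the example above; the normalization step is an addition for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5, 6, 7]
hours = np.array([
    [3, 4, 3, 5, 4, 3, 4],  # Studying
    [2, 2, 1, 1, 2, 3, 2],  # Playing
    [2, 1, 2, 2, 1, 1, 1],  # Watching TV
    [5, 5, 6, 5, 6, 5, 5],  # Sleeping
])
labels = ["Studying", "Playing", "Watching TV", "Sleeping"]

# Divide each column (one day) by its total so every day sums to 100
percent = 100 * hours / hours.sum(axis=0)

fig, ax = plt.subplots()
ax.stackplot(days, percent, labels=labels, alpha=0.8)
ax.set_ylim(0, 100)
ax.set_xlabel("Day")
ax.set_ylabel("Share of tracked hours (%)")
ax.legend(loc="upper left")
```

The raw-hours version answers "how much?", while the normalized version answers "what proportion?"; which one fits depends on the question being asked.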

📊 Histograms

Histograms show the distribution of numerical data. They're essential for understanding data spread, detecting outliers, and seeing patterns.

When to Use Histograms

  • 📈 Understand distribution of numerical data (age, salary, test scores)
  • 🔍 Detect skewness and outliers
  • 📐 Check if data is normally distributed
  • 🎯 Analyze frequency within specific ranges

Understanding Bins

The bins argument controls how data is grouped:

  • Integer: Number of equal-width bins
  • List: Custom bin edges for specific ranges

# ages can be any list of numbers, such as the one defined in the next example

# 10 equal-width bins
plt.hist(ages, bins=10, edgecolor='black')

# Custom age groups
plt.hist(ages, bins=[10, 20, 30, 40, 60, 100], edgecolor='black')
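
To see what the two forms of bins actually do, the counts can be computed without drawing anything. plt.hist() relies on NumPy's histogram function, so np.histogram() returns the same counts and edges. The ages list below uses only the ten values that are visible in this tutorial's examples.

```python
import numpy as np

ages = [22, 25, 47, 52, 46, 56, 55, 60, 34, 43]

# Integer: the range from min(ages) to max(ages) is cut into equal-width bins
counts_equal, edges_equal = np.histogram(ages, bins=10)

# List: each pair of neighboring edges defines one bin
counts_custom, edges_custom = np.histogram(ages, bins=[10, 20, 30, 40, 60, 100])

print(edges_equal)     # 11 edges for 10 bins, from 22 to 60
print(counts_custom)   # [0 2 1 6 1]
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the age 60 lands in the 60 to 100 group rather than the 40 to 60 group.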

Adding Reference Lines

import numpy as np

# The trailing ... marks values omitted here; replace it with your own data before running
ages = [22, 25, 47, 52, 46, 56, 55, 60, 34, 43, ...]
bins = [10, 20, 30, 40, 50, 60, 70]

plt.hist(ages, bins=bins, edgecolor='black')
plt.axvline(np.mean(ages), color='red', linestyle='--', linewidth=2, label='Average Age')
plt.legend()
plt.title('Age Distribution with Mean')
plt.show()
📝 Key Parameters:
bins: Number or custom edges
edgecolor: Border color for bars
axvline: Vertical reference line

🎯 Scatter Plots

Scatter plots reveal relationships between two variables. They're perfect for finding correlations, patterns, and outliers.

Basic Scatter Plot

study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]
exam_scores = [40, 45, 50, 55, 60, 65, 75, 85, 90]

plt.scatter(study_hours, exam_scores)
plt.title('Study Hours vs Exam Score')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()
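
A scatter plot suggests a correlation visually; the strength of a linear relationship can also be put into a number. The sketch below uses NumPy's Pearson correlation and a least-squares line as one possible way to do that. The trend line is an addition for illustration, not part of the basic plot.

```python
import numpy as np
import matplotlib.pyplot as plt

study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]
exam_scores = [40, 45, 50, 55, 60, 65, 75, 85, 90]

# Pearson correlation: +1 is a perfect rising line, 0 means no linear relation
r = np.corrcoef(study_hours, exam_scores)[0, 1]

# Least-squares straight line through the points (degree-1 polynomial fit)
slope, intercept = np.polyfit(study_hours, exam_scores, 1)
fitted = [slope * h + intercept for h in study_hours]

fig, ax = plt.subplots()
ax.scatter(study_hours, exam_scores)
ax.plot(study_hours, fitted, color="red", label=f"Linear fit (r = {r:.2f})")
ax.set_xlabel("Study Hours")
ax.set_ylabel("Exam Score")
ax.legend()
```

For this sample data r comes out close to 1, which matches the nearly straight upward pattern of the points. A high r describes association only; it does not show that studying causes the higher scores.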

🎨 Adding Color & Size

# Size based on score
sizes = [score * 2 for score in exam_scores]

# Color based on performance
colors = ['red' if score < 60 else 'green' for score in exam_scores]

plt.scatter(study_hours, exam_scores, s=sizes, c=colors)
plt.title('Colored & Sized Scatter Plot')
plt.show()

🌈 Using Colormaps

plt.scatter(study_hours, exam_scores, c=exam_scores, cmap='viridis')
plt.colorbar(label='Score')
plt.title('Scatter with Gradient Colors')
plt.show()

📝 Adding Annotations

plt.scatter(study_hours, exam_scores)

for i in range(len(study_hours)):
    plt.annotate(f'Student {i+1}', (study_hours[i], exam_scores[i]))

plt.title('Scatter with Labels')
plt.show()

👥 Multiple Groups

class_a_hours = [2, 4, 6, 8]
class_a_scores = [45, 55, 65, 85]
class_b_hours = [1, 3, 5, 7, 9]
class_b_scores = [40, 50, 60, 70, 90]

plt.scatter(class_a_hours, class_a_scores, label='Class A', color='blue')
plt.scatter(class_b_hours, class_b_scores, label='Class B', color='orange')
plt.legend()
plt.show()

🎛️ Subplots - Multiple Plots in One Figure

Subplots allow you to display multiple plots side-by-side or in a grid. Perfect for comparing datasets or showing different aspects of your data!

Method 1: Using plt.subplot()

x = [1, 2, 3, 4, 5]
y1 = [i * 2 for i in x]
y2 = [i ** 2 for i in x]

# Create 1 row, 2 columns
plt.subplot(1, 2, 1)  # (rows, cols, plot_number)
plt.plot(x, y1)
plt.title('Double of x')

plt.subplot(1, 2, 2)
plt.plot(x, y2)
plt.title('Square of x')

plt.tight_layout()
plt.show()

2×2 Grid of Subplots

y3 = [i ** 0.5 for i in x]
y4 = [10 - i for i in x]

plt.figure(figsize=(8, 6))

plt.subplot(2, 2, 1)
plt.plot(x, y1)
plt.title('x * 2')

plt.subplot(2, 2, 2)
plt.plot(x, y2)
plt.title('x squared')

plt.subplot(2, 2, 3)
plt.plot(x, y3)
plt.title('sqrt(x)')

plt.subplot(2, 2, 4)
plt.plot(x, y4)
plt.title('10 - x')

plt.tight_layout()
plt.show()

Method 2: Using plt.subplots() (Recommended)

This method is cleaner and more flexible. It returns a Figure object and an array of Axes objects, one per subplot.

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# Plot on first subplot
axs[0].plot(x, y1)
axs[0].set_title('x * 2')

# Plot on second subplot
axs[1].plot(x, y2)
axs[1].set_title('x squared')

fig.suptitle('Comparison Plots', fontsize=14)
fig.tight_layout()
fig.subplots_adjust(top=0.85)  # Prevent title overlap
fig.savefig('my_plots.png')    # Save as image
plt.show()
🎯 Key Differences:
axs - for working on individual plots
fig - for settings that apply to the whole figure
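
The split between the two objects can be seen in a short sketch. The sharey option and the particular titles are assumptions chosen for illustration; the comments note the pyplot function that each Axes method corresponds to.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y1 = [i * 2 for i in x]
y2 = [i ** 2 for i in x]

# sharey=True is a whole-figure choice: both panels use one y-axis scale
fig, axs = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Axes-level calls: each one affects a single panel
axs[0].plot(x, y1)
axs[0].set_title("x * 2")       # pyplot equivalent: plt.title(...)
axs[0].set_ylabel("y")          # pyplot equivalent: plt.ylabel(...)
axs[1].plot(x, y2)
axs[1].set_title("x squared")

# Figure-level calls: these affect the whole canvas
fig.suptitle("Comparison on a shared y-axis")
fig.tight_layout()
```

A shared y-axis makes the two curves directly comparable, at the cost of flattening the smaller one; leave sharey off when each panel should use its own scale.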

🔄 Looping Over Subplots

fig, axs = plt.subplots(2, 2, figsize=(8, 6))

ys = [y1, y2, y3, y4]
titles = ['x * 2', 'x squared', 'sqrt(x)', '10 - x']

for i in range(2):
    for j in range(2):
        idx = i * 2 + j
        axs[i, j].plot(x, ys[idx])
        axs[i, j].set_title(titles[idx])

plt.tight_layout()
plt.show()

This approach is ideal for:

  • Dynamic or repetitive data series
  • Creating dashboards
  • Comparing multiple datasets efficiently
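
The nested loop can be shortened. For a grid, plt.subplots() returns a 2-D NumPy array of Axes, and its .flat attribute walks through that array row by row, so the index arithmetic is no longer needed. A sketch of the same 2×2 grid:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
ys = [
    [i * 2 for i in x],
    [i ** 2 for i in x],
    [i ** 0.5 for i in x],
    [10 - i for i in x],
]
titles = ["x * 2", "x squared", "sqrt(x)", "10 - x"]

fig, axs = plt.subplots(2, 2, figsize=(8, 6))

# zip pairs each Axes with its data series and title
for ax, y, title in zip(axs.flat, ys, titles):
    ax.plot(x, y)
    ax.set_title(title)

fig.tight_layout()
```

Because zip stops at the shortest input, the same loop also works when there are fewer data series than grid cells; the leftover cells simply stay empty.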

🎨 Introduction to Seaborn

What is Seaborn?

Seaborn is a Python library built on top of Matplotlib that makes it easier to create beautiful, complex visualizations.

Why Choose Seaborn?

  • Less Code: High-level interface for complex plots
  • 🎨 Better Looking: Automatic styling and themes
  • 📊 DataFrame Ready: Works seamlessly with pandas
  • 🔧 Built-in Features: Statistical plots, color palettes, themes
  • 📦 Built-in Datasets: Practice data included

Setup & Installation

# Install Seaborn (the leading "!" is notebook syntax; in a terminal, run: pip install seaborn)
!pip install seaborn

# Import libraries
import seaborn as sns
import matplotlib.pyplot as plt

🎭 Seaborn Themes

# Set theme (styles: darkgrid, whitegrid, dark, white, ticks)
sns.set_theme(style="darkgrid")

import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

sns.lineplot(x=x, y=y)
plt.title('Beautiful Line Plot')
plt.show()

📦 Built-in Datasets

Seaborn includes real-world datasets for practice and learning!

# See all available datasets
print(sns.get_dataset_names())

# Load a dataset
tips = sns.load_dataset('tips')
print(tips.head())

Common Datasets:

  • tips - Restaurant bills and tips
  • iris - Iris flower measurements
  • titanic - Titanic passenger data
  • flights - Flight passenger counts
  • penguins - Penguin species data

📊 Basic Plot Types

1. Line Plot

tips = sns.load_dataset('tips')

sns.lineplot(x="total_bill", y="tip", data=tips)
plt.title('Line Plot Example')
plt.show()

2. Scatter Plot with Color

sns.scatterplot(x="total_bill", y="tip", hue="time", data=tips)
plt.title('Scatter Plot with Color by Time')
plt.show()

3. Bar Plot

sns.barplot(x="day", y="total_bill", data=tips)
plt.title('Average Bill per Day')
plt.show()
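
The title says "Average" because sns.barplot() aggregates: by default each bar's height is the mean of y within one x category, and the error bar is a bootstrapped 95% confidence interval. The sketch below reproduces the bar heights with pandas on a small made-up table shaped like the tips dataset (the rows are hypothetical, not taken from the real data).

```python
import pandas as pd

# Hypothetical rows for illustration only
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 15.0, 25.0, 30.0, 50.0],
})

# One mean per category: these are the heights barplot would draw
mean_bill = df.groupby("day")["total_bill"].mean()
print(mean_bill)
```

This is the main difference from plt.bar(), which draws exactly the values it is given and performs no aggregation.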

4. Box Plot (Distribution Analysis)

sns.boxplot(x="day", y="total_bill", data=tips)
plt.title('Boxplot of Total Bill per Day')
plt.show()
📊 Boxplots show:
• Median (middle line)
• Quartiles (box edges)
• Outliers (dots)
• Data spread
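
The numbers behind a boxplot can be computed by hand. This sketch assumes the common 1.5 × IQR rule for whiskers, which is the default setting, and uses made-up bill amounts rather than the real tips data.

```python
import numpy as np

# Hypothetical bill amounts, not taken from the tips dataset
bills = np.array([10.5, 12.0, 14.2, 15.8, 17.5, 18.3, 20.1, 22.7, 24.9, 48.0])

# Box edges and middle line
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Whiskers reach the most extreme points inside these fences;
# anything beyond a fence is drawn as an individual outlier dot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3)
print(outliers)   # [48.]
```

Here the single large bill of 48.0 lies above the upper fence, so a boxplot of this data would show it as a lone dot above the top whisker.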

5. Heatmap (Pivot Table)

flights = sns.load_dataset('flights')

# pivot() needs keyword arguments; positional arguments were removed in pandas 2.0
pivot_table = flights.pivot(index="month", columns="year", values="passengers")

sns.heatmap(pivot_table, annot=True, fmt="d", cmap="YlGnBu")
plt.title('Heatmap of Passengers')
plt.show()
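
Heatmaps are also the usual way to display a correlation matrix. A minimal sketch, assuming a small DataFrame of numeric columns: the age and salary values are the ones used in this tutorial's DataFrame example, while the hours_per_week column is invented for illustration.

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56, 55, 60, 34, 43],
    "salary": [25000, 27000, 52000, 60000, 58000, 62000, 61000, 65000, 38000, 45000],
    "hours_per_week": [45, 42, 40, 38, 41, 36, 37, 35, 44, 40],  # hypothetical column
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()

# Fix the color scale to the full -1..1 range so colors are comparable across plots
ax = sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation Matrix")
```

A diverging colormap such as coolwarm suits correlations because the neutral midpoint falls at zero, so positive and negative relationships get visibly different colors.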

🐼 Working with Pandas DataFrames

import pandas as pd

df = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56, 55, 60, 34, 43],
    "salary": [25000, 27000, 52000, 60000, 58000, 62000, 61000, 65000, 38000, 45000],
    "gender": ["M", "F", "M", "F", "F", "M", "M", "F", "F", "M"]
})

sns.scatterplot(x="age", y="salary", hue="gender", data=df)
plt.title('Salary vs Age by Gender')
plt.show()

📋 Matplotlib vs Seaborn Comparison

Feature           | Matplotlib         | Seaborn
Default Styles    | Basic              | Beautiful ✨
Syntax Level      | Low-level          | High-level
DataFrame Support | Manual             | Native & Easy
Complex Plots     | Tedious            | Very Easy
Statistical Plots | Manual calculation | Built-in
Customization     | Full control       | Smart defaults + customizable
💡 Best Practice:
• Start with Seaborn for quick, beautiful plots
• Use Matplotlib for fine-tuning and customization
• Combine both for maximum power!
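
Combining the two libraries usually looks like the sketch below: Seaborn draws onto a Matplotlib Axes passed through its ax parameter, and ordinary Matplotlib calls then fine-tune the result. The titles, labels, and the mean-salary reference line are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56, 55, 60, 34, 43],
    "salary": [25000, 27000, 52000, 60000, 58000, 62000, 61000, 65000, 38000, 45000],
    "gender": ["M", "F", "M", "F", "F", "M", "M", "F", "F", "M"],
})

sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(7, 4))

# Step 1: Seaborn handles grouping, colors, and the legend
sns.scatterplot(data=df, x="age", y="salary", hue="gender", ax=ax)

# Step 2: Matplotlib handles the fine-tuning on the same Axes
ax.set_title("Salary vs Age by Gender")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Salary")
ax.axhline(df["salary"].mean(), color="gray", linestyle="--")  # mean salary reference line
fig.tight_layout()
```

Axes-level Seaborn functions such as scatterplot, barplot, and boxplot return the Axes they drew on, so the same pattern works even without creating fig and ax first.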

🎯 Final Recommendations

  • 📚 Learn Matplotlib basics - Foundation for customization
  • 🚀 Use Seaborn daily - Faster development, prettier results
  • 🔧 Combine both - Best of both worlds
  • 📊 Practice with real data - Use built-in datasets
  • 🎨 Experiment with styles - Find what works for you
🎓 Key Takeaway:
In real-world Data Science projects, Seaborn can save a lot of manual work through its higher-level interface and sensible defaults. Start with Seaborn, customize with Matplotlib!