Data Visualization Learning Platform

🎯 Why Data Visualization Matters

Data visualization is the bridge between raw data and human understanding.

When done right, it helps us:

✨ Reveal patterns, trends, and correlations in the data
💬 Communicate insights clearly to stakeholders
⚡ Speed up decision-making by simplifying complex datasets
📖 Make data storytelling engaging and accessible to all

💡 John Tukey's Wisdom: "The greatest value of a picture is when it forces us to notice what we never expected to see."

Exploratory vs Explanatory Visualizations

Aspect	Exploratory	Explanatory
Goal	Find insights	Communicate insights
Audience	Analyst / Data Scientist	Stakeholders / Public
Style	Raw, fast, flexible	Polished, focused, clean
Examples	Pair plots, correlation heatmaps	Bar charts in presentations

🎨 5 Basic Principles of Good Visualizations

Clarity: Avoid clutter. Use labels, legends, and proper axis scales
Context: What is being measured? Over what time frame? In what units?
Focus: Highlight the key insight using colors and annotations
Storytelling: Don't just show data — tell a story. Guide the viewer
Accessibility: Use color palettes that stay readable for all viewers, including those with color vision deficiency

Pro Tip: Always ask yourself: "What is the ONE thing I want the viewer to understand from this visual?"
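
As a rough illustration of these principles, the sketch below labels both axes, states the time frame in the title, and annotates one point as the key insight. It reuses the yearly-runs figures that appear later in this tutorial; highlighting the peak year is only an assumption made for the example, not the one "right" story for this data.

```python
import matplotlib.pyplot as plt

# Illustrative data (the same yearly runs used later in this tutorial)
years = [1990, 1992, 1994, 1996, 1998, 2000, 2003, 2005, 2007, 2010]
runs = [500, 700, 1100, 1500, 1800, 1200, 1700, 1300, 900, 1500]

# Focus: find the single point the viewer should notice
peak_index = runs.index(max(runs))
peak_year, peak_runs = years[peak_index], runs[peak_index]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(years, runs, color="gray", marker="o")               # muted line for context
ax.scatter([peak_year], [peak_runs], color="red", zorder=3)  # highlighted insight

# Clarity and context: what is measured, in what units, over what period
ax.set_xlabel("Year")
ax.set_ylabel("Runs scored per year")
ax.set_title("Yearly runs, 1990-2010")

# Storytelling: point the viewer at the takeaway
ax.annotate(f"Peak: {peak_runs} runs in {peak_year}",
            xy=(peak_year, peak_runs),
            xytext=(peak_year + 1, peak_runs + 100),
            arrowprops={"arrowstyle": "->"})

ax.set_ylim(0, 2200)  # headroom so the annotation is not clipped
fig.tight_layout()
plt.show()
```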

📈 Introduction to Matplotlib

What is matplotlib.pyplot?

matplotlib.pyplot is a module in Matplotlib — it's like a paintbrush for your data.

We usually import it as plt to save typing!

                    import matplotlib.pyplot as plt

                    # Create a simple plot
                    plt.plot([1, 2, 3], [4, 5, 6])
                    plt.show() # Display the plot
                

🎮 Interacting with Plots

When a plot appears, you can:

🔍 Zoom In/Out
✋ Pan around
⬅️ Use arrows to navigate history
🏠 Reset to home view
💾 Save as PNG using the disk icon
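
The disk icon has a programmatic counterpart, which is handy when a script has to run without anyone clicking. A minimal sketch, assuming the hypothetical file name simple_plot.png and that a resolution of 150 dpi is acceptable:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
ax.set_title("Simple plot")

# Saving before plt.show() is the safe order: plt.savefig() called after the
# plot window has been closed can write a blank image.
fig.savefig("simple_plot.png", dpi=150, bbox_inches="tight")
plt.show()
```

The file extension selects the output format, so changing the name to end in .pdf or .svg should produce a vector file instead of a PNG.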

📊 Real Example: Cricket Player Runs Over Time

                    years = [1990, 1992, 1994, 1996, 1998, 2000, 2003, 2005, 2007, 2010]
                    runs = [500, 700, 1100, 1500, 1800, 1200, 1700, 1300, 900, 1500]

                    plt.plot(years, runs)
                    plt.xlabel("Year")
                    plt.ylabel("Runs Scored")
                    plt.title("Sachin Tendulkar's Yearly Runs")
                    plt.show()
                

🎨 Customization Options

Format Strings

                        plt.plot(years, runs, 'ro--') # red circles with dashed lines
                        plt.plot(years, runs, 'g^:') # green triangles dotted
                    

Color and Line Styles

                        plt.plot(years, runs,
                        color='orange',
                        linestyle='--',
                        linewidth=3,
                        label="Player 1")
                        plt.legend()
                        plt.grid(True)
                        plt.tight_layout()
                    

🎭 Plot Styles

                    # See all available styles
                    print(plt.style.available)

                    # Apply a style
                    plt.style.use("ggplot")
                    plt.style.use("seaborn-v0_8-bright")

                    # XKCD Comic Style!
                    with plt.xkcd():
                        plt.plot(years, runs)
                        plt.title("Epic Battle!")
                        plt.show()
                

💡 Pro Tips:

Always start with simple plots
Add labels and legends early
Use plt.grid() and plt.tight_layout() for readability
Try different styles to find what works best
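
One way to try several styles without changing the global settings is plt.style.context, which applies a style only inside a with block. The sketch below is an illustration rather than a recommendation; the three style names are assumed to be available in your Matplotlib installation (check plt.style.available).

```python
import matplotlib.pyplot as plt

years = [1990, 1992, 1994, 1996, 1998]
runs = [500, 700, 1100, 1500, 1800]

grid_before = plt.rcParams["axes.grid"]  # remember one global setting

# Draw the same data once per style; each style applies only inside its block
for style_name in ["default", "ggplot", "grayscale"]:
    with plt.style.context(style_name):
        fig, ax = plt.subplots(figsize=(5, 3))
        ax.plot(years, runs, marker="o")
        ax.set_title(f"Style: {style_name}")
        fig.tight_layout()

grid_after = plt.rcParams["axes.grid"]   # same value as before the loop
plt.show()
```

Unlike plt.style.use(), which keeps the style active for every later plot, the context manager restores the previous settings when the block ends.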

📊 Bar Charts

Bar charts are perfect for comparing quantities across categories. They're easy to read and powerful for visual analysis.

Basic Vertical Bar Chart

                    years = [1990, 1992, 1994, 1996, 1998, 2000]
                    runs = [500, 700, 1100, 1500, 1800, 1200]

                    plt.bar(years, runs, edgecolor='black')
                    plt.xlabel("Year")
                    plt.ylabel("Runs Scored")
                    plt.title("Yearly Performance")
                    plt.show()
                

🎯 Side-by-Side Comparison

                    import numpy as np

                    years = [1990, 1992, 1994, 1996, 1998]  # one label per value below
                    sachin = [500, 700, 1100, 1500, 1800]
                    kohli = [0, 500, 800, 1100, 1300]
                    sehwag = [0, 200, 900, 1400, 1600]

                    x = np.arange(len(years))
                    width = 0.25

                    plt.bar(x - width, sachin, width, label="Sachin")
                    plt.bar(x, sehwag, width, label="Sehwag")
                    plt.bar(x + width, kohli, width, label="Kohli")

                    plt.xticks(x, years)
                    plt.legend()
                    plt.show()
                

🔍 Why use xticks()?
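In this example the bars are drawn at numeric positions (0, 1, 2, ...) so that each player's bars can be shifted left or right by width. plt.xticks(x, years) then replaces those index numbers with the real category labels, such as years or names.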
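
The offset arithmetic generalizes to any number of series. The helper below, grouped_bar_offsets, is a hypothetical name introduced only for this sketch; it centers each group of bars on its tick so the labels line up.

```python
import numpy as np
import matplotlib.pyplot as plt

def grouped_bar_offsets(n_series, width):
    """Return the x-offset of each series so a group stays centered on its tick."""
    return [(i - (n_series - 1) / 2) * width for i in range(n_series)]

years = [1990, 1992, 1994, 1996, 1998]
series = {
    "Sachin": [500, 700, 1100, 1500, 1800],
    "Sehwag": [0, 200, 900, 1400, 1600],
    "Kohli": [0, 500, 800, 1100, 1300],
}

x = np.arange(len(years))  # numeric positions 0, 1, 2, ...
width = 0.25
offsets = grouped_bar_offsets(len(series), width)

fig, ax = plt.subplots()
for offset, (name, values) in zip(offsets, series.items()):
    ax.bar(x + offset, values, width, label=name)

ax.set_xticks(x)           # one tick at the center of each group
ax.set_xticklabels(years)  # show the year instead of the index
ax.legend()
plt.show()
```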

↔️ Horizontal Bar Charts

                    players = ["Sachin", "Sehwag", "Kohli", "Yuvraj"]
                    total_runs = [5600, 4100, 2400, 3700]

                    plt.barh(players, total_runs, color="skyblue")
                    plt.xlabel("Total Runs in First 5 Years")
                    plt.title("Performance Comparison")
                    plt.show()
                

📝 Adding Value Labels

                    players = ["Sachin", "Sehwag", "Kohli"]
                    runs = [1500, 1200, 1800]

                    plt.bar(players, runs, color="skyblue")

                    # Add labels on top
                    for i in range(len(players)):
                        plt.text(i, runs[i] + 50, str(runs[i]), ha='center')

                    plt.show()
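
As an alternative to the manual loop, recent Matplotlib versions (3.4 and later) provide Axes.bar_label, which positions one label per bar. A minimal sketch, assuming such a version is installed:

```python
import matplotlib.pyplot as plt

players = ["Sachin", "Sehwag", "Kohli"]
runs = [1500, 1200, 1800]

fig, ax = plt.subplots()
bars = ax.bar(players, runs, color="skyblue")

# bar_label writes each bar's height above it; padding is the gap in points
labels = ax.bar_label(bars, padding=3)

ax.set_ylim(0, max(runs) * 1.1)  # headroom so the tallest label is not clipped
plt.show()
```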
                

📋 Quick Reference

Feature	Use
`plt.bar()`	Vertical bars for categorical comparison
`plt.barh()`	Horizontal bars (great for long labels)
`width=`	Control thickness/spacing of bars
`edgecolor=`	Add borders to bars
`plt.xticks()`	Replace index numbers with real labels

🥧 Pie Charts

Pie charts show part-to-whole relationships. They're visually appealing but best used with fewer categories (3-6 slices ideal).

Basic Pie Chart

                    labels = ["Sachin", "Sehwag", "Kohli", "Yuvraj"]
                    runs = [18000, 8000, 12000, 9500]

                    plt.pie(runs, labels=labels, autopct='%1.1f%%')
                    plt.title("Career Runs Distribution")
                    plt.show()
                

🎨 Customization Options

                    colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
                    explode = [0.1, 0, 0, 0]  # Pull out first slice

                    plt.pie(
                        runs,
                        labels=labels,
                        colors=colors,
                        explode=explode,
                        autopct='%1.1f%%',
                        shadow=True,
                        startangle=140,
                        wedgeprops={'edgecolor': 'black'}
                    )
                    plt.show()
                

📊 Key Parameters

Parameter	Description
`labels`	Label each slice
`colors`	Customize slice colors
`explode`	Pull out slices for emphasis
`autopct`	Show percentage text ('%1.1f%%')
`shadow`	Add 3D-like depth
`startangle`	Rotate pie chart

⚠️ When to Avoid Pie Charts:

Too many categories (>6 slices)
When precise comparison is needed
When values are similar in size
Better alternatives: Bar charts or horizontal bar charts
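
To make the bar-chart alternative concrete, the sketch below redraws the career-runs data from the pie example as a sorted horizontal bar chart with percentage labels. Sorting and the label offsets are choices made for this illustration.

```python
import matplotlib.pyplot as plt

labels = ["Sachin", "Sehwag", "Kohli", "Yuvraj"]
runs = [18000, 8000, 12000, 9500]

# Convert to percentage shares, then sort so the largest bar ends up on top
total = sum(runs)
shares = sorted(
    ((name, 100 * value / total) for name, value in zip(labels, runs)),
    key=lambda pair: pair[1],
)
names = [name for name, _ in shares]
percents = [pct for _, pct in shares]

fig, ax = plt.subplots()
ax.barh(names, percents, color="skyblue")  # barh draws the first item at the bottom
for i, pct in enumerate(percents):
    ax.text(pct + 0.5, i, f"{pct:.1f}%", va="center")

ax.set_xlabel("Share of combined career runs (%)")
ax.set_xlim(0, max(percents) + 8)  # room for the text labels
plt.show()
```

Because all bars share one baseline, similar values such as 16.8% and 20.0% are easier to tell apart than two neighbouring pie slices.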

📚 Stack Plots

Stack plots show how multiple quantities change over time, stacked on top of each other. Perfect for tracking composition over time!

Use Cases

⏱️ Time spent on different activities over days
👥 Distribution of tasks by team members
📈 Website traffic sources over time
💰 Budget allocation across departments

Stack Plot Example

                    days = [1, 2, 3, 4, 5, 6, 7]
                    studying = [3, 4, 3, 5, 4, 3, 4]
                    playing = [2, 2, 1, 1, 2, 3, 2]
                    watching_tv = [2, 1, 2, 2, 1, 1, 1]
                    sleeping = [5, 5, 6, 5, 6, 5, 5]

                    labels = ['Studying', 'Playing', 'Watching TV', 'Sleeping']
                    colors = ['skyblue', 'lightgreen', 'gold', 'lightcoral']

                    plt.stackplot(days, studying, playing, watching_tv, sleeping,
                    labels=labels, colors=colors, alpha=0.8)
                    plt.legend(loc='upper left')
                    plt.title('Weekly Activity Tracker')
                    plt.xlabel('Day')
                    plt.ylabel('Hours')
                    plt.show()
                

💡 Stack Plot vs Pie Chart:
• Use pie charts for a snapshot in time
• Use stack plots to see how data changes over time
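
When the question is about composition rather than absolute amounts, one option is to normalize each day to 100% before stacking. A sketch using the activity data from the example above; the normalization step is an addition for illustration, not part of the original example.

```python
import numpy as np
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5, 6, 7]
hours = np.array([
    [3, 4, 3, 5, 4, 3, 4],  # studying
    [2, 2, 1, 1, 2, 3, 2],  # playing
    [2, 1, 2, 2, 1, 1, 1],  # watching TV
    [5, 5, 6, 5, 6, 5, 5],  # sleeping
])
labels = ["Studying", "Playing", "Watching TV", "Sleeping"]

# Divide each day's column by that day's total so every day sums to 100%
percent = 100 * hours / hours.sum(axis=0)

fig, ax = plt.subplots()
ax.stackplot(days, percent, labels=labels, alpha=0.8)
ax.set_xlabel("Day")
ax.set_ylabel("Share of tracked hours (%)")
ax.set_ylim(0, 100)
ax.legend(loc="upper left")
plt.show()
```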

📊 Histograms

Histograms show the distribution of numerical data. They're essential for understanding data spread, detecting outliers, and seeing patterns.

When to Use Histograms

📈 Understand distribution of numerical data (age, salary, test scores)
🔍 Detect skewness and outliers
📐 Check whether data looks roughly normally distributed
🎯 Analyze frequency within specific ranges

Understanding Bins

The bins argument controls how data is grouped:

Integer: Number of equal-width bins
List: Custom bin edges for specific ranges

                    # ages is a list of numbers (defined in the next example)

                    # 10 equal-width bins
                    plt.hist(ages, bins=10, edgecolor='black')

                    # Custom age groups
                    plt.hist(ages, bins=[10, 20, 30, 40, 60, 100], edgecolor='black')
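
To see what the two forms of bins actually produce, the counts can be inspected without drawing anything by calling np.histogram, which takes the same bins argument. The sketch assumes only the ten ages visible in this tutorial's example.

```python
import numpy as np

# Assumption: only the ten ages that are visible in this tutorial's example
ages = [22, 25, 47, 52, 46, 56, 55, 60, 34, 43]

# Integer: 10 equal-width bins between min(ages) and max(ages)
counts_equal, edges_equal = np.histogram(ages, bins=10)
print(edges_equal)  # 11 edges define 10 bins

# List: each pair of neighbouring edges is one bin
custom_edges = [10, 20, 30, 40, 60, 100]
counts_custom, _ = np.histogram(ages, bins=custom_edges)
for low, high, count in zip(custom_edges[:-1], custom_edges[1:], counts_custom):
    print(f"{low}-{high}: {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the age 60 lands in the 60-100 group.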
                

Adding Reference Lines

                    import numpy as np

                    ages = [22, 25, 47, 52, 46, 56, 55, 60, 34, 43, ...]  # "..." marks omitted values; remove it before running
                    bins = [10, 20, 30, 40, 50, 60, 70]

                    plt.hist(ages, bins=bins, edgecolor='black')
                    plt.axvline(np.mean(ages), color='red',
                                linestyle='--', linewidth=2,
                                label='Average Age')
                    plt.legend()
                    plt.title('Age Distribution with Mean')
                    plt.show()
                

📝 Key Parameters:
• bins: Number or custom edges
• edgecolor: Border color for bars
• axvline: Vertical reference line

🎯 Scatter Plots

Scatter plots reveal relationships between two variables. They're perfect for finding correlations, patterns, and outliers.

Basic Scatter Plot

                    study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]
                    exam_scores = [40, 45, 50, 55, 60, 65, 75, 85, 90]

                    plt.scatter(study_hours, exam_scores)
                    plt.title('Study Hours vs Exam Score')
                    plt.xlabel('Study Hours')
                    plt.ylabel('Exam Score')
                    plt.grid(True)
                    plt.show()
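
A scatter plot suggests a relationship visually; a correlation coefficient and a fitted line put a number on it. The sketch below uses np.corrcoef and np.polyfit on the same data. A straight-line fit is an assumption that happens to suit this example and will not suit every dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]
exam_scores = [40, 45, 50, 55, 60, 65, 75, 85, 90]

# Pearson correlation: +1 is a perfect rising line, 0 is no linear relationship
r = np.corrcoef(study_hours, exam_scores)[0, 1]

# Least-squares straight line: score = slope * hours + intercept
slope, intercept = np.polyfit(study_hours, exam_scores, deg=1)
trend = [slope * h + intercept for h in study_hours]

fig, ax = plt.subplots()
ax.scatter(study_hours, exam_scores, label="Students")
ax.plot(study_hours, trend, color="red", linestyle="--",
        label=f"Trend (r = {r:.2f})")
ax.set_xlabel("Study Hours")
ax.set_ylabel("Exam Score")
ax.legend()
plt.show()
```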
                

🎨 Adding Color & Size

                    # Size based on score
                    sizes = [score * 2 for score in exam_scores]

                    # Color based on performance
                    colors = ['red' if score < 60 else 'green'
                              for score in exam_scores]

                    plt.scatter(study_hours, exam_scores, s=sizes, c=colors)
                    plt.title('Colored & Sized Scatter Plot')
                    plt.show()
                

🌈 Using Colormaps

                    plt.scatter(study_hours, exam_scores,
                    c=exam_scores, cmap='viridis')
                    plt.colorbar(label='Score')
                    plt.title('Scatter with Gradient Colors')
                    plt.show()
                

📝 Adding Annotations

                    plt.scatter(study_hours, exam_scores)

                    for i in range(len(study_hours)):
                        plt.annotate(f'Student {i+1}',
                                     (study_hours[i], exam_scores[i]))

                    plt.title('Scatter with Labels')
                    plt.show()
                

👥 Multiple Groups

                    class_a_hours = [2, 4, 6, 8]
                    class_a_scores = [45, 55, 65, 85]
                    class_b_hours = [1, 3, 5, 7, 9]
                    class_b_scores = [40, 50, 60, 70, 90]

                    plt.scatter(class_a_hours, class_a_scores,
                    label='Class A', color='blue')
                    plt.scatter(class_b_hours, class_b_scores,
                    label='Class B', color='orange')
                    plt.legend()
                    plt.show()
                

🎛️ Subplots - Multiple Plots in One Figure

Subplots allow you to display multiple plots side-by-side or in a grid. Perfect for comparing datasets or showing different aspects of your data!

Method 1: Using plt.subplot()

                    x = [1, 2, 3, 4, 5]
                    y1 = [i * 2 for i in x]
                    y2 = [i ** 2 for i in x]

                    # Create 1 row, 2 columns
                    plt.subplot(1, 2, 1) # (rows, cols, plot_number)
                    plt.plot(x, y1)
                    plt.title('Double of x')

                    plt.subplot(1, 2, 2)
                    plt.plot(x, y2)
                    plt.title('Square of x')

                    plt.tight_layout()
                    plt.show()
                

2×2 Grid of Subplots

                    y3 = [i ** 0.5 for i in x]
                    y4 = [10 - i for i in x]

                    plt.figure(figsize=(8, 6))

                    plt.subplot(2, 2, 1)
                    plt.plot(x, y1)
                    plt.title('x * 2')

                    plt.subplot(2, 2, 2)
                    plt.plot(x, y2)
                    plt.title('x squared')

                    plt.subplot(2, 2, 3)
                    plt.plot(x, y3)
                    plt.title('sqrt(x)')

                    plt.subplot(2, 2, 4)
                    plt.plot(x, y4)
                    plt.title('10 - x')

                    plt.tight_layout()
                    plt.show()
                

Method 2: Using plt.subplots() (Recommended)

This method is cleaner and more flexible. It returns a Figure object and an array of Axes objects, one per subplot.

                    fig, axs = plt.subplots(1, 2, figsize=(10, 4))

                    # Plot on first subplot
                    axs[0].plot(x, y1)
                    axs[0].set_title('x * 2')

                    # Plot on second subplot
                    axs[1].plot(x, y2)
                    axs[1].set_title('x squared')

                    fig.suptitle('Comparison Plots', fontsize=14)
                    fig.tight_layout()
                    fig.subplots_adjust(top=0.85) # Prevent title overlap
                    fig.savefig('my_plots.png') # Save as image
                    plt.show()
                

🎯 Key Differences:
• axs - for working on individual plots
• fig - for settings that apply to the whole figure

🔄 Looping Over Subplots

                    fig, axs = plt.subplots(2, 2,
                    figsize=(8, 6))

                    ys = [y1, y2, y3, y4]
                    titles = ['x * 2', 'x squared', 'sqrt(x)', '10 - x']

                    for i in range(2):
                        for j in range(2):
                            idx = i * 2 + j
                            axs[i, j].plot(x, ys[idx])
                            axs[i, j].set_title(titles[idx])

                    plt.tight_layout()
                    plt.show()
                

This approach is ideal for:

Dynamic or repetitive data series
Creating dashboards
Comparing multiple datasets efficiently
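
For grids of any shape, the nested loop and index arithmetic can be replaced by iterating over axs.flat, which visits the subplots row by row. A sketch of that variant; storing the series in a dictionary is just one convenient layout.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
series = {
    "x * 2": [i * 2 for i in x],
    "x squared": [i ** 2 for i in x],
    "sqrt(x)": [i ** 0.5 for i in x],
    "10 - x": [10 - i for i in x],
}

fig, axs = plt.subplots(2, 2, figsize=(8, 6), sharex=True)

# axs.flat walks the grid row by row, so no index arithmetic is needed
for ax, (title, y) in zip(axs.flat, series.items()):
    ax.plot(x, y)
    ax.set_title(title)

fig.tight_layout()
plt.show()
```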

🎨 Introduction to Seaborn

What is Seaborn?

Seaborn is a Python library built on top of Matplotlib that makes it easier to create beautiful, complex visualizations.

Why Choose Seaborn?

✨ Less Code: High-level interface for complex plots
🎨 Better Looking: Automatic styling and themes
📊 DataFrame Ready: Works seamlessly with pandas
🔧 Built-in Features: Statistical plots, color palettes, themes
📦 Built-in Datasets: Practice data included

Setup & Installation

                    # Install Seaborn (the leading "!" is Jupyter/Colab notebook syntax;
                    # in a terminal, run: pip install seaborn)
                    !pip install seaborn

                    # Import libraries
                    import seaborn as sns
                    import matplotlib.pyplot as plt
                

🎭 Seaborn Themes

                    # Set theme (styles: darkgrid, whitegrid, dark, white, ticks)
                    sns.set_theme(style="darkgrid")

                    import numpy as np
                    x = np.linspace(0, 10, 100)
                    y = np.sin(x)

                    sns.lineplot(x=x, y=y)
                    plt.title('Beautiful Line Plot')
                    plt.show()
                

📦 Built-in Datasets

Seaborn includes real-world datasets for practice and learning!

                    # See all available datasets
                    print(sns.get_dataset_names())

                    # Load a dataset
                    tips = sns.load_dataset('tips')
                    print(tips.head())
                

Common Datasets:

tips - Restaurant bills and tips
iris - Iris flower measurements
titanic - Titanic passenger data
flights - Flight passenger counts
penguins - Penguin species data

📊 Basic Plot Types

1. Line Plot

                    tips = sns.load_dataset('tips')
                    sns.lineplot(x="total_bill", y="tip", data=tips)
                    plt.title('Line Plot Example')
                    plt.show()
                

2. Scatter Plot with Color

                    sns.scatterplot(x="total_bill", y="tip",
                                    hue="time", data=tips)
                    plt.title('Scatter Plot with Color by Time')
                    plt.show()
                

3. Bar Plot

                    sns.barplot(x="day", y="total_bill", data=tips)
                    plt.title('Average Bill per Day')
                    plt.show()
                

4. Box Plot (Distribution Analysis)

                    sns.boxplot(x="day", y="total_bill", data=tips)
                    plt.title('Boxplot of Total Bill per Day')
                    plt.show()
                

📊 Boxplots show:
• Median (middle line)
• Quartiles (box edges)
• Outliers (dots)
• Data spread
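
The numbers behind a boxplot can be computed by hand, which helps when reading one. The sketch below applies the 1.5 × IQR rule, the default whisker setting (whis=1.5) in both Matplotlib and Seaborn. The sample values are hypothetical, and quartile conventions can differ slightly between libraries.

```python
import numpy as np

# Hypothetical sample with one unusually large value
data = [10, 12, 13, 15, 16, 18, 19, 21, 22, 48]

q1, median, q3 = np.percentile(data, [25, 50, 75])  # box edges and middle line
iqr = q3 - q1                                       # height of the box

# Points beyond 1.5 * IQR from the box are drawn as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```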

5. Heatmap (Pivot Table)

                    flights = sns.load_dataset('flights')
                    # Keyword arguments are required by pandas 2.0 and later
                    pivot_table = flights.pivot(index="month", columns="year",
                                                values="passengers")

                    sns.heatmap(pivot_table, annot=True,
                                fmt="d", cmap="YlGnBu")
                    plt.title('Heatmap of Passengers')
                    plt.show()
                

🐼 Working with Pandas DataFrames

                    import pandas as pd

                    df = pd.DataFrame({
                        "age": [22, 25, 47, 52, 46, 56, 55, 60, 34, 43],
                        "salary": [25000, 27000, 52000, 60000, 58000,
                                   62000, 61000, 65000, 38000, 45000],
                        "gender": ["M", "F", "M", "F", "F",
                                   "M", "M", "F", "F", "M"]
                    })

                    sns.scatterplot(x="age", y="salary", hue="gender", data=df)
                    plt.title('Salary vs Age by Gender')
                    plt.show()
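
The same DataFrame can feed a correlation heatmap, one of the exploratory plots listed at the start of this tutorial. A minimal sketch, assuming pandas and Seaborn are installed; only the numeric columns are passed to corr(), since gender is text.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56, 55, 60, 34, 43],
    "salary": [25000, 27000, 52000, 60000, 58000,
               62000, 61000, 65000, 38000, 45000],
    "gender": ["M", "F", "M", "F", "F",
               "M", "M", "F", "F", "M"],
})

# Pairwise Pearson correlations between the numeric columns
corr = df[["age", "salary"]].corr()

# Fix the color scale to the full -1..1 range so colors are comparable across plots
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="YlGnBu", vmin=-1, vmax=1)
ax.set_title("Correlation between Age and Salary")
plt.show()
```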
                

📋 Matplotlib vs Seaborn Comparison

Feature	Matplotlib	Seaborn
Default Styles	Basic	Beautiful ✨
Syntax Level	Low-level	High-level
DataFrame Support	Manual	Native & Easy
Complex Plots	Tedious	Very Easy
Statistical Plots	Manual calculation	Built-in
Customization	Full control	Smart defaults + customizable

💡 Best Practice:
• Start with Seaborn for quick, beautiful plots
• Use Matplotlib for fine-tuning and customization
• Combine both for maximum power!

🎯 Final Recommendations

📚 Learn Matplotlib basics - Foundation for customization
🚀 Use Seaborn daily - Faster development, prettier results
🔧 Combine both - Best of both worlds
📊 Practice with real data - Use built-in datasets
🎨 Experiment with styles - Find what works for you

🎓 Key Takeaway:
In real-world Data Science projects, Seaborn can save substantial manual work through its higher-level interface and sensible defaults. Start with Seaborn, customize with Matplotlib!