CodeBook Data Science Internship

Welcome to Your Internship! 🎉

Your Mission

Congratulations! You've been hired as a Data Scientist Intern at CodeBook – The Social Media for Coders. This Delhi-based company is offering you a ₹10 LPA job if you successfully complete this 1-month internship.

The Challenge

Your manager Puneet Kumar has assigned you to analyze CodeBook's user data using only pure Python – no pandas, NumPy, or fancy libraries allowed!

What You'll Build

Load & Display Data

Clean & Structure

People You May Know

Pages You Might Like

Understanding the Data Structure

The dataset contains three main components:

Users: Each user has an ID, name, a list of friends (by their IDs), and a list of liked pages (by their IDs)
Pages: Each page has an ID and a name
Connections: Users can have multiple friends and can like multiple pages

// Sample Data Structure (JSON format)
{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
    {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
  ],
  "pages": [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Dev Hub"}
  ]
}
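One property worth checking early is whether friendships in the sample are mutual, that is, whether each friend link appears in both users' lists. The sketch below is a minimal check on the users shown above; `friends_of` and `one_way` are illustrative names, not part of the assignment.

```python
# Users copied from the sample data structure above
users = [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
    {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
]

# Map each user ID to the set of friend IDs (illustrative helper name)
friends_of = {user["id"]: set(user["friends"]) for user in users}

# Collect links (a, b) where a lists b as a friend but b does not list a
one_way = [
    (a, b)
    for a, friend_ids in friends_of.items()
    for b in friend_ids
    if a not in friends_of.get(b, set())
]

print(one_way)  # [] -> every friendship in the sample is mutual
```

An empty list means the sample data is consistent; a real dataset may not be, which is one reason the cleaning task exists.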
                

Ready to Begin? 🎯

Navigate through the tabs above to complete each task and earn your ₹10 LPA offer letter!

Task 1: Load the User Data 📂

Your Assignment

Your manager has given you a dataset containing information about CodeBook users, their connections (friends), and the pages they have liked. Your job is to load and explore this data to understand its structure.

Steps to Complete

Save the JSON data in a file called codebook_data.json
Read the JSON file using Python's built-in modules
Print user details and their connections
Print available pages

Code Implementation

import json

# Load the JSON file
def load_data(filename):
    with open(filename, "r") as file:
        data = json.load(file)
        return data

# Display users and their connections
def display_users(data):
    print("Users and Their Connections:\n")
    for user in data["users"]:
        print(f"{user['name']} (ID: {user['id']}) - Friends: {user['friends']} - Liked Pages: {user['liked_pages']}")
    
    print("\nPages:\n")
    for page in data["pages"]:
        print(f"{page['id']}: {page['name']}")

# Load and display the data
data = load_data("codebook_data.json")
display_users(data)
                

Expected Output

Console Output:

Users and Their Connections:

Amit (ID: 1) - Friends: [2, 3] - Liked Pages: [101]
Priya (ID: 2) - Friends: [1, 4] - Liked Pages: [102]
Rahul (ID: 3) - Friends: [1] - Liked Pages: [101, 103]
Sara (ID: 4) - Friends: [2] - Liked Pages: [104]

Pages:

101: Python Developers
102: Data Science Enthusiasts
103: AI & ML Community
104: Web Dev Hub
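The output above shows friends and pages as raw IDs. As an optional variation, the sketch below resolves those IDs to names. `describe_user`, `names_by_id`, and `pages_by_id` are hypothetical names introduced here, and the data is a copy of the sample so the snippet runs without a file.

```python
# In-memory copy of the sample data
data = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
    ],
}

# Lookup tables from ID to display name
names_by_id = {user["id"]: user["name"] for user in data["users"]}
pages_by_id = {page["id"]: page["name"] for page in data["pages"]}

def describe_user(user, names_by_id, pages_by_id):
    """Return one display line with friend and page IDs replaced by names."""
    friends = ", ".join(
        names_by_id.get(fid, f"unknown user {fid}") for fid in user["friends"]
    )
    pages = ", ".join(
        pages_by_id.get(pid, f"unknown page {pid}") for pid in user["liked_pages"]
    )
    return (
        f"{user['name']} (ID: {user['id']}) - "
        f"Friends: {friends or 'none'} - Liked Pages: {pages or 'none'}"
    )

line = describe_user(data["users"][0], names_by_id, pages_by_id)
print(line)
# Amit (ID: 1) - Friends: Priya, Rahul - Liked Pages: Python Developers
```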

Manager's Feedback: "The data looks messy. Can you clean and structure it better?"

Task Completed! ✅

Move to the next tab to learn how to clean the data.

Task 2: Cleaning and Structuring the Data 🧹

The Problem

Your manager is impressed with your progress but points out that the data is messy. Before analyzing it effectively, we need to clean and structure the data properly.

Issues to Address

Missing Values: Some users have empty names
Duplicate Data: Duplicate friend entries in lists
Inactive Users: Users with neither friends nor liked pages
Inconsistent Data: Duplicate page IDs with different names

Example of Messy Data

{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]}, // Empty name!
    {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]}, // Duplicate friend!
    {"id": 5, "name": "Amit", "friends": [], "liked_pages": []} // Inactive user!
  ],
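  "pages": [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Dev Hub"},
    {"id": 104, "name": "Web Development"} // Duplicate page ID!
  ]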
}
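Before cleaning, it can help to confirm which records are affected by each issue. The sketch below is one possible audit of the messy example; `audit_data` is a hypothetical helper, not part of the assignment. The `//` annotations are left out because JSON itself does not allow comments.

```python
# Messy example as a Python dict (annotations removed)
messy = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}

def audit_data(data):
    """Report which user and page IDs are affected by each of the four issues."""
    seen_page_ids = set()
    duplicate_page_ids = set()
    for page in data["pages"]:
        if page["id"] in seen_page_ids:
            duplicate_page_ids.add(page["id"])
        seen_page_ids.add(page["id"])

    users = data["users"]
    return {
        "empty_names": [u["id"] for u in users if not u["name"].strip()],
        "duplicate_friends": [
            u["id"] for u in users if len(u["friends"]) != len(set(u["friends"]))
        ],
        "inactive": [
            u["id"] for u in users if not u["friends"] and not u["liked_pages"]
        ],
        "duplicate_page_ids": sorted(duplicate_page_ids),
    }

report = audit_data(messy)
print(report)
# {'empty_names': [3], 'duplicate_friends': [4], 'inactive': [5], 'duplicate_page_ids': [104]}
```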
                

Cleaning Strategy

Remove users with missing names
Remove duplicate friend entries
Remove inactive users (no friends and no liked pages)
Deduplicate pages based on IDs

Code Implementation

import json

def clean_data(data):
    # 1. Remove users with missing names
    data["users"] = [user for user in data["users"] if user["name"].strip()]
    
    # 2. Remove duplicate friends (dict.fromkeys keeps the original order,
    #    which a plain set() does not guarantee)
    for user in data["users"]:
        user["friends"] = list(dict.fromkeys(user["friends"]))
    
    # 3. Remove inactive users
    data["users"] = [user for user in data["users"] 
                      if user["friends"] or user["liked_pages"]]
    
    # 4. Remove duplicate pages
    unique_pages = {}
    for page in data["pages"]:
        unique_pages[page["id"]] = page
    data["pages"] = list(unique_pages.values())
    
    return data

# Load, clean, and save the cleaned data
# (with-blocks make sure both files are closed properly)
with open("codebook_data.json", "r") as file:
    data = json.load(file)
data = clean_data(data)
with open("cleaned_codebook_data.json", "w") as file:
    json.dump(data, file, indent=4)
print("Data cleaned successfully!")
                

Cleaned Data Output

After Cleaning:

{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
  ],
  "pages": [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Development"}
  ]
}

What Changed?

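✅ User with empty name (ID: 3) removed. ID 3 still appears in Amit's friends list, because the cleaning steps do not touch references to removed users
✅ Duplicate friend in Sara's list removed ([2, 2] → [2])
✅ Inactive user (ID: 5) removed
✅ Duplicate page (ID: 104) deduplicated; the last entry wins, so the name "Web Development" is kept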
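The four cleaning steps leave one thing untouched: IDs of removed users can remain in other users' friends lists (Amit still lists user 3 in the cleaned output above). The sketch below is an optional extra step, not part of the original cleaning strategy; `prune_dangling_references` is a hypothetical helper name.

```python
def prune_dangling_references(data):
    """Drop friend and page IDs that no longer point to an existing record."""
    valid_user_ids = {user["id"] for user in data["users"]}
    valid_page_ids = {page["id"] for page in data["pages"]}
    for user in data["users"]:
        user["friends"] = [f for f in user["friends"] if f in valid_user_ids]
        user["liked_pages"] = [p for p in user["liked_pages"] if p in valid_page_ids]
    return data

# Cleaned data as shown above (user 3 was removed, but Amit still lists them)
cleaned = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Development"},
    ],
}

pruned = prune_dangling_references(cleaned)
print(pruned["users"][0]["friends"])  # [2]
```

Whether to prune is a design choice: keeping the dangling ID preserves the raw record, while pruning means later code never has to handle a missing user.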

Data Cleaned Successfully! ✅

Your manager says: "Great! Now let's build a 'People You May Know' feature!"

Task 3: Finding "People You May Know" 👥

The Feature

Build a recommendation system that suggests potential friends based on mutual connections. This is a core feature of social networks that helps users expand their network!

Understanding the Logic

How It Works:

If User A and User B are not friends but have mutual friends, suggest User B to User A
More mutual friends = higher priority recommendation

Real Example

Scenario:

Amit (ID: 1) is friends with Priya (ID: 2) and Rahul (ID: 3)
Priya (ID: 2) is friends with Sara (ID: 4)
Amit is not directly friends with Sara, but they share Priya as a mutual friend
Recommendation: Suggest Sara to Amit as "People You May Know"
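The scenario above can be checked with a set intersection. This is a minimal sketch; `mutual_friends` and `friends_of` are illustrative names.

```python
# Friend lists from the scenario, keyed by user ID
friends_of = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}

def mutual_friends(a, b, friends_of):
    """Return the set of users who are friends with both a and b."""
    return friends_of.get(a, set()) & friends_of.get(b, set())

already_friends = 4 in friends_of[1]
shared = mutual_friends(1, 4, friends_of)

print(already_friends)  # False -> Amit and Sara are not directly connected
print(shared)           # {2}   -> Priya is their mutual friend
```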

Algorithm Steps

Find all direct friends of the target user
For each friend, find their friends (friends of friends)
Count how many mutual connections exist with each potential friend
Rank suggestions by number of mutual friends (highest first)

Code Implementation

import json

def load_data(filename):
    with open(filename, "r") as file:
        return json.load(file)

def find_people_you_may_know(user_id, data):
    # Create a mapping of user_id to their friends (as a set)
    user_friends = {}
    for user in data["users"]:
        user_friends[user["id"]] = set(user["friends"])
    
    # If user doesn't exist, return empty list
    if user_id not in user_friends:
        return []
    
    # Get direct friends of the user
    direct_friends = user_friends[user_id]
    suggestions = {}
    
    # For each direct friend
    for friend in direct_friends:
        # Look at all friends of this friend; a friend ID may point to a user
        # removed during cleaning, so fall back to an empty set
        for mutual in user_friends.get(friend, set()):
            # If mutual is not the user themselves and not already a friend
            if mutual != user_id and mutual not in direct_friends:
                # Count mutual friends (each shared connection adds 1)
                suggestions[mutual] = suggestions.get(mutual, 0) + 1
    
    # Sort by number of mutual friends (descending)
    sorted_suggestions = sorted(suggestions.items(), 
                                    key=lambda x: x[1], 
                                    reverse=True)
    
    # Return just the user IDs
    return [suggested_id for suggested_id, _ in sorted_suggestions]

# Load data and find recommendations
data = load_data("cleaned_codebook_data.json")
user_id = 1  # Example: Finding suggestions for Amit
recommendations = find_people_you_may_know(user_id, data)
print(f"People You May Know for User {user_id}: {recommendations}")
                

Expected Output

Console Output:

People You May Know for User 1: [4]

Explanation: Amit (User 1) should connect with Sara (User 4) because they share Priya as a mutual friend!

How the Algorithm Works:

For Amit (ID: 1):

Direct friends: [2, 3] (Priya and Rahul)
Priya's friends: [1, 4] → Sara (4) is a potential connection
Rahul's friends: [1] → No new suggestions
Result: Recommend Sara (User 4) to Amit

Real-World Application: The frontend developer of CodeBook can use this data via API and display it on Amit's profile when he logs in!
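The source does not describe the API itself, so the sketch below only illustrates what a JSON payload for the frontend might look like. The function name `build_suggestion_payload` and the field names are assumptions made for this example.

```python
import json

def build_suggestion_payload(user_id, suggested_ids, names_by_id):
    """Attach display names to suggested user IDs (hypothetical payload shape)."""
    return {
        "user_id": user_id,
        "suggestions": [
            {"id": sid, "name": names_by_id.get(sid, "Unknown")}
            for sid in suggested_ids
        ],
    }

names_by_id = {1: "Amit", 2: "Priya", 3: "Rahul", 4: "Sara"}
payload = build_suggestion_payload(1, [4], names_by_id)

print(json.dumps(payload))
# {"user_id": 1, "suggestions": [{"id": 4, "name": "Sara"}]}
```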

Ranking by Mutual Friends

When there are multiple suggestions:

If User A and User B share 5 mutual friends, while User A and User C share only 2 mutual friends, then User B gets ranked higher in the recommendation list.

More mutual connections = Stronger recommendation!
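To see the ranking in action, the sketch below uses a small made-up network in which one candidate shares three mutual friends with user 1 and two others share one each. `rank_suggestions` is an illustrative variant that returns the counts alongside the IDs and breaks ties by user ID; the tie-break rule is an assumption, not something the task specifies.

```python
def rank_suggestions(user_id, friends_of):
    """Return (candidate_id, mutual_count) pairs, strongest suggestion first."""
    direct = friends_of.get(user_id, set())
    counts = {}
    for friend in direct:
        for candidate in friends_of.get(friend, set()):
            if candidate != user_id and candidate not in direct:
                counts[candidate] = counts.get(candidate, 0) + 1
    # Sort by mutual-friend count (descending), then by ID (ascending) for ties
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical network, not taken from the CodeBook dataset
friends_of = {
    1: {2, 3, 4},
    2: {1, 10, 11},
    3: {1, 10},
    4: {1, 10, 12},
    10: {2, 3, 4},
    11: {2},
    12: {4},
}

ranked = rank_suggestions(1, friends_of)
print(ranked)  # [(10, 3), (11, 1), (12, 1)]
```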

Feature Completed! 🎉

Your manager is excited and says: "Great job! Next, find 'Pages You Might Like' based on connections and preferences."

Task 4: Finding "Pages You Might Like" 📄

Final Milestone!

You've reached the final milestone of your first data science project at CodeBook! After cleaning data and building "People You May Know", it's time to launch: Pages You Might Like

Why This Matters

Content Discovery

In real-world social networks, content discovery keeps users engaged. This feature simulates that experience using pure Python, showing how even simple logic can power impactful insights!

Understanding the Logic

How "Pages You Might Like" Works:

Users engage with pages (like, comment, share, etc.)
If two users have interacted with similar pages, they likely have common interests
For this implementation, we consider "liking a page" as an interaction
Pages followed by similar users should be recommended

Real Example

Scenario (illustrative; these page names and likes differ from the sample dataset):

Amit (ID: 1) likes: Python Hub (101) and AI World (102)
Priya (ID: 2) likes: AI World (102) and Data Science Daily (103)
Since Amit and Priya both like AI World (102), we suggest:

Data Science Daily (103) to Amit
Python Hub (101) to Priya
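The scenario above maps directly onto set operations. This is a minimal sketch using only the two users from the example:

```python
# Liked page IDs from the scenario
likes = {"Amit": {101, 102}, "Priya": {102, 103}}

# Pages both users like: the evidence that their interests overlap
shared = likes["Amit"] & likes["Priya"]

# Pages one user likes that the other has not liked yet
for_amit = likes["Priya"] - likes["Amit"]
for_priya = likes["Amit"] - likes["Priya"]

print(shared)     # {102}
print(for_amit)   # {103}
print(for_priya)  # {101}
```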

Collaborative Filtering Concept

The Core Principle:

"If two people like the same thing, maybe they'll like other things each one likes too."

This is a basic form of collaborative filtering, the idea behind the recommendation engines of platforms like Facebook, LinkedIn, and Netflix!

Algorithm Steps

Map users to pages they have interacted with
Identify pages liked by users with similar interests
Calculate similarity score (number of shared pages)
Rank recommendations based on common interactions

Code Implementation

import json

# Function to load JSON data from a file
def load_data(filename):
    with open(filename, "r") as file:
        return json.load(file)

# Function to find pages a user might like based on common interests
def find_pages_you_might_like(user_id, data):
    # Dictionary to store user interactions with pages
    user_pages = {}
    for user in data["users"]:
        user_pages[user["id"]] = set(user["liked_pages"])
    
    # If the user is not found, return an empty list
    if user_id not in user_pages:
        return []
    
    # Get pages liked by the target user
    user_liked_pages = user_pages[user_id]
    page_suggestions = {}
    
    # Iterate through all other users
    for other_user, pages in user_pages.items():
        if other_user != user_id:
            # Find shared pages between users
            shared_pages = user_liked_pages.intersection(pages)
            
            # Skip users with no pages in common: they are not "similar",
            # and their pages would otherwise be suggested with a score of 0
            if not shared_pages:
                continue
            
            # For each page the other user likes
            for page in pages:
                # If target user hasn't liked this page yet
                if page not in user_liked_pages:
                    # Add score based on number of shared pages
                    page_suggestions[page] = page_suggestions.get(page, 0) + len(shared_pages)
    
    # Sort recommended pages based on the number of shared interactions
    sorted_pages = sorted(page_suggestions.items(), 
                          key=lambda x: x[1], 
                          reverse=True)
    
    # Return just the page IDs
    return [page_id for page_id, _ in sorted_pages]

# Load data and find recommendations
data = load_data("cleaned_codebook_data.json")
user_id = 1  # Example: Finding recommendations for Amit
page_recommendations = find_pages_you_might_like(user_id, data)
print(f"Pages You Might Like for User {user_id}: {page_recommendations}")
                

Expected Output

Console Output:

Pages You Might Like for User 1: [103]

Explanation: With the sample data from Task 1, Rahul (User 3) shares 'Python Developers' (Page 101) with Amit and also likes 'AI & ML Community' (Page 103), so Page 103 is recommended to Amit!

Understanding the Similarity Score

How Scoring Works:

If Amit and Priya both like AI World, they share 1 common page
Every page Priya likes that Amit has not liked yet gets +1 in Amit's recommendations
If Amit and Rahul share 5 common pages, every page Rahul likes that Amit has not liked yet gets +5
Higher score = Higher priority recommendation
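The scoring rules above can be traced on a small made-up example in which two other users both contribute to the same page. `score_pages` is an illustrative variant that returns the scores along with the page IDs; the data is hypothetical and not from the CodeBook dataset.

```python
def score_pages(user_id, user_pages):
    """Return (page_id, score) pairs for pages the user has not liked yet."""
    liked = user_pages.get(user_id, set())
    scores = {}
    for other_id, pages in user_pages.items():
        if other_id == user_id:
            continue
        overlap = len(liked & pages)  # number of pages both users like
        if overlap == 0:
            continue                  # no common interest, no contribution
        for page in pages - liked:
            scores[page] = scores.get(page, 0) + overlap
    # Highest score first; ties broken by page ID (an assumption)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

user_pages = {
    1: {101, 102},            # target user
    2: {102, 103},            # shares 1 page with user 1
    3: {101, 102, 103, 104},  # shares 2 pages with user 1
}

scored = score_pages(1, user_pages)
print(scored)  # [(103, 3), (104, 2)]
```

Page 103 scores 3 because user 2 contributes 1 and user 3 contributes 2; page 104 is liked only by user 3, so it scores 2.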

Real-World Application

Complex Scenario:

If two users each like 10 pages and have at least one page in common:

Every page liked by one user but not by the other is recommended to the other user
The score = number of common pages they like
This helps prioritize recommendations in large networks
Similar to how Facebook, LinkedIn, and other platforms work!
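A quick numeric sketch of this scenario, using made-up page IDs chosen so that the two users overlap on four pages:

```python
# Two hypothetical users who each like 10 pages
user_a = set(range(1, 11))  # pages 1..10
user_b = set(range(7, 17))  # pages 7..16

shared = user_a & user_b    # pages both like
score = len(shared)         # score applied to every recommended page

recommend_to_a = sorted(user_b - user_a)
recommend_to_b = sorted(user_a - user_b)

print(score)           # 4
print(recommend_to_a)  # [11, 12, 13, 14, 15, 16]
print(recommend_to_b)  # [1, 2, 3, 4, 5, 6]
```

Pages the two users already share are never recommended, so each user receives six suggestions here, all with a score of 4.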

Key Concept: Collaborative Filtering

In Simple Terms:
If two users like some of the same pages, they likely share similar interests. So, pages liked by one can be recommended to the other.

Similarity Score: The number of pages they both like. Higher score = stronger connection = better recommendations!

🎉 CONGRATULATIONS! 🎉

You've Completed Your Internship!

What You've Accomplished:

✅ Loaded and displayed JSON data using pure Python
✅ Cleaned messy data and handled edge cases
✅ Built "People You May Know" recommendation feature
✅ Built "Pages You Might Like" recommendation feature
✅ Implemented collaborative filtering algorithms

Your ₹10 LPA Offer Letter is Ready! 💼

Your manager Puneet Kumar is impressed with your work!

What You've Learned

From loading and cleaning data to recommending people and pages, you've successfully completed your first end-to-end data project using raw Python.

Skills Gained:

Data manipulation and cleaning techniques
Algorithm design for recommendation systems
Collaborative filtering implementation
Social network analysis fundamentals
Pure Python programming without external libraries

🚀 Coders of Delhi
