🚀 Coders of Delhi

Data Science Internship at CodeBook

₹10 LPA Job Offer Awaits! 💰

Welcome to Your Internship! 🎉

Your Mission

Congratulations! You've been hired as a Data Scientist Intern at CodeBook: The Social Media for Coders. This Delhi-based company is offering you a ₹10 LPA job if you successfully complete this 1-month internship.

The Challenge

Your manager Puneet Kumar has assigned you to analyze CodeBook's user data using only pure Python: no pandas, NumPy, or fancy libraries allowed!

What You'll Build

  1. Load & Display Data
  2. Clean & Structure
  3. People You May Know
  4. Pages You Might Like

Understanding the Data Structure

The dataset contains three main components:

  1. Users: Each user has an ID, name, a list of friends (by their IDs), and a list of liked pages (by their IDs)
  2. Pages: Each page has an ID and a name
  3. Connections: Users can have multiple friends and can like multiple pages

Sample Data Structure (JSON format):

{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
    {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
  ],
  "pages": [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Dev Hub"}
  ]
}
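
Because friends and liked pages are stored as IDs, a common first step is to build lookup dictionaries that turn an ID back into a name. The snippet below is a minimal sketch using the sample data above; the names `user_names`, `page_names`, and `describe_user` are illustrative and not part of the original project.

```python
# Sample data copied from the structure shown above
data = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
    ],
}

# Hypothetical lookup tables: ID -> name
user_names = {user["id"]: user["name"] for user in data["users"]}
page_names = {page["id"]: page["name"] for page in data["pages"]}


def describe_user(user):
    """Return one line with friend and page IDs resolved to names."""
    friends = [user_names[friend_id] for friend_id in user["friends"]]
    pages = [page_names[page_id] for page_id in user["liked_pages"]]
    return f"{user['name']}: friends={friends}, pages={pages}"


for user in data["users"]:
    print(describe_user(user))
```

Dictionaries give constant-time lookups by ID, which keeps later steps simple without needing any external library.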

Ready to Begin? 🎯

Work through the four tasks below to earn your ₹10 LPA offer letter!

Task 1: Load the User Data 📂

Your Assignment

Your manager has given you a dataset containing information about CodeBook users, their connections (friends), and the pages they have liked. Your job is to load and explore this data to understand its structure.

Steps to Complete

  1. Save the JSON data in a file called codebook_data.json
  2. Read the JSON file using Python's built-in modules
  3. Print user details and their connections
  4. Print available pages

Code Implementation

import json

# Load the JSON file
def load_data(filename):
    with open(filename, "r") as file:
        data = json.load(file)
    return data

# Display users and their connections
def display_users(data):
    print("Users and Their Connections:\n")
    for user in data["users"]:
        print(f"{user['name']} (ID: {user['id']}) - Friends: {user['friends']} - Liked Pages: {user['liked_pages']}")
    print("\nPages:\n")
    for page in data["pages"]:
        print(f"{page['id']}: {page['name']}")

# Load and display the data
data = load_data("codebook_data.json")
display_users(data)

Expected Output

Console Output:

Users and Their Connections:

Amit (ID: 1) - Friends: [2, 3] - Liked Pages: [101]
Priya (ID: 2) - Friends: [1, 4] - Liked Pages: [102]
Rahul (ID: 3) - Friends: [1] - Liked Pages: [101, 103]
Sara (ID: 4) - Friends: [2] - Liked Pages: [104]

Pages:

101: Python Developers
102: Data Science Enthusiasts
103: AI & ML Community
104: Web Dev Hub

Manager's Feedback: "The data looks messy. Can you clean and structure it better?"

Task Completed! ✅

Move on to Task 2 to learn how to clean the data.

Task 2: Cleaning and Structuring the Data 🧹

The Problem

Your manager is impressed with your progress but points out that the data is messy. Before analyzing it effectively, we need to clean and structure the data properly.

Issues to Address

  • Missing Values: Some users have empty names
  • Duplicate Data: Duplicate friend entries in lists
  • Inactive Users: Users with no connections or liked pages
  • Inconsistent Data: Duplicate page IDs with different names

Example of Messy Data

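{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},    // Empty name!
    {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},  // Duplicate friend!
    {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}          // Inactive user!
  ],
  "pages": [
    {"id": 104, "name": "Web Dev Hub"},
    {"id": 104, "name": "Web Development"}  // Duplicate page ID!
  ]
}

Note: the // annotations are for illustration only. JSON does not allow comments, so remove them before saving this to a file. Only the conflicting page entries are shown here; pages 101 to 103 from the sample dataset appear unchanged in the cleaned output.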
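
Before changing anything, it can help to measure how messy the data is. The following is a minimal sketch of an audit pass over the messy example above; the function name `audit` and the shape of its report are assumptions made for illustration, not part of the original assignment.

```python
# Messy example from above (annotations removed so it is valid data)
messy = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}


def audit(data):
    """Report which IDs are affected by each of the four issues."""
    report = {
        "empty_names": [],
        "duplicate_friends": [],
        "inactive": [],
        "duplicate_page_ids": [],
    }
    for user in data["users"]:
        if not user["name"].strip():
            report["empty_names"].append(user["id"])
        if len(user["friends"]) != len(set(user["friends"])):
            report["duplicate_friends"].append(user["id"])
        if not user["friends"] and not user["liked_pages"]:
            report["inactive"].append(user["id"])

    seen_page_ids = set()
    for page in data["pages"]:
        page_id = page["id"]
        if page_id in seen_page_ids and page_id not in report["duplicate_page_ids"]:
            report["duplicate_page_ids"].append(page_id)
        seen_page_ids.add(page_id)
    return report


report = audit(messy)
for issue, ids in report.items():
    print(f"{issue}: {ids}")
```

The audit only reads the data, so it can be run before and after cleaning to confirm that every list in the report ends up empty.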

Cleaning Strategy

  1. Remove users with missing names
  2. Remove duplicate friend entries
  3. Remove inactive users (no friends and no liked pages)
  4. Deduplicate pages based on IDs

Code Implementation

import json

def clean_data(data):
    # 1. Remove users with missing names
    data["users"] = [user for user in data["users"] if user["name"].strip()]

    # 2. Remove duplicate friends (dict.fromkeys keeps the original order, unlike set)
    for user in data["users"]:
        user["friends"] = list(dict.fromkeys(user["friends"]))

    # 3. Remove inactive users (no friends and no liked pages)
    data["users"] = [user for user in data["users"] if user["friends"] or user["liked_pages"]]

    # 4. Remove duplicate pages (the last entry for a given ID wins)
    unique_pages = {}
    for page in data["pages"]:
        unique_pages[page["id"]] = page
    data["pages"] = list(unique_pages.values())

    return data

# Load, clean, and save the cleaned data
with open("codebook_data.json", "r") as file:
    data = json.load(file)

data = clean_data(data)

with open("cleaned_codebook_data.json", "w") as file:
    json.dump(data, file, indent=4)

print("Data cleaned successfully!")

Cleaned Data Output

After Cleaning:

{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
  ],
  "pages": [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Development"}
  ]
}

What Changed?

  • ✅ User with empty name (ID: 3) removed
  • ✅ Duplicate friend in Sara's list removed (2, 2 → 2)
  • ✅ Inactive user (ID: 5) removed
  • ✅ Duplicate page (ID: 104) deduplicated; the last entry, "Web Development", is kept
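
One side effect is visible in the cleaned output: Amit's friends list still contains ID 3, even though that user was removed for having an empty name. The cleaning strategy above does not cover this case. The snippet below is a minimal sketch of one possible extra step; `prune_dangling_friends` is a hypothetical helper, and whether to drop such references is a design decision for your own pipeline.

```python
# Users exactly as they appear in the cleaned output above
cleaned = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ]
}


def prune_dangling_friends(data):
    """Drop friend IDs that no longer match any user in the dataset."""
    valid_ids = {user["id"] for user in data["users"]}
    for user in data["users"]:
        user["friends"] = [f for f in user["friends"] if f in valid_ids]
    return data


pruned = prune_dangling_friends(cleaned)
for user in pruned["users"]:
    print(user["name"], user["friends"])
```

If this step is used, it is worth running the inactive-user check again afterwards, since pruning can leave a user with no friends and no liked pages.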

Data Cleaned Successfully! ✅

Your manager says: "Great! Now let's build a 'People You May Know' feature!"

Task 3: Finding "People You May Know" 👥

The Feature

Build a recommendation system that suggests potential friends based on mutual connections. This is a core feature of social networks that helps users expand their network!

Understanding the Logic

How It Works:

  • If User A and User B are not friends but have mutual friends, suggest User B to User A
  • More mutual friends = higher priority recommendation

Real Example

Scenario:

  • Amit (ID: 1) is friends with Priya (ID: 2) and Rahul (ID: 3)
  • Priya (ID: 2) is friends with Sara (ID: 4)
  • Amit is not directly friends with Sara, but they share Priya as a mutual friend
  • Recommendation: Suggest Sara to Amit as "People You May Know"
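
The scenario above boils down to a set intersection. Here is a minimal sketch, assuming friendships are stored as a dictionary of sets keyed by user ID; the names `friends` and `mutual_friends` are illustrative.

```python
# Friend lists from the sample dataset, stored as sets for fast lookups
friends = {
    1: {2, 3},  # Amit
    2: {1, 4},  # Priya
    3: {1},     # Rahul
    4: {2},     # Sara
}


def mutual_friends(user_a, user_b, friends):
    """Return the IDs that appear in both users' friend sets."""
    return friends[user_a] & friends[user_b]


already_friends = 4 in friends[1]
shared = mutual_friends(1, 4, friends)

print("Amit and Sara already friends?", already_friends)
print("Mutual friends of Amit and Sara:", shared)
```

Amit and Sara are not friends, and the intersection contains Priya (ID 2), so Sara qualifies as a suggestion for Amit.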

Algorithm Steps

  1. Find all direct friends of the target user
  2. For each friend, find their friends (friends of friends)
  3. Count how many mutual connections exist with each potential friend
  4. Rank suggestions by number of mutual friends (highest first)

Code Implementation

import json

def load_data(filename):
    with open(filename, "r") as file:
        return json.load(file)

def find_people_you_may_know(user_id, data):
    # Create a mapping of user_id to their friends (as a set)
    user_friends = {}
    for user in data["users"]:
        user_friends[user["id"]] = set(user["friends"])

    # If user doesn't exist, return empty list
    if user_id not in user_friends:
        return []

    # Get direct friends of the user
    direct_friends = user_friends[user_id]
    suggestions = {}

    # For each direct friend
    for friend in direct_friends:
        # Look at all friends of this friend. Use .get() so that a friend ID
        # pointing to a user removed during cleaning does not raise a KeyError.
        for mutual in user_friends.get(friend, set()):
            # If mutual is not the user themselves and not already a friend
            if mutual != user_id and mutual not in direct_friends:
                # Count mutual friends (each shared connection adds 1)
                suggestions[mutual] = suggestions.get(mutual, 0) + 1

    # Sort by number of mutual friends (descending)
    sorted_suggestions = sorted(suggestions.items(), key=lambda x: x[1], reverse=True)

    # Return just the user IDs
    return [candidate_id for candidate_id, _ in sorted_suggestions]

# Load data and find recommendations
data = load_data("cleaned_codebook_data.json")
user_id = 1  # Example: Finding suggestions for Amit
recommendations = find_people_you_may_know(user_id, data)
print(f"People You May Know for User {user_id}: {recommendations}")

Expected Output

Console Output:

People You May Know for User 1: [4]

Explanation: Amit (User 1) should connect with Sara (User 4) because they share Priya as a mutual friend!

How the Algorithm Works:

For Amit (ID: 1):

  • Direct friends: [2, 3] (Priya and Rahul)
  • Priya's friends: [1, 4] → Sara (4) is a potential connection
  • Rahul's friends: [1] → No new suggestions
  • Result: Recommend Sara (User 4) to Amit

Real-World Application: The frontend developer of CodeBook can use this data via API and display it on Amit's profile when he logs in!
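
A frontend usually needs names rather than bare IDs. The sketch below shows one way the recommendation list could be turned into a JSON string. The payload shape, its field names, and the helper `build_payload` are assumptions for illustration only; the source does not describe CodeBook's actual API.

```python
import json

# Only IDs and names are needed for this step
users = [
    {"id": 1, "name": "Amit"},
    {"id": 2, "name": "Priya"},
    {"id": 3, "name": "Rahul"},
    {"id": 4, "name": "Sara"},
]


def build_payload(recommended_ids, users):
    """Resolve recommended IDs to {id, name} records, skipping unknown IDs."""
    names = {user["id"]: user["name"] for user in users}
    return [
        {"id": user_id, "name": names[user_id]}
        for user_id in recommended_ids
        if user_id in names
    ]


# [4] is the recommendation list computed for Amit (User 1)
payload = build_payload([4], users)
print(json.dumps({"user_id": 1, "people_you_may_know": payload}))
```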

Ranking by Mutual Friends

When there are multiple suggestions:

If User A and User B share 5 mutual friends, while User A and User C share only 2 mutual friends, then User B gets ranked higher in the recommendation list.

More mutual connections = Stronger recommendation!
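
The ranking rule can be checked on a small made-up network. In the sketch below, User 1 plays the role of User A, User 2 is User B (5 mutual friends), and User 3 is User C (2 mutual friends). The network and the function name `rank_suggestions` are hypothetical, and ties are broken by the lower user ID, which is an added assumption rather than something the task specifies.

```python
# Hypothetical network: users 10-14 are User 1's direct friends
friends = {
    1: {10, 11, 12, 13, 14},
    2: {10, 11, 12, 13, 14},
    3: {10, 11},
    10: {1, 2, 3},
    11: {1, 2, 3},
    12: {1, 2},
    13: {1, 2},
    14: {1, 2},
}


def rank_suggestions(user_id, friends):
    """Return (candidate_id, mutual_count) pairs, strongest first."""
    direct = friends[user_id]
    counts = {}
    for friend in direct:
        for candidate in friends.get(friend, set()):
            if candidate != user_id and candidate not in direct:
                counts[candidate] = counts.get(candidate, 0) + 1
    # Sort by mutual count (descending), then by ID for a stable order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_suggestions(1, friends)
print(ranking)
```

User 2 comes first with 5 mutual friends, followed by User 3 with 2.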

Feature Completed! 🎉

Your manager is excited and says: "Great job! Next, find 'Pages You Might Like' based on connections and preferences."

Task 4: Finding "Pages You Might Like" 📄

Final Milestone!

You've reached the final milestone of your first data science project at CodeBook! After cleaning the data and building "People You May Know", it's time to launch "Pages You Might Like".

Why This Matters

Content Discovery

In real-world social networks, content discovery keeps users engaged. This feature simulates that experience using pure Python, showing how even simple logic can power impactful insights!

Understanding the Logic

How "Pages You Might Like" Works:

  • Users engage with pages (like, comment, share, etc.)
  • If two users have interacted with similar pages, they likely have common interests
  • For this implementation, we consider "liking a page" as an interaction
  • Pages followed by similar users should be recommended

Real Example

Scenario (hypothetical likes and page names, separate from the sample dataset):

  • Amit (ID: 1) likes: Python Hub (101) and AI World (102)
  • Priya (ID: 2) likes: AI World (102) and Data Science Daily (103)
  • Since Amit and Priya both like AI World (102), we suggest:
    • Data Science Daily (103) to Amit
    • Python Hub (101) to Priya
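
The scenario above can be reproduced with two set operations: intersection to test for a shared interest, and difference to find pages the target user has not liked yet. This is a minimal sketch; `likes` and `suggest_pages` are illustrative names, and the likes are the hypothetical ones from the scenario.

```python
# Likes from the scenario: user ID -> set of liked page IDs
likes = {
    1: {101, 102},  # Amit: Python Hub, AI World
    2: {102, 103},  # Priya: AI World, Data Science Daily
}


def suggest_pages(user_id, likes):
    """Suggest pages liked by users who share at least one page with user_id."""
    mine = likes[user_id]
    suggestions = set()
    for other_id, pages in likes.items():
        if other_id != user_id and mine & pages:
            suggestions |= pages - mine
    return sorted(suggestions)


for_amit = suggest_pages(1, likes)
for_priya = suggest_pages(2, likes)
print("Suggestions for Amit:", for_amit)
print("Suggestions for Priya:", for_priya)
```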

Collaborative Filtering Concept

The Core Principle:

"If two people like the same thing, maybe they'll like other things each one likes too."

This is a basic form of collaborative filtering, the idea behind recommendation engines on platforms like Facebook, LinkedIn, and Netflix!

Algorithm Steps

  1. Map users to pages they have interacted with
  2. Identify pages liked by users with similar interests
  3. Calculate similarity score (number of shared pages)
  4. Rank recommendations based on common interactions

Code Implementation

import json

# Function to load JSON data from a file
def load_data(filename):
    with open(filename, "r") as file:
        return json.load(file)

# Function to find pages a user might like based on common interests
def find_pages_you_might_like(user_id, data):
    # Dictionary to store user interactions with pages
    user_pages = {}
    for user in data["users"]:
        user_pages[user["id"]] = set(user["liked_pages"])

    # If the user is not found, return an empty list
    if user_id not in user_pages:
        return []

    # Get pages liked by the target user
    user_liked_pages = user_pages[user_id]
    page_suggestions = {}

    # Iterate through all other users
    for other_user, pages in user_pages.items():
        if other_user == user_id:
            continue

        # Find shared pages between users
        shared_pages = user_liked_pages.intersection(pages)

        # Skip users with nothing in common; otherwise their pages would be
        # recommended with a score of 0
        if not shared_pages:
            continue

        # For each page the other user likes that the target user hasn't liked yet
        for page in pages - user_liked_pages:
            # Add score based on number of shared pages
            page_suggestions[page] = page_suggestions.get(page, 0) + len(shared_pages)

    # Sort recommended pages based on the number of shared interactions
    sorted_pages = sorted(page_suggestions.items(), key=lambda x: x[1], reverse=True)

    # Return just the page IDs
    return [page_id for page_id, _ in sorted_pages]

# Load data and find recommendations
data = load_data("cleaned_codebook_data.json")
user_id = 1  # Example: Finding recommendations for Amit
page_recommendations = find_pages_you_might_like(user_id, data)
print(f"Pages You Might Like for User {user_id}: {page_recommendations}")

Expected Output

Console Output:

Pages You Might Like for User 1: [103]

Explanation: Amit should be interested in 'AI & ML Community' (Page 103) because Rahul, who also likes 'Python Developers' (Page 101), likes it!

Understanding the Similarity Score

How Scoring Works:

  • If Amit and Priya both like AI World, they share 1 common page
  • Any page Priya likes gets a score of +1 for Amit's recommendations
  • If Amit and Rahul share 5 common pages, any page Rahul likes gets +5 score
  • Higher score = Higher priority recommendation
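
Scores add up when several similar users like the same page. The sketch below uses made-up likes (not the sample dataset) to show the accumulation; `score_pages` is an illustrative name, and it assumes that users with no shared pages contribute nothing.

```python
# Hypothetical likes: user ID -> set of liked page IDs
user_pages = {
    1: {101, 102, 103},            # Amit
    2: {102, 104},                 # Priya shares 1 page with Amit
    3: {101, 102, 103, 104, 105},  # Rahul shares 3 pages with Amit
}


def score_pages(user_id, user_pages):
    """Score each unseen page by the summed overlap of the users who like it."""
    mine = user_pages[user_id]
    scores = {}
    for other_id, pages in user_pages.items():
        if other_id == user_id:
            continue
        shared = mine & pages
        if not shared:
            continue  # nothing in common, so no signal from this user
        for page in pages - mine:
            scores[page] = scores.get(page, 0) + len(shared)
    return scores


scores = score_pages(1, user_pages)
ranked = sorted(scores, key=lambda page: (-scores[page], page))
print("Scores:", scores)
print("Ranked pages:", ranked)
```

Page 104 gets +1 from Priya and +3 from Rahul for a total of 4, while page 105 gets only Rahul's +3, so 104 ranks first.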

Real-World Application

Complex Scenario:

If two users each like 10 pages:

  • Every page liked by one user but not by the other becomes a recommendation candidate for the other
  • The score = number of common pages they like
  • This helps prioritize recommendations in large networks
  • Similar to how Facebook, LinkedIn, and other platforms work!
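
To make the ten-page scenario concrete, the sketch below assumes the two users have 4 pages in common. That overlap, the page IDs, and the helper name `score_pages` are illustrative assumptions; the source does not state how many pages the users share.

```python
# Two hypothetical users, 10 pages each, overlapping on pages 7-10
user_pages = {
    1: set(range(1, 11)),   # pages 1 to 10
    2: set(range(7, 17)),   # pages 7 to 16
}


def score_pages(user_id, user_pages):
    """Score unseen pages by the number of pages shared with their fans."""
    mine = user_pages[user_id]
    scores = {}
    for other_id, pages in user_pages.items():
        if other_id == user_id:
            continue
        shared = mine & pages
        if not shared:
            continue
        for page in pages - mine:
            scores[page] = scores.get(page, 0) + len(shared)
    return scores


scores_for_1 = score_pages(1, user_pages)
scores_for_2 = score_pages(2, user_pages)
print("User 1 gets:", sorted(scores_for_1.items()))
print("User 2 gets:", sorted(scores_for_2.items()))
```

Each user receives the six pages they have not liked yet, all with a score of 4, the size of the overlap.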

Key Concept: Collaborative Filtering

In Simple Terms:
If two users like some of the same pages, they likely share similar interests. So, pages liked by one can be recommended to the other.

Similarity Score: The number of pages they both like. Higher score = stronger connection = better recommendations!

🎉 CONGRATULATIONS! 🎉

You've Completed Your Internship!

What You've Accomplished:

  • ✅ Loaded and displayed JSON data using pure Python
  • ✅ Cleaned messy data and handled edge cases
  • ✅ Built "People You May Know" recommendation feature
  • ✅ Built "Pages You Might Like" recommendation feature
  • ✅ Implemented collaborative filtering algorithms

Your ₹10 LPA Offer Letter is Ready! 💼

Your manager Puneet Kumar is impressed with your work!

What You've Learned

From loading and cleaning data to recommending people and pages, you've successfully completed your first end-to-end data project using raw Python.

Skills Gained:

  • Data manipulation and cleaning techniques
  • Algorithm design for recommendation systems
  • Collaborative filtering implementation
  • Social network analysis fundamentals
  • Pure Python programming without external libraries