Welcome to Your Internship!
Your Mission
Congratulations! You've been hired as a Data Scientist Intern at CodeBook, The Social Media for Coders. This Delhi-based company is offering you a ₹10 LPA job if you successfully complete this 1-month internship.
The Challenge
Your manager Puneet Kumar has assigned you to analyze CodeBook's user data using only pure Python: no pandas, NumPy, or fancy libraries allowed!
What You'll Build
Load & Display Data
Clean & Structure
People You May Know
Pages You Might Like
Understanding the Data Structure
The dataset contains three main components:
- Users: Each user has an ID, name, a list of friends (by their IDs), and a list of liked pages (by their IDs)
- Pages: Each page has an ID and a name
- Connections: Users can have multiple friends and can like multiple pages
{
"users": [
{"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
{"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
{"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
{"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
],
"pages": [
{"id": 101, "name": "Python Developers"},
{"id": 102, "name": "Data Science Enthusiasts"},
{"id": 103, "name": "AI & ML Community"},
{"id": 104, "name": "Web Dev Hub"}
]
}
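Because friends and liked pages are stored as IDs, it often helps to index users and pages by ID before doing anything else. The sketch below is one possible way to do that; `SAMPLE` and `index_by_id` are illustrative names, not part of the CodeBook dataset or the tasks.

```python
# Sample data copied from the structure shown above.
SAMPLE = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
    ],
}


def index_by_id(records):
    """Map each record's "id" to the record itself (hypothetical helper)."""
    return {record["id"]: record for record in records}


users_by_id = index_by_id(SAMPLE["users"])
pages_by_id = index_by_id(SAMPLE["pages"])

# Resolve Rahul's friend and page IDs to readable names.
rahul = users_by_id[3]
friend_names = [users_by_id[fid]["name"] for fid in rahul["friends"]]
page_names = [pages_by_id[pid]["name"] for pid in rahul["liked_pages"]]
print(friend_names)  # ['Amit']
print(page_names)    # ['Python Developers', 'AI & ML Community']
```

Dictionary lookups keep each ID resolution constant-time, which matters once the user list grows beyond a handful of entries.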
Ready to Begin?
Navigate through the tabs above to complete each task and earn your ₹10 LPA offer letter!
Task 1: Load the User Data
Your Assignment
Your manager has given you a dataset containing information about CodeBook users, their connections (friends), and the pages they have liked. Your job is to load and explore this data to understand its structure.
Steps to Complete
- Save the JSON data in a file called codebook_data.json
- Read the JSON file using Python's built-in modules
- Print user details and their connections
- Print available pages
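If you would rather create the file from Python than by hand, a sketch like the following writes a dataset with `json.dump` and reads it back with `json.load`. `save_data` is a hypothetical helper name, and the explicit `encoding="utf-8"` is an assumption about how you want the file stored; the task itself only requires that the file exists.

```python
import json


def save_data(data, filename):
    """Write the dataset to disk as indented JSON (hypothetical helper)."""
    with open(filename, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4)


def load_data(filename):
    """Read the dataset back using the built-in json module."""
    with open(filename, "r", encoding="utf-8") as file:
        return json.load(file)
```

Calling `save_data(sample, "codebook_data.json")` and then `load_data("codebook_data.json")` should give back a dictionary equal to `sample`.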
Code Implementation
import json

def load_data(filename):
    # Read and parse the JSON file into a Python dict
    with open(filename, "r") as file:
        data = json.load(file)
    return data

def display_users(data):
    # Print each user with their friend IDs and liked page IDs
    print("Users and Their Connections:\n")
    for user in data["users"]:
        print(f"{user['name']} (ID: {user['id']}) - Friends: {user['friends']} - Liked Pages: {user['liked_pages']}")
    # Print the ID and name of every page
    print("\nPages:\n")
    for page in data["pages"]:
        print(f"{page['id']}: {page['name']}")

data = load_data("codebook_data.json")
display_users(data)
Expected Output
Console Output:
Users and Their Connections:

Amit (ID: 1) - Friends: [2, 3] - Liked Pages: [101]
Priya (ID: 2) - Friends: [1, 4] - Liked Pages: [102]
Rahul (ID: 3) - Friends: [1] - Liked Pages: [101, 103]
Sara (ID: 4) - Friends: [2] - Liked Pages: [104]

Pages:

101: Python Developers
102: Data Science Enthusiasts
103: AI & ML Community
104: Web Dev Hub
Manager's Feedback: "The data looks messy. Can you clean and structure it better?"
Task Completed!
Move to the next tab to learn how to clean the data.
Task 2: Cleaning and Structuring the Data
The Problem
Your manager is impressed with your progress but points out that the data is messy. Before analyzing it effectively, we need to clean and structure the data properly.
Issues to Address
- Missing Values: Some users have empty names
- Duplicate Data: Duplicate friend entries in lists
- Inactive Users: Users with no connections or liked pages
- Inconsistent Data: Duplicate page IDs with different names
Example of Messy Data
{
"users": [
{"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
{"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
{"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
{"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
{"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
],
"pages": [
{"id": 104, "name": "Web Dev Hub"},
{"id": 104, "name": "Web Development"}
]
}
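Before changing anything, it can be useful to measure how much of each problem is present. The sketch below counts the four issues listed earlier against the messy sample; `audit_data`, `MESSY`, and the keys of the report are illustrative choices, not part of the original task.

```python
# Messy sample copied from the example above.
MESSY = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}


def audit_data(data):
    """Return the user/page IDs affected by each data-quality issue."""
    report = {"missing_name": [], "duplicate_friends": [],
              "inactive": [], "duplicate_page_ids": []}
    for user in data["users"]:
        if not user["name"].strip():
            report["missing_name"].append(user["id"])
        if len(user["friends"]) != len(set(user["friends"])):
            report["duplicate_friends"].append(user["id"])
        if not user["friends"] and not user["liked_pages"]:
            report["inactive"].append(user["id"])
    seen = set()
    for page in data["pages"]:
        # Record a page ID once, the first time it repeats
        if page["id"] in seen and page["id"] not in report["duplicate_page_ids"]:
            report["duplicate_page_ids"].append(page["id"])
        seen.add(page["id"])
    return report


print(audit_data(MESSY))
# {'missing_name': [3], 'duplicate_friends': [4], 'inactive': [5], 'duplicate_page_ids': [104]}
```

The function only reads the data, so it is safe to run before and after cleaning to compare the two reports.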
Cleaning Strategy
- Remove users with missing names
- Remove duplicate friend entries
- Remove inactive users (no friends and no liked pages)
- Deduplicate pages based on IDs
Code Implementation
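import json

def clean_data(data):
    # Remove users with missing or whitespace-only names
    data["users"] = [user for user in data["users"] if user["name"].strip()]
    # Remove duplicate friend IDs, keeping the original order
    for user in data["users"]:
        user["friends"] = list(dict.fromkeys(user["friends"]))
    # Remove inactive users (no friends and no liked pages)
    data["users"] = [user for user in data["users"]
                     if user["friends"] or user["liked_pages"]]
    # Deduplicate pages by ID; the last entry for an ID wins
    unique_pages = {}
    for page in data["pages"]:
        unique_pages[page["id"]] = page
    data["pages"] = list(unique_pages.values())
    return data

# Use "with" so both files are closed (and the output flushed) reliably
with open("codebook_data.json") as file:
    data = json.load(file)
data = clean_data(data)
with open("cleaned_codebook_data.json", "w") as file:
    json.dump(data, file, indent=4)
print("Data cleaned successfully!")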
Cleaned Data Output
After Cleaning:
{
"users": [
{"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
{"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
{"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
],
"pages": [
{"id": 101, "name": "Python Developers"},
{"id": 102, "name": "Data Science Enthusiasts"},
{"id": 103, "name": "AI & ML Community"},
{"id": 104, "name": "Web Development"}
]
}
What Changed?
- User with empty name (ID: 3) removed
- Duplicate friend in Sara's list removed (2, 2 → 2)
- Inactive user (ID: 5) removed
- Duplicate page (ID: 104) deduplicated
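One side effect worth noticing: in the cleaned output, Amit's friend list still contains ID 3 even though that user was removed. The tutorial's `clean_data` does not handle this. A possible extra step, sketched here under the assumption that dangling references should simply be dropped, could look like this (`drop_dangling_friends` is a hypothetical helper):

```python
def drop_dangling_friends(data):
    """Remove friend IDs that no longer match any user (assumed policy)."""
    valid_ids = {user["id"] for user in data["users"]}
    for user in data["users"]:
        user["friends"] = [fid for fid in user["friends"] if fid in valid_ids]
    return data


# Users copied from the cleaned output above; pages omitted for brevity.
cleaned = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [],
}
drop_dangling_friends(cleaned)
print(cleaned["users"][0]["friends"])  # [2]
```

Whether to drop such references or keep them for later recovery is a design decision; this sketch takes the simpler route.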
Data Cleaned Successfully!
Your manager says: "Great! Now let's build a 'People You May Know' feature!"
Task 3: Finding "People You May Know"
The Feature
Build a recommendation system that suggests potential friends based on mutual connections. This is a core feature of social networks that helps users expand their network!
Understanding the Logic
How It Works:
- If User A and User B are not friends but have mutual friends, suggest User B to User A
- More mutual friends = higher priority recommendation
Real Example
Scenario:
- Amit (ID: 1) is friends with Priya (ID: 2) and Rahul (ID: 3)
- Priya (ID: 2) is friends with Sara (ID: 4)
- Amit is not directly friends with Sara, but they share Priya as a mutual friend
- Recommendation: Suggest Sara to Amit as "People You May Know"
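The scenario boils down to two set operations: a membership test for "already friends" and an intersection for "mutual friends". A minimal sketch, using the friend lists from the sample dataset (the `friends` dictionary is an illustrative structure, not the tutorial's own):

```python
# Friend lists from the sample dataset, keyed by user ID.
friends = {
    1: {2, 3},  # Amit
    2: {1, 4},  # Priya
    3: {1},     # Rahul
    4: {2},     # Sara
}

amit, sara = 1, 4
already_friends = sara in friends[amit]
mutual_friends = friends[amit] & friends[sara]  # set intersection

print(already_friends)  # False
print(mutual_friends)   # {2}
```

Since the two are not yet friends and share one mutual friend (Priya, ID 2), Sara qualifies as a suggestion for Amit.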
Algorithm Steps
- Find all direct friends of the target user
- For each friend, find their friends (friends of friends)
- Count how many mutual connections exist with each potential friend
- Rank suggestions by number of mutual friends (highest first)
Code Implementation
import json

def load_data(filename):
    with open(filename, "r") as file:
        return json.load(file)

def find_people_you_may_know(user_id, data):
    # Map each user ID to the set of that user's friend IDs
    user_friends = {}
    for user in data["users"]:
        user_friends[user["id"]] = set(user["friends"])
    if user_id not in user_friends:
        return []
    direct_friends = user_friends[user_id]
    # Count mutual friends for every friend-of-a-friend
    suggestions = {}
    for friend in direct_friends:
        # .get() skips friend IDs whose user record was removed during cleaning
        for mutual in user_friends.get(friend, set()):
            if mutual != user_id and mutual not in direct_friends:
                suggestions[mutual] = suggestions.get(mutual, 0) + 1
    # Rank candidates by mutual-friend count, highest first
    sorted_suggestions = sorted(suggestions.items(),
                                key=lambda x: x[1],
                                reverse=True)
    return [candidate_id for candidate_id, _ in sorted_suggestions]

data = load_data("cleaned_codebook_data.json")
user_id = 1
recommendations = find_people_you_may_know(user_id, data)
print(f"People You May Know for User {user_id}: {recommendations}")
Expected Output
Console Output:
People You May Know for User 1: [4]
Explanation: Amit (User 1) should connect with Sara (User 4) because they share Priya as a mutual friend!
How the Algorithm Works:
For Amit (ID: 1), using the Task 1 dataset:
- Direct friends: [2, 3] (Priya and Rahul)
- Priya's friends: [1, 4] → Sara (4) is a potential connection
- Rahul's friends: [1] → No new suggestions
- Result: Recommend Sara (User 4) to Amit
Real-World Application: The frontend developer of CodeBook can use this data via API and display it on Amit's profile when he logs in!
Ranking by Mutual Friends
When there are multiple suggestions:
If User A and User B share 5 mutual friends, while User A and User C share only 2 mutual friends, then User B gets ranked higher in the recommendation list.
More mutual connections = Stronger recommendation!
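As a small illustration of the ranking rule, the sketch below sorts candidates by mutual-friend count using `collections.Counter` from the standard library. The counts for User B and User C come from the example above; User D and its count are made up to show a three-way ordering.

```python
from collections import Counter

# Mutual-friend counts for User A's candidates (User D is hypothetical).
mutual_counts = Counter({"User C": 2, "User B": 5, "User D": 3})

# most_common() returns (item, count) pairs sorted by count, highest first.
ranked = [user for user, _ in mutual_counts.most_common()]
print(ranked)  # ['User B', 'User D', 'User C']
```

A plain dictionary with `sorted(..., reverse=True)` gives the same result; `Counter` just packages the counting and sorting together.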
Feature Completed!
Your manager is excited and says: "Great job! Next, find 'Pages You Might Like' based on connections and preferences."
Task 4: Finding "Pages You Might Like"
Final Milestone!
You've reached the final milestone of your first data science project at CodeBook! After cleaning data and building "People You May Know", it's time to launch: Pages You Might Like
Why This Matters
Content Discovery
In real-world social networks, content discovery keeps users engaged. This feature simulates that experience using pure Python, showing how even simple logic can power impactful insights!
Understanding the Logic
How "Pages You Might Like" Works:
- Users engage with pages (like, comment, share, etc.)
- If two users have interacted with similar pages, they likely have common interests
- For this implementation, we consider "liking a page" as an interaction
- Pages followed by similar users should be recommended
Real Example
Scenario:
- Amit (ID: 1) likes: Python Hub (101) and AI World (102)
- Priya (ID: 2) likes: AI World (102) and Data Science Daily (103)
- Since Amit and Priya both like AI World (102), we suggest:
- Data Science Daily (103) to Amit
- Python Hub (101) to Priya
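The scenario can be checked with set intersection and set difference. This is a minimal sketch; the `likes` dictionary is an illustrative structure keyed by name, using the page IDs given in the scenario.

```python
# Liked pages from the scenario above.
likes = {
    "Amit": {101, 102},   # Python Hub, AI World
    "Priya": {102, 103},  # AI World, Data Science Daily
}

shared = likes["Amit"] & likes["Priya"]     # pages both like
for_amit = likes["Priya"] - likes["Amit"]   # Priya's pages that are new to Amit
for_priya = likes["Amit"] - likes["Priya"]  # Amit's pages that are new to Priya

print(shared)     # {102}
print(for_amit)   # {103}
print(for_priya)  # {101}
```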
Collaborative Filtering Concept
The Core Principle:
"If two people like the same thing, maybe they'll like other things each one likes too."
This is a basic form of the collaborative filtering idea behind recommendation engines on platforms like Facebook, LinkedIn, and Netflix!
Algorithm Steps
- Map users to pages they have interacted with
- Identify pages liked by users with similar interests
- Calculate similarity score (number of shared pages)
- Rank recommendations based on common interactions
Code Implementation
import json

def load_data(filename):
    with open(filename, "r") as file:
        return json.load(file)

def find_pages_you_might_like(user_id, data):
    # Map each user ID to the set of page IDs that user likes
    user_pages = {}
    for user in data["users"]:
        user_pages[user["id"]] = set(user["liked_pages"])
    if user_id not in user_pages:
        return []
    user_liked_pages = user_pages[user_id]
    page_suggestions = {}
    for other_user, pages in user_pages.items():
        if other_user != user_id:
            shared_pages = user_liked_pages.intersection(pages)
            # Skip users with no pages in common; they add no evidence
            if not shared_pages:
                continue
            # Score each new page by the overlap with this user
            for page in pages:
                if page not in user_liked_pages:
                    page_suggestions[page] = page_suggestions.get(page, 0) + len(shared_pages)
    # Rank pages by accumulated score, highest first
    sorted_pages = sorted(page_suggestions.items(),
                          key=lambda x: x[1],
                          reverse=True)
    return [page_id for page_id, _ in sorted_pages]

data = load_data("cleaned_codebook_data.json")
user_id = 1
page_recommendations = find_pages_you_might_like(user_id, data)
print(f"Pages You Might Like for User {user_id}: {page_recommendations}")
Expected Output
Console Output:
Pages You Might Like for User 1: [103]
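Explanation: Amit should be interested in 'AI & ML Community' (Page 103) because Rahul, who shares Amit's interest in 'Python Developers' (Page 101), also likes it. This assumes Rahul (ID: 3) is present in the loaded data, as in the Task 1 dataset.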
Understanding the Similarity Score
How Scoring Works:
- If Amit and Priya both like AI World, they share 1 common page
- Any page Priya likes that Amit has not already liked gets a score of +1 in Amit's recommendations
- If Amit and Rahul share 5 common pages, any other page Rahul likes gets a +5 score
- Higher score = Higher priority recommendation
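To see how scores accumulate when several users point to the same page, here is a small sketch. The `likes` data is made up for illustration, `score_pages` is a hypothetical helper, and skipping users with zero overlap is an assumption about the intended behavior.

```python
def score_pages(target, likes):
    """Score pages the target has not liked by overlap with each other user."""
    scores = {}
    for other, pages in likes.items():
        if other == target:
            continue
        overlap = len(likes[target] & pages)
        if overlap == 0:
            continue  # no shared interest, so no evidence to recommend
        for page in pages - likes[target]:
            scores[page] = scores.get(page, 0) + overlap
    return scores


# Made-up likes: Priya shares 1 page with Amit, Rahul shares 2.
likes = {
    "Amit": {101, 102, 105},
    "Priya": {102, 103},
    "Rahul": {101, 105, 103, 104},
}
print(score_pages("Amit", likes))  # {103: 3, 104: 2}
```

Page 103 receives +1 from Priya and +2 from Rahul, so it outranks page 104, which only Rahul contributes to.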
Real-World Application
Complex Scenario:
If two users each like 10 pages and some of those pages overlap:
- Each page one user likes that the other has not yet liked is recommended to the other
- The score = number of common pages they like
- This helps prioritize recommendations in large networks
- Similar to how Facebook, LinkedIn, and other platforms work!
Key Concept: Collaborative Filtering
In Simple Terms:
If two users like some of the same pages, they likely share similar interests. So, pages liked by one can be recommended to the other.
Similarity Score: The number of pages they both like. Higher score = stronger connection = better recommendations!
CONGRATULATIONS!
You've Completed Your Internship!
What You've Accomplished:
- Loaded and displayed JSON data using pure Python
- Cleaned messy data and handled edge cases
- Built "People You May Know" recommendation feature
- Built "Pages You Might Like" recommendation feature
- Implemented collaborative filtering algorithms
Your ₹10 LPA Offer Letter is Ready!
Your manager Puneet Kumar is impressed with your work!
What You've Learned
From loading and cleaning data to recommending people and pages, you've successfully completed your first end-to-end data project using raw Python.
Skills Gained:
- Data manipulation and cleaning techniques
- Algorithm design for recommendation systems
- Collaborative filtering implementation
- Social network analysis fundamentals
- Pure Python programming without external libraries