Petre Leonard Macamete

Netflix Movie Data

This dataset contains more than 8,500 Netflix movies and TV shows, with details such as cast members, duration, and genre. It contains titles added as recently as late September 2021.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd

Netflix = pd.read_csv("netflix_dataset.csv", index_col=0)

Source of dataset.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • 🗺️ Explore: How much variety exists in Netflix's offering? Base this on three variables: type, country, and listed_in.
  • 📊 Visualize: Build a word cloud from the movie and TV show descriptions. Make sure to remove stop words!
  • 🔎 Analyze: Has Netflix invested more in certain genres (see listed_in) in recent years? What about certain age groups (see ratings)?
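As a possible starting point for the Explore challenge, here is a minimal sketch that counts distinct values in type, country, and listed_in. It assumes that country and listed_in hold comma-separated strings, and it uses a tiny made-up DataFrame (`sample`) in place of the real file. The helper name `count_distinct` is hypothetical.

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset (assumption: same column names)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "country": ["United States, India", "India", None],
    "listed_in": ["Dramas, Comedies", "TV Dramas", "Comedies"],
})

def count_distinct(df, column):
    """Count distinct entries in a column of comma-separated strings."""
    # Split each cell on commas, give every entry its own row, and trim spaces
    values = df[column].dropna().str.split(",").explode().str.strip()
    return values.nunique()

variety = {col: count_distinct(sample, col) for col in ["type", "country", "listed_in"]}
print(variety)
```

Splitting before counting matters because a title listed under several genres or countries would otherwise be counted as one unique combined string.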

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

A talent agency has hired you to analyze patterns in the professional relationships of cast members and directors. The key deliverable is an undirected network graph in which each node represents a cast member or director, and an edge connects two nodes that worked on the same movie or TV show. You can limit the actors to the first four names listed in cast. The client is interested in any insights you can derive from your Netflix network analysis, such as the actor/actor and actor/director pairs that work most closely together, the most popular actors and directors to work with, and how the graph changes over time.

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
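One way to begin the scenario above is to build a weighted edge list before drawing anything. The sketch below uses only the Python standard library and made-up titles. It assumes that cast and director are comma-separated strings, and the names `titles` and `edge_weights` are illustrative, not part of the dataset.

```python
from collections import Counter
from itertools import combinations

# Made-up titles standing in for rows of the dataset (assumption: comma-separated names)
titles = [
    {"director": "Dir A", "cast": "Actor 1, Actor 2, Actor 3, Actor 4, Actor 5"},
    {"director": "Dir A", "cast": "Actor 1, Actor 2"},
    {"director": "Dir B", "cast": "Actor 2, Actor 6"},
]

edge_weights = Counter()
for title in titles:
    # Keep only the first four cast members, as the scenario allows
    actors = [name.strip() for name in title["cast"].split(",")][:4]
    directors = [name.strip() for name in title["director"].split(",")]
    # Sorting gives each unordered pair a single canonical key
    people = sorted(set(actors + directors))
    # Every pair of people on the same title adds one to that edge's weight
    edge_weights.update(combinations(people, 2))

print(edge_weights.most_common(3))
```

The resulting pair counts can be read directly as "who works together most often", or passed as weighted edges to a graph library for layout and plotting.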


✍️ If you have an idea for an interesting Scenario or Challenge, or have feedback on our existing ones, let us know! You can submit feedback by pressing the question mark in the top right corner of the screen and selecting "Give Feedback". Include the phrase "Content Feedback" to help us flag it in our system.

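# Keep only the movies; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
Movies = Netflix[Netflix['type'] == 'Movie'].copy()

print(Movies)
# Strip the " min" suffix so the durations can be parsed as numbers
Movies['duration'] = Movies['duration'].str.replace(' min', '', regex=False)

# Convert to numeric; entries that cannot be parsed become NaN
Movies['duration'] = pd.to_numeric(Movies['duration'], errors='coerce')
print(Movies)
Movies.hist(column='duration')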
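import numpy as np

# Movie durations as a NumPy array, with missing values dropped
np_durations = Movies['duration'].to_numpy()
np_durations = np_durations[~np.isnan(np_durations)]
print(np_durations)

# Scale the durations to (0, 1] by dividing by the longest duration
np_durations2 = np_durations / np.amax(np_durations)
# np_durations2 = np.sort(np_durations2)
print(np_durations2)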
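from scipy.stats import kstest

# Kolmogorov-Smirnov test; 'norm' is the standard normal N(0, 1),
# so the max-scaled durations are compared against mean 0 and standard deviation 1
kstest(np_durations2, 'norm')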
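Because `kstest(..., 'norm')` compares against the standard normal, values scaled into (0, 1] will differ from it in location and spread alone. A minimal sketch of an alternative is to standardize the data to zero mean and unit variance first. The example uses synthetic durations rather than the real column, and the variable names are illustrative. Note that estimating the mean and standard deviation from the same sample makes the reported p-value only approximate.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
# Synthetic durations in minutes (assumption: stand-in for the real data)
durations = rng.normal(loc=100, scale=25, size=500)

# Standardize to zero mean and unit variance before comparing with N(0, 1)
z_scores = (durations - durations.mean()) / durations.std(ddof=1)

result = kstest(z_scores, "norm")
print(result.statistic, result.pvalue)
```

With this form, the test asks whether the shape of the distribution looks normal, not whether the raw numbers happen to be centered on zero.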
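# Random sample of up to 35 movies lasting more than 65 and less than 83 minutes
x_i = np_durations[(np_durations > 65) & (np_durations < 83)]
np.random.shuffle(x_i)
x_i = x_i[:35]
# Random sample of up to 29 movies lasting more than 62 and less than 96 minutes
y_i = np_durations[(np_durations > 62) & (np_durations < 96)]
np.random.shuffle(y_i)
y_i = y_i[:29]
print(x_i.size)
print(y_i.size)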
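The shuffle-then-slice pattern above draws a random sample without replacement. A reproducible variant, sketched here with synthetic data, uses NumPy's `Generator.choice` with a fixed seed. The helper name `sample_between` and the seed value are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the samples can be reproduced
# Synthetic durations in minutes (assumption: stand-in for np_durations)
durations = rng.integers(40, 180, size=1000).astype(float)

def sample_between(values, low, high, n, rng):
    """Draw up to n values strictly between low and high, without replacement."""
    pool = values[(values > low) & (values < high)]
    return rng.choice(pool, size=min(n, pool.size), replace=False)

x_sample = sample_between(durations, 65, 83, 35, rng)
y_sample = sample_between(durations, 62, 96, 29, rng)
print(x_sample.size, y_sample.size)
```

Capping the sample size at the pool size mirrors the slicing behavior, which silently returns fewer items when not enough values fall in the range.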