We'll work with a dataset on the top IMDB movies, as rated by IMDB.
Specifically, we have a CSV that contains:
[Details available at the above link]
import pandas as pd
import numpy as np
import matplotlib as plt
import re
%matplotlib inline
First, read in the dataset, called movies.csv
into a DataFrame called "movies." It's in the ./data
folder.
Let's first explore our dataset to verify we have what we expect.
Print the first five rows.
How many rows and columns are in the datset?
What are the column names?
How many unique genres are there?
How many movies are there per genre?
For each of these prompts, create a plot to visualize the answer. Consider what plot is most appropriate to explore the given prompt.
What is the relationship between IMDB ratings and Rotten Tomato ratings?
What is the relationship between IMDB rating and movie duration?
How many movies are there in each genre category? (Remember to create a plot here)
What does the distribution of Rotten Tomatoes ratings look like?
There are many things left unexplored! Consider investigating something about gross revenue and genres.
# histogram of gross sales
# top 10 grossing films