
The Dataset

We'll work with a dataset of the top-rated movies on IMDB.

Specifically, we have a CSV that contains:

  • IMDB star rating
  • Movie title
  • Year
  • Content rating
  • Genre
  • Duration
  • Gross

[Details available at the above link]

Import our necessary libraries

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
%matplotlib inline

Read in the dataset

First, read the dataset, movies.csv, into a DataFrame called "movies". The file is in the ./data folder.

In [0]:
 
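One possible approach, sketched under the assumption that the file sits at ./data/movies.csv relative to the notebook. The helper name `load_movies` is ours for illustration and is not part of the exercise.

```python
import pandas as pd


def load_movies(path="./data/movies.csv"):
    """Read the movies CSV into a DataFrame.

    The default path is an assumption taken from the exercise text.
    """
    return pd.read_csv(path)


# In the notebook you would then run:
# movies = load_movies()
```

Wrapping the call in a small function makes it easy to point at a different location if the data folder lives elsewhere.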

Check the dataset basics

Let's first explore our dataset to verify we have what we expect.

Print the first five rows.

In [0]:
 

How many rows and columns are in the dataset?

In [0]:
 

What are the column names?

In [0]:
 
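The three checks above (first rows, shape, column names) could be sketched as follows. The small DataFrame is stand-in data so the snippet runs on its own, and its column names (`star_rating`, `title`, `genre`) are assumptions; the real CSV may label them differently.

```python
import pandas as pd

# Stand-in data; the real column names in movies.csv may differ.
movies = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.1],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genre": ["Crime", "Crime", "Drama"],
})

first_rows = movies.head()            # first five rows (fewer if the frame is shorter)
n_rows, n_cols = movies.shape         # shape is a (rows, columns) tuple
column_names = list(movies.columns)   # column labels as a plain list

print(first_rows)
print(n_rows, n_cols)
print(column_names)
```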

How many unique genres are there?

In [0]:
 

How many movies are there per genre?

In [0]:
 
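A minimal sketch of both genre questions, again on stand-in data with an assumed `genre` column: `nunique()` counts distinct values, and `value_counts()` tallies rows per value, sorted from most to least frequent.

```python
import pandas as pd

# Stand-in data; the real genre labels and column name may differ.
movies = pd.DataFrame(
    {"genre": ["Crime", "Drama", "Crime", "Action", "Drama", "Crime"]}
)

n_genres = movies["genre"].nunique()        # number of distinct genres
per_genre = movies["genre"].value_counts()  # movies per genre, descending

print(n_genres)
print(per_genre)
```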

Exploratory data analysis with visualizations

For each of these prompts, create a plot to visualize the answer. Consider what plot is most appropriate to explore the given prompt.

What is the relationship between IMDB ratings and Rotten Tomatoes ratings?

In [0]:
 

What is the relationship between IMDB rating and movie duration?

In [0]:
 
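For two numeric variables, a scatter plot is a natural first choice. The sketch below uses stand-in data and assumes columns named `duration` and `star_rating`, with duration measured in minutes; none of these are confirmed by the exercise.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data; column names and units are assumptions.
movies = pd.DataFrame({
    "duration": [142, 175, 200, 152, 96],
    "star_rating": [9.3, 9.2, 9.1, 9.0, 8.9],
})

fig, ax = plt.subplots()
ax.scatter(movies["duration"], movies["star_rating"], alpha=0.6)
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("IMDB star rating")
ax.set_title("IMDB rating vs. duration")

# A single number to accompany the plot: the Pearson correlation.
corr = movies["duration"].corr(movies["star_rating"])
print(corr)
```

The correlation coefficient summarizes only linear association, so it is best read alongside the plot rather than instead of it.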

How many movies are there in each genre category? (Remember to create a plot here)

In [0]:
 
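Counts per category are usually shown as a bar chart. A minimal sketch, assuming a `genre` column and using stand-in data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data; the real genre labels and column name may differ.
movies = pd.DataFrame(
    {"genre": ["Crime", "Drama", "Crime", "Action", "Drama", "Crime"]}
)

counts = movies["genre"].value_counts()  # one count per genre, descending

fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)           # one bar per genre
ax.set_xlabel("Genre")
ax.set_ylabel("Number of movies")
```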

What does the distribution of Rotten Tomatoes ratings look like?

In [0]:
 

Bonus

There are many things left unexplored! Consider investigating something about gross revenue and genres.

In [0]:
# histogram of gross sales

In [0]:
# top 10 grossing films
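
Both bonus cells could be approached roughly as below. The data is synthetic, the column names `title` and `gross` are assumptions, and the bin count is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in data: 12 movies with distinct gross values.
movies = pd.DataFrame({
    "title": [f"Movie {i}" for i in range(12)],
    "gross": [((i * 7) % 12) * 1_000_000.0 for i in range(12)],
})

# Histogram of gross sales.
fig, ax = plt.subplots()
bin_counts, bin_edges, _ = ax.hist(movies["gross"], bins=6)
ax.set_xlabel("Gross")
ax.set_ylabel("Number of movies")

# Top 10 grossing films, highest first.
top10 = movies.nlargest(10, "gross")
print(top10)
```

`nlargest` avoids sorting the whole frame by hand and returns the rows already ordered from highest to lowest gross.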