Plotting with Pandas and Matplotlib¶

Learning Objectives¶

After this lesson, you will be able to:

Implement different types of plots on a given dataset.

Recap¶

In the last lesson, we learned about when to use the different types of plots. Can anyone give an example of when we would use a:

line plot?
bar plot?
histogram?
scatter plot?

Pandas and Matplotlib¶

As we explore different types of plots, notice:

Different types of plots are drawn very similarly -- they even tend to share parameter names.
In Pandas, calling plot() on a DataFrame is different from calling it on a Series. Although both methods are named plot, they may take different parameters.
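
To make the second point concrete, here is a minimal sketch on a tiny made-up DataFrame (the `fruit` data and its column names are assumptions for illustration, not part of the lesson's datasets). It draws the same data once through a Series and once through the whole DataFrame:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not one of the lesson's datasets.
fruit = pd.DataFrame(
    {"apples": [3, 5, 4], "pears": [2, 2, 6]},
    index=[2016, 2017, 2018],
)

fig, (ax_series, ax_frame) = plt.subplots(1, 2, figsize=(10, 4))

# Series.plot(): a single line, with the index on the x-axis.
fruit["apples"].plot(ax=ax_series, title="Series.plot()")

# DataFrame.plot(): one line per numeric column, plus a legend.
fruit.plot(ax=ax_frame, title="DataFrame.plot()")
```

Passing `ax=` explicitly is a design choice here: it makes it obvious which subplot each call draws on.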

Sometimes Pandas can be a little frustrating... perseverance is key!

Lesson Guide¶

Line Plots
Bar Plots
Histograms
Scatter Plots
Using Seaborn
OPTIONAL: Understanding Matplotlib (Figures, Subplots, and Axes)
OPTIONAL: Additional Topics
Summary

Plotting with Pandas: How?¶

<data_set>.<columns>.plot()

Note: These are example plots on a fictitious dataset. We'll work with real ones in just a minute!

population['states'].value_counts().plot() creates:

Plotting: Visualization Types¶

Line charts are default.

# line chart

population['states'].value_counts().plot()

For other charts:

population['states'].plot(kind='bar')

population['states'].plot(kind='hist', bins=3);

# scatter plots need two columns, so call plot() on the DataFrame, not on a Series
population.plot(kind='scatter', x='states', y='population')
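
If you want to run the `kind=` variations before we load the real data, the following is a rough, self-contained sketch. The `population` frame, its values, and the extra `area` column are all invented for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Fictitious data; the values and the 'area' column are made up.
population = pd.DataFrame({
    "states": ["NY", "CA", "NY", "TX", "CA", "NY"],
    "area": [10, 40, 12, 35, 42, 11],
    "population": [8.0, 3.9, 8.4, 2.3, 4.0, 8.1],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Bar chart of how often each state appears.
state_counts = population["states"].value_counts()
state_counts.plot(kind="bar", ax=axes[0])

# Histogram of a numeric column, split into 3 bins.
population["population"].plot(kind="hist", bins=3, ax=axes[1])

# Scatter plots compare two columns, so they are called on the DataFrame.
population.plot(kind="scatter", x="area", y="population", ax=axes[2])
```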

Let's try!

Import packages¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# set the plots to display in the Jupyter notebook
%matplotlib inline

# change plotting colors per client request
plt.style.use('ggplot')

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

Load in data sets for visualization¶

Football Records: International football results from 1872 to 2018
Avocado Prices: Historical data on avocado prices and sales volume in multiple US markets
Chocolate Bar Ratings: Expert ratings of over 1,700 chocolate bars

These have been included in ./datasets of this repo for your convenience.

!ls

04-plotting-with-pandas.ipynb		 chocolate_ratings.csv
04-plotting-with-pandas-solutions.ipynb  international_football_results.csv
avocado.csv				 readme.md

foot = pd.read_csv('./international_football_results.csv')
avo = pd.read_csv('./avocado.csv')
choc = pd.read_csv('./chocolate_ratings.csv')

Line plots: Show the trend of a numerical variable over time¶

Let's focus on the football scores for starters.

foot.head(3)

We can extract the year by converting the date to a datetime64[ns] object, and then using the pd.Series.dt.year property to return the year (as an int). We'll learn more about this in future lessons.

foot['year'] = pd.to_datetime(foot['date']).dt.year

foot[['date', 'year']].head(3)

We can then get the number of games played every year by using pd.Series.value_counts, and using the sort_index() method to ensure our year is sorted chronologically.

foot['year'].value_counts().sort_index().head()

1872    1
1873    1
1874    1
1875    1
1876    2
Name: year, dtype: int64
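
The two steps above (extract the year, then count and sort) can be tried on a handful of made-up dates. This is only a sketch; the `games` frame is hypothetical and stands in for the football data:

```python
import pandas as pd

# Made-up match dates standing in for the football data.
games = pd.DataFrame({
    "date": ["1902-05-01", "1900-03-10", "1902-06-15", "1901-09-09", "1902-11-30"],
})

# Parse the strings into datetimes, then pull out the year as an integer.
games["year"] = pd.to_datetime(games["date"]).dt.year

# value_counts() orders by count; sort_index() restores chronological order.
games_per_year = games["year"].value_counts().sort_index()
print(games_per_year)
```

Without `sort_index()`, the busiest year (1902 here) would come first, and a line plot would no longer read left to right in time.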

Using this data, we can use the pd.Series.plot() method to graph the count of games against the year they were played:

foot['year'].value_counts().sort_index().plot();

Knowledge Check

Why does it make sense to use a line plot for this visualization?

Another example¶

foot['home_team'].value_counts().head()

Brazil       552
Argentina    535
Germany      495
Mexico       494
England      483
Name: home_team, dtype: int64

Knowledge Check

Why would it NOT make sense to use a line plot for this visualization?

Bar Plots: Show a numerical comparison across different categories¶

Count the number of games played in each country in the football dataset.

foot['country'].value_counts().head()

USA         1087
France       775
England      659
Malaysia     634
Sweden       632
Name: country, dtype: int64

Let's view the same information, but in a bar chart instead. Note that we are using .head() to return the top 5. Also note that value_counts() automatically sorts the counts from largest to smallest (read the docs!)

foot['country'].value_counts().head().plot(kind='bar')

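
Here is a small sketch of the same pattern on invented data (the `hosts` Series is an assumption, not the football dataset). It also labels the axes, which the bar chart above leaves blank:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up host countries, not the real football data.
hosts = pd.Series(["USA", "France", "USA", "England", "USA", "France"])

# value_counts() returns the counts sorted from largest to smallest,
# so head(2) keeps the two most frequent hosts.
top_hosts = hosts.value_counts().head(2)

fig, ax = plt.subplots()
top_hosts.plot(kind="bar", ax=ax)
ax.set_xlabel("Country")
ax.set_ylabel("Number of games")
```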

Histograms: Show the distribution of a numerical variable¶

Let's change to the chocolate bar dataset.

choc.head()

How would you split the `Rating` values into 3 equally sized bins?¶

choc['Rating'].unique()

array([3.75, 2.75, 3.  , 3.5 , 4.  , 3.25, 2.5 , 5.  , 1.75, 1.5 , 2.25,
       2.  , 1.  ])

Use a histogram! The bins=n kwarg allows us to specify the number of bins ('buckets') of values.

choc.REF

# choc.select_dtypes(include='number')
plt.hist([choc['Rating'].values, choc['REF'].values], stacked=True)

([array([1795.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
            0.]),
  array([1944.,  157.,  150.,  176.,  178.,  187.,  191.,  215.,  197.,
          195.])],
 array([1.0000e+00, 1.9610e+02, 3.9120e+02, 5.8630e+02, 7.8140e+02,
        9.7650e+02, 1.1716e+03, 1.3667e+03, 1.5618e+03, 1.7569e+03,
        1.9520e+03]),
 <a list of 2 Lists of Patches objects>)

choc['Rating'].plot(kind='hist', bins=3);
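
To see what bins=3 actually does, it can help to look at the bin edges and counts directly. The sketch below uses made-up ratings (not the chocolate data) and np.histogram, which applies the same equal-width binning rule as the plot. Note that "equally sized" here means bins of equal width, not bins holding equal numbers of ratings.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up ratings, not the chocolate data.
ratings = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 3.75, 4.0])

# Split the range from min to max into 3 equal-width bins and count each one.
counts, edges = np.histogram(ratings, bins=3)
print(edges)   # bin boundaries
print(counts)  # number of ratings that fall in each bin

# The histogram draws one bar per bin.
fig, ax = plt.subplots()
ratings.plot(kind="hist", bins=3, ax=ax)
```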

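Sometimes it is helpful to increase the number of bins, for example if you think you might have an outlier or a lot of values piled up at a single point (such as zero). Narrower bins show detail that wide bins hide.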

choc['Rating'].plot(kind='hist', bins=20)
plt.ylabel('Number of Ratings')
plt.xlabel('Chocolate Rating')
plt.title('My Title');
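
The plt.xlabel(), plt.ylabel(), and plt.title() calls above act on whichever plot is currently active. An alternative, sketched below on made-up ratings, is to hold on to the Matplotlib Axes object and label it directly. The data and label text are assumptions for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings, not the chocolate data.
ratings = pd.Series([2.5, 3.0, 3.0, 3.5, 3.5, 3.5, 4.0])

# Create the Axes first, then tell Pandas to draw on it.
fig, ax = plt.subplots()
ratings.plot(kind="hist", bins=4, ax=ax)

# Label the Axes object itself instead of the "current" plot.
ax.set_xlabel("Chocolate Rating")
ax.set_ylabel("Number of Ratings")
ax.set_title("Distribution of Ratings")
```

Working with the Axes object tends to be less error-prone once a figure holds more than one subplot.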

Knowledge check:¶

What does the y-axis represent on a histogram? What about the x-axis? How would you explain a histogram to a non-technical person?

Making histograms of an entire dataframe:¶

choc.hist(figsize=(16,8));

Why doesn't it make plots of ALL the columns in the dataframe?¶

Hint: what is different about the columns it plots vs. the ones it left out?

choc.head(3)

Let's take a look at the data types of all the columns:

choc.dtypes

Company \n(Maker-if known)            object
Specific Bean Origin\nor Bar Name     object
REF                                    int64
Review\nDate                           int64
Cocoa\nPercent                        object
Company\nLocation                     object
Rating                               float64
Bean\nType                            object
Broad Bean\nOrigin                    object
dtype: object

It looks like it included REF, Review Date, and Rating. These have datatypes of int64, int64, and float64 respectively. What do these all have in common, that is different from the other data types?

Answer:

They're all **numeric!** The other columns are **categorical**, specifically string values (dtype object). That includes Cocoa Percent, whose values such as 63% are stored as text.

We can filter on these types using the select_dtypes() DataFrame method (which can be very handy!)

choc.select_dtypes(include='number').head()
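
As a rough sketch of how select_dtypes() behaves, and of how a text column such as a percentage could be made numeric, consider the toy frame below. The frame and its simplified column names are hypothetical; the real chocolate columns are named differently.

```python
import pandas as pd

# Toy frame with simplified, hypothetical column names.
bars = pd.DataFrame({
    "maker": ["A", "B", "C"],
    "cocoa_percent": ["63%", "70%", "72%"],  # stored as text because of the % sign
    "review_year": [2016, 2015, 2015],
    "rating": [3.75, 2.75, 3.0],
})

# Only the int and float columns count as numeric.
numeric_cols = bars.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Strip the % sign and convert, so this column becomes numeric too.
bars["cocoa_percent"] = bars["cocoa_percent"].str.rstrip("%").astype(float)
numeric_cols_after = bars.select_dtypes(include="number").columns.tolist()
print(numeric_cols_after)
```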

Challenge: create a histogram of the `Review Date` column, with 10 bins, and label both axes¶

Scatter plots: Show the relationship between two numerical variables¶

Scatter plots are very good at showing the interaction between two numeric variables (especially when they're continuous)!

avo.head(3)

avo.plot(kind='scatter', x='Total Volume', y='AveragePrice',
         color='dodgerblue', figsize=(10,4), s=10, alpha=0.5)


Oh snap! What did we just make?! It's a price elasticity curve!

We can also use a scatter matrix (also called a pair plot), which is a grid of scatter plots, one for each pair of columns. This allows you to quickly view how many features interact at once. You are generally looking for a trend between variables (a line or curve). Using machine learning, you can fit these curves to provide predictive power.

avo.select_dtypes(include='number').iloc[:,-5:-1]

pd.plotting.scatter_matrix(
    avo.select_dtypes(include='number').iloc[:,-5:-1],
    figsize=(12,12)
);
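
To get a feel for what a trend looks like in a scatter matrix, here is a minimal sketch on synthetic data. The `features` frame is invented: `y` is built to depend on `x`, while `z` is unrelated noise.

```python
import numpy as np
import pandas as pd

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
features = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=50),  # roughly linear in x
    "z": rng.normal(size=50),                     # unrelated noise
})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = pd.plotting.scatter_matrix(features, figsize=(6, 6), diagonal="hist")
print(axes.shape)
```

The x/y panels should show a clear diagonal band, while any panel involving z should look like a shapeless cloud.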

We can also use a very handy parameter, c, which allows us to color the dots in a scatter plot. This is extremely helpful for classification problems, where you will often set the color to the class label.

avo['type'].unique()

array(['conventional', 'organic'], dtype=object)

Let's map the type field to the color of the dot in our price elasticity curve. To use the type field, we need to convert it from a string into a number. We can use pd.Series.apply() for this.

mapping_dict = {}
# number each unique label 0, 1, 2, ... in order of appearance
# (avo_type avoids shadowing Python's built-in type)
for class_label, avo_type in enumerate(avo['type'].unique()):
    mapping_dict[avo_type] = class_label
mapping_dict

{'conventional': 0, 'organic': 1}

We can see we have two unique type labels, conventional and organic. Although that is the case for this dataset, the loop above stores whatever labels it finds in a dictionary, incrementing the number by 1 for each new label. This way, if we receive an additional type label in the future, our code won't break. Always think about extensible code!

Now we can use this mapping_dict dictionary to map the values using .apply():

avo['type_as_num'] = avo['type'].apply(lambda x: mapping_dict[x])
avo[['type', 'type_as_num']].head(3)
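
Pandas also has helpers that can stand in for the hand-written loop and the lambda. The sketch below is one possible alternative, not the approach this lesson uses: pd.factorize() numbers the labels in order of first appearance, and Series.map() accepts a dictionary directly. The `types` Series is toy data.

```python
import pandas as pd

# Toy labels standing in for the avocado 'type' column.
types = pd.Series(["conventional", "organic", "conventional", "organic"])

# factorize() numbers each label in order of first appearance.
codes, labels = pd.factorize(types)
mapping = {label: code for code, label in enumerate(labels)}

# Series.map() takes the dictionary as-is, with no lambda needed.
type_as_num = types.map(mapping)
print(mapping)
```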

Finally, we can use this binary class label as our c parameter to gain some insight:

avo.plot(kind='scatter', x='Total Volume', y='AveragePrice',
         c='type_as_num', colormap='winter', figsize=(8,4))
plt.xlabel('Volume')
plt.savefig('./avo_price.png');

!ls

04-plotting-with-pandas.ipynb		 chocolate_ratings.csv
04-plotting-with-pandas-solutions.ipynb  international_football_results.csv
avocado.csv				 readme.md
avo_price.png

Amazing! It looks like the organic avocados (value of 1) totally occupy the lower volume, higher price bracket. Those dang kids with their toast and unicycles driving up the price of my 'cados!

We can also pass a more continuous column, such as year, as the c parameter, which makes better use of the color gradient. Matplotlib ships with many gradients (colormaps); its colormap reference lists them all.
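
A minimal sketch of coloring by year might look like the following. The numbers are made up and only the column names mirror the avocado data; the 'viridis' colormap is just one choice among many.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers; only the column names mirror the avocado data.
sales = pd.DataFrame({
    "Total Volume": [50_000, 120_000, 80_000, 200_000, 30_000, 150_000],
    "AveragePrice": [1.60, 1.10, 1.35, 0.95, 1.80, 1.05],
    "year": [2015, 2015, 2016, 2016, 2017, 2018],
})

fig, ax = plt.subplots(figsize=(8, 4))

# c names the column that drives the color; colormap picks the gradient.
sales.plot(kind="scatter", x="Total Volume", y="AveragePrice",
           c="year", colormap="viridis", ax=ax)
```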

Finally, we can save the plot to a file using the plt.savefig() method, as in the call above that wrote avo_price.png.
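
Here is a small, self-contained sketch of saving a figure. The `prices` values and the file name are invented, and the file goes to a temporary folder so nothing in the repo is overwritten. In a notebook, call savefig() in the same cell that draws the plot.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Toy values, not the avocado data.
prices = pd.Series([1.33, 1.35, 0.93, 1.08], name="AveragePrice")

fig, ax = plt.subplots()
prices.plot(ax=ax)

# Write to a temporary folder here; in the lesson the file lands next to the notebook.
out_path = os.path.join(tempfile.mkdtemp(), "example_plot.png")

# dpi sets the resolution; bbox_inches='tight' trims extra whitespace.
fig.savefig(out_path, dpi=150, bbox_inches="tight")
print(os.path.exists(out_path))
```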

Summary¶

In this lesson, we showed examples of how to create a variety of plots using Pandas and Matplotlib. We also showed how to use each plot to effectively display data.

Do not be concerned if you do not remember everything — this will come with practice! Although there are many plot styles, many similarities exist between how each plot is drawn. For example, they have most parameters in common, and the same Matplotlib functions are used to modify the plot area.

We looked at:

Line plots
Bar plots
Histograms
Scatter plots

Additional Resources¶

Always read the documentation!

The DataFrame outputs from the cells above are reproduced below, in order: foot.head(3), choc.head(), choc.head(3), choc.select_dtypes(include='number').head(), avo.head(3), and avo.select_dtypes(include='number').iloc[:,-5:-1].

	date	home_team	away_team	home_score	away_score	tournament	city	country	neutral
0	1872-11-30	Scotland	England	0	0	Friendly	Glasgow	Scotland	False
1	1873-03-08	England	Scotland	4	2	Friendly	London	England	False
2	1874-03-07	Scotland	England	2	1	Friendly	Glasgow	Scotland	False

	Company (Maker-if known)	Specific Bean Origin or Bar Name	REF	Review Date	Cocoa Percent	Company Location	Rating	Broad Bean Origin
0	A. Morin	Agua Grande	1876	2016	63%	France	3.75	Sao Tome
1	A. Morin	Kpime	1676	2015	70%	France	2.75	Togo
2	A. Morin	Atsane	1676	2015	70%	France	3.00	Togo
3	A. Morin	Akata	1680	2015	70%	France	3.50	Togo
4	A. Morin	Quilla	1704	2015	70%	France	3.50	Peru

	Company (Maker-if known)	Specific Bean Origin or Bar Name	REF	Review Date	Cocoa Percent	Company Location	Rating	Broad Bean Origin
0	A. Morin	Agua Grande	1876	2016	63%	France	3.75	Sao Tome
1	A. Morin	Kpime	1676	2015	70%	France	2.75	Togo
2	A. Morin	Atsane	1676	2015	70%	France	3.00	Togo

	REF	Review Date	Rating
0	1876	2016	3.75
1	1676	2015	2.75
2	1676	2015	3.00
3	1680	2015	3.50
4	1704	2015	3.50

	Unnamed: 0	Date	AveragePrice	Total Volume	4046	4225	4770	Total Bags	Small Bags	Large Bags	type	year	region
0	0	2015-12-27	1.33	64236.62	1036.74	54454.85	48.16	8696.87	8603.62	93.25	conventional	2015	Albany
1	1	2015-12-20	1.35	54876.98	674.28	44638.81	58.33	9505.56	9408.07	97.49	conventional	2015	Albany
2	2	2015-12-13	0.93	118220.22	794.70	109149.67	130.50	8145.35	8042.21	103.14	conventional	2015	Albany

	Total Bags	Small Bags	Large Bags	XLarge Bags
0	8696.87	8603.62	93.25	0.00
1	9505.56	9408.07	97.49	0.00
2	8145.35	8042.21	103.14	0.00
3	5811.16	5677.40	133.76	0.00
4	6183.95	5986.26	197.69	0.00
5	6683.91	6556.47	127.44	0.00
6	8318.86	8196.81	122.05	0.00
7	6829.22	6266.85	562.37	0.00
8	11388.36	11104.53	283.83	0.00
9	8625.92	8061.47	564.45	0.00
10	8205.66	7877.86	327.80	0.00
11	10123.90	9866.27	257.63	0.00
12	8756.75	8379.98	376.77	0.00
13	6034.46	5888.87	145.59	0.00
14	9267.36	8489.10	778.26	0.00
15	9286.68	8665.19	621.49	0.00
16	7990.10	7762.87	227.23	0.00
17	10306.73	10218.93	87.80	0.00
18	10880.36	10745.79	134.57	0.00
19	10443.22	10297.68	145.54	0.00
20	9225.89	9116.34	109.55	0.00
21	11847.02	11768.52	78.50	0.00
22	13192.69	13061.53	131.16	0.00
23	11287.48	11103.49	183.99	0.00
24	24431.90	24290.08	108.49	33.33
25	29898.96	29663.19	235.77	0.00
26	26662.08	26311.76	350.32	0.00
27	21875.65	21662.00	213.65	0.00
28	29002.59	28343.14	659.45	0.00
29	22775.21	22314.99	460.22	0.00
...	...	...	...	...
18219	945638.02	768242.42	177144.00	251.60
18220	977084.84	774695.74	201878.69	510.41
18221	936859.49	796104.27	140652.84	102.38
18222	914409.26	710654.40	203526.59	228.27
18223	1005593.78	858772.69	146808.97	12.12
18224	1089861.24	915452.78	174381.57	26.89
18225	166747.85	87108.00	79495.39	144.46
18226	129353.55	73163.12	56020.24	170.19
18227	176465.63	107174.93	69290.70	0.00
