
Plotting with Pandas and Matplotlib


Learning Objectives

After this lesson, you will be able to:

  • Implement different types of plots on a given dataset.

Recap

In the last lesson, we learned about when to use the different types of plots. Can anyone give an example of when we would use a:

  • line plot?
  • bar plot?
  • histogram?
  • scatter plot?

Pandas and Matplotlib

As we explore different types of plots, notice:

  1. Different types of plots are drawn very similarly -- they even tend to share parameter names.
  2. In Pandas, calling plot() on a DataFrame is different than calling it on a Series. Although the methods are both named plot, they may take different parameters.
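To make the second point concrete, here is a minimal sketch on a made-up DataFrame (the `height` and `weight` columns are hypothetical, not part of the lesson's datasets). It shows one practical difference: a scatter plot needs two columns, so it is only available from the DataFrame version of plot().

```python
import pandas as pd

# Hypothetical toy data, not one of the lesson's datasets.
df = pd.DataFrame({
    'height': [150, 160, 170, 180],
    'weight': [50, 58, 69, 80],
})

# Series.plot(): a single column, drawn against the index.
ax_series = df['height'].plot(kind='line')

# DataFrame.plot(): accepts x= and y= column names.
ax_frame = df.plot(kind='scatter', x='height', y='weight')

# A Series has only one column, so asking it for a scatter plot fails.
try:
    df['height'].plot(kind='scatter')
    series_scatter_failed = False
except ValueError:
    series_scatter_failed = True
```

Both calls return a Matplotlib Axes object, which is why the same Matplotlib functions can be used to tweak either plot afterwards.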

Sometimes Pandas can be a little frustrating... perseverance is key!

Plotting with Pandas: How?

<data_set>.<columns>.plot()

Note: These are example plots on a fictitious dataset. We'll work with real ones in just a minute!

population['states'].value_counts().plot() creates:

Plotting: Visualization Types

Line charts are default.

# line chart

population['states'].value_counts().plot()

For other charts:

population['states'].value_counts().plot(kind='bar')

population['states'].plot(kind='hist', bins=3);

population.plot(kind='scatter', x='states', y='population')

Let's try!

Import packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# set the plots to display in the Jupyter notebook
%matplotlib inline

# change plotting colors per client request
plt.style.use('ggplot')

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

Load in data sets for visualization

These have been included in this repo, alongside this notebook, for your convenience.

In [3]:
!ls
04-plotting-with-pandas.ipynb		 chocolate_ratings.csv
04-plotting-with-pandas-solutions.ipynb  international_football_results.csv
avocado.csv				 readme.md
In [4]:
foot = pd.read_csv('./international_football_results.csv')
avo = pd.read_csv('./avocado.csv')
choc = pd.read_csv('./chocolate_ratings.csv')

Line plots: Show the trend of a numerical variable over time


Let's focus on the football scores for starters.

In [5]:
foot.head(3)
Out[5]:
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False

We can extract the year by converting the date to a datetime64[ns] object, and then using the pd.Series.dt.year property to return the year (as an int). We'll learn more about this in future lessons.

In [18]:
foot['year'] = pd.to_datetime(foot['date']).dt.year
In [19]:
foot[['date', 'year']].head(3)
Out[19]:
date year
0 1872-11-30 1872
1 1873-03-08 1873
2 1874-03-07 1874
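As a small sketch of what happens in that step, the snippet below repeats the conversion on three hand-typed date strings (copied from the first rows shown above). The `.dt` accessor exposes other parts of the date as well, such as the month.

```python
import pandas as pd

# Three date strings copied from the first rows of the football data.
dates = pd.Series(['1872-11-30', '1873-03-08', '1874-03-07'])

# Strings become real datetime values...
parsed = pd.to_datetime(dates)

# ...and the .dt accessor pulls out individual components as integers.
years = parsed.dt.year
months = parsed.dt.month

print(years.tolist())
print(months.tolist())
```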

We can then get the number of games played every year by using pd.Series.value_counts, and using the sort_index() method to ensure our year is sorted chronologically.

In [22]:
foot['year'].value_counts().sort_index().head()
Out[22]:
1872    1
1873    1
1874    1
1875    1
1876    2
Name: year, dtype: int64
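To see why sort_index() matters here, consider this sketch on a handful of made-up years. Without it, value_counts() orders the result by count, which would scramble the time axis of a line plot.

```python
import pandas as pd

# Made-up years, deliberately out of order.
years = pd.Series([1876, 1872, 1876, 1874, 1876, 1874])

# Ordered by count, largest first: the index is NOT chronological.
by_count = years.value_counts()

# Ordered by the index (the year): ready for a line plot over time.
by_year = years.value_counts().sort_index()

print(by_count.index.tolist())
print(by_year.index.tolist())
```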

Using this data, we can use the pd.Series.plot() method to graph the number of games against the year:

In [26]:
foot['year'].value_counts().sort_index().plot();

Knowledge Check

Why does it make sense to use a line plot for this visualization?


Another example


In [29]:
foot['home_team'].value_counts().head()
Out[29]:
Brazil       552
Argentina    535
Germany      495
Mexico       494
England      483
Name: home_team, dtype: int64

Knowledge Check

Why would it NOT make sense to use a line plot for this visualization?


Bar Plots: Show a numerical comparison across different categories


Count the number of games played in each country in the football dataset.

In [30]:
foot['country'].value_counts().head()
Out[30]:
USA         1087
France       775
England      659
Malaysia     634
Sweden       632
Name: country, dtype: int64

Let's view the same information, but in a bar chart instead. Note we are using .head() to return the top 5. Also note that value_counts() automatically sorts by count, in descending order (read the docs!)

In [31]:
foot['country'].value_counts().head().plot(kind='bar')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1529677b00>

Histograms: Show the distribution of a numerical variable


Let's change to the chocolate bar dataset.

In [32]:
choc.head()
Out[32]:
Company  (Maker-if known) Specific Bean Origin or Bar Name REF Review Date Cocoa Percent Company Location Rating Bean Type Broad Bean Origin
0 A. Morin Agua Grande 1876 2016 63% France 3.75 Sao Tome
1 A. Morin Kpime 1676 2015 70% France 2.75 Togo
2 A. Morin Atsane 1676 2015 70% France 3.00 Togo
3 A. Morin Akata 1680 2015 70% France 3.50 Togo
4 A. Morin Quilla 1704 2015 70% France 3.50 Peru

How would you split the Rating values into 3 equally sized bins?

In [33]:
choc['Rating'].unique()
Out[33]:
array([3.75, 2.75, 3.  , 3.5 , 4.  , 3.25, 2.5 , 5.  , 1.75, 1.5 , 2.25,
       2.  , 1.  ])

Use a histogram! The bins=n kwarg allows us to specify the number of bins ('buckets') of values.

In [ ]:
choc.REF
In [89]:
# choc.select_dtypes(include='number')
plt.hist([choc['Rating'].values, choc['REF'].values], stacked=True)
Out[89]:
([array([1795.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
            0.]),
  array([1944.,  157.,  150.,  176.,  178.,  187.,  191.,  215.,  197.,
          195.])],
 array([1.0000e+00, 1.9610e+02, 3.9120e+02, 5.8630e+02, 7.8140e+02,
        9.7650e+02, 1.1716e+03, 1.3667e+03, 1.5618e+03, 1.7569e+03,
        1.9520e+03]),
 <a list of 2 Lists of Patches objects>)
In [35]:
choc['Rating'].plot(kind='hist', bins=3);
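To see where the three bins fall, here is a sketch using NumPy's np.histogram on the unique rating values listed above (not the full Rating column, so the counts differ from the plot). It assumes the plot uses the same equal-width binning, which is the default behavior when bins is an integer.

```python
import numpy as np

# The unique rating values from the output above, not the full column.
ratings = np.array([3.75, 2.75, 3.0, 3.5, 4.0, 3.25, 2.5,
                    5.0, 1.75, 1.5, 2.25, 2.0, 1.0])

# bins=3 splits the range [min, max] into 3 intervals of equal WIDTH.
counts, edges = np.histogram(ratings, bins=3)

print(edges)   # 4 edges define 3 bins
print(counts)  # how many values landed in each bin
```

Note that "equally sized" here means equal width on the x-axis, not an equal number of values in each bin.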

Sometimes it is helpful to increase this number if you think you might have an outlier or a zero-weighted set.

In [43]:
choc['Rating'].plot(kind='hist', bins=20)
plt.ylabel('Number of Ratings')
plt.xlabel('Chocolate Rating')
plt.title('My Title');
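An alternative to the plt functions is to keep the Axes object that plot() returns and label it directly. The sketch below uses a few made-up ratings; the label text is only illustrative.

```python
import pandas as pd

# A few made-up ratings, standing in for choc['Rating'].
ratings = pd.Series([3.75, 2.75, 3.0, 3.5, 3.5, 4.0, 3.25], name='Rating')

# plot() returns the Matplotlib Axes it drew on.
ax = ratings.plot(kind='hist', bins=5)

# Label that specific Axes instead of "whatever plot is current".
ax.set_xlabel('Chocolate Rating')
ax.set_ylabel('Number of Ratings')
ax.set_title('Distribution of Ratings')
```

Holding on to the Axes is handy once a notebook cell contains more than one plot, because it is always clear which plot a label belongs to.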

Knowledge check:

What does the y-axis represent on a histogram? What about the x-axis? How would you explain a histogram to a non-technical person?

Making histograms of an entire dataframe:

In [45]:
choc.hist(figsize=(16,8));

Why doesn't it make plots of ALL the columns in the dataframe?

Hint: what is different about the columns it plots vs. the ones it left out?

In [46]:
choc.head(3)
Out[46]:
Company  (Maker-if known) Specific Bean Origin or Bar Name REF Review Date Cocoa Percent Company Location Rating Bean Type Broad Bean Origin
0 A. Morin Agua Grande 1876 2016 63% France 3.75 Sao Tome
1 A. Morin Kpime 1676 2015 70% France 2.75 Togo
2 A. Morin Atsane 1676 2015 70% France 3.00 Togo

Let's take a look at the data types of all the columns:

In [47]:
choc.dtypes
Out[47]:
Company \n(Maker-if known)            object
Specific Bean Origin\nor Bar Name     object
REF                                    int64
Review\nDate                           int64
Cocoa\nPercent                        object
Company\nLocation                     object
Rating                               float64
Bean\nType                            object
Broad Bean\nOrigin                    object
dtype: object

It looks like it included REF, Review Date, and Rating. These have datatypes of int64, int64, and float64 respectively. What do these all have in common, that is different from the other data types?

Answer: They're all numeric! The other columns are categorical, specifically string values.

We can filter on these types using the select_dtypes() DataFrame method (which can be very handy!)

In [49]:
choc.select_dtypes(include='number').head()
Out[49]:
REF Review Date Rating
0 1876 2016 3.75
1 1676 2015 2.75
2 1676 2015 3.00
3 1680 2015 3.50
4 1704 2015 3.50
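One detail worth noticing in the dtypes output above: several column names contain a newline character (for example 'Review\nDate'), so the name has to be typed with the \n included. The sketch below, on a made-up two-row frame with the same style of column names, shows one possible way to tidy them up first; it is an optional convenience, not a step the lesson requires.

```python
import pandas as pd

# Made-up rows using the same style of column names as the chocolate data.
toy = pd.DataFrame({
    'Review\nDate': [2016, 2015],
    'Cocoa\nPercent': ['63%', '70%'],
    'Rating': [3.75, 2.75],
})

# Selecting a column requires the newline in the name...
review_dates = toy['Review\nDate']

# ...unless the names are cleaned first (newline replaced by a space).
cleaned = toy.rename(columns=lambda name: name.replace('\n', ' '))

# The numeric filter works the same way on the cleaned frame.
numeric_columns = cleaned.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```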

Challenge: create a histogram of the Review Date column, with 10 bins, and label both axes


In [ ]:
 

Scatter plots: Show the relationship between two numerical variables


Scatter plots are very good at showing the interaction between two numeric variables (especially when they're continuous)!

In [50]:
avo.head(3)
Out[50]:
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
In [ ]:
 
In [56]:
avo.plot(kind='scatter', x='Total Volume', y='AveragePrice', \
        color='dodgerblue', figsize=(10,4), s=10, alpha=0.5)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f15290700f0>

Oh snap! What did we just make?! It's a price elasticity curve!

We can also use a thing called a scatter matrix or a pairplot, which is a grid of scatter plots. This allows you to quickly view the interaction of N x M features. You are generally looking for a trend between variables (a line or curve). Using machine learning, you can fit these curves to provide predictive power.

In [63]:
avo.select_dtypes(include='number').iloc[:,-5:-1]
Out[63]:
Total Bags Small Bags Large Bags XLarge Bags
0 8696.87 8603.62 93.25 0.00
1 9505.56 9408.07 97.49 0.00
2 8145.35 8042.21 103.14 0.00
3 5811.16 5677.40 133.76 0.00
4 6183.95 5986.26 197.69 0.00
5 6683.91 6556.47 127.44 0.00
6 8318.86 8196.81 122.05 0.00
7 6829.22 6266.85 562.37 0.00
8 11388.36 11104.53 283.83 0.00
9 8625.92 8061.47 564.45 0.00
10 8205.66 7877.86 327.80 0.00
11 10123.90 9866.27 257.63 0.00
12 8756.75 8379.98 376.77 0.00
13 6034.46 5888.87 145.59 0.00
14 9267.36 8489.10 778.26 0.00
15 9286.68 8665.19 621.49 0.00
16 7990.10 7762.87 227.23 0.00
17 10306.73 10218.93 87.80 0.00
18 10880.36 10745.79 134.57 0.00
19 10443.22 10297.68 145.54 0.00
20 9225.89 9116.34 109.55 0.00
21 11847.02 11768.52 78.50 0.00
22 13192.69 13061.53 131.16 0.00
23 11287.48 11103.49 183.99 0.00
24 24431.90 24290.08 108.49 33.33
25 29898.96 29663.19 235.77 0.00
26 26662.08 26311.76 350.32 0.00
27 21875.65 21662.00 213.65 0.00
28 29002.59 28343.14 659.45 0.00
29 22775.21 22314.99 460.22 0.00
... ... ... ... ...
18219 945638.02 768242.42 177144.00 251.60
18220 977084.84 774695.74 201878.69 510.41
18221 936859.49 796104.27 140652.84 102.38
18222 914409.26 710654.40 203526.59 228.27
18223 1005593.78 858772.69 146808.97 12.12
18224 1089861.24 915452.78 174381.57 26.89
18225 166747.85 87108.00 79495.39 144.46
18226 129353.55 73163.12 56020.24 170.19
18227 176465.63 107174.93 69290.70 0.00
18228 250090.37 85835.17 164087.33 167.87
18229 218560.51 99989.62 118314.77 256.12
18230 155725.83 120428.13 35257.73 39.97
18231 188559.45 88497.05 99810.80 251.60
18232 205409.91 70232.59 134666.91 510.41
18233 129911.47 77822.23 51986.86 102.38
18234 128267.76 76091.99 51947.50 228.27
18235 126261.89 89115.78 37133.99 12.12
18236 199330.12 103761.55 95544.39 24.18
18237 10806.44 10569.80 236.64 0.00
18238 12341.48 12114.81 226.67 0.00
18239 16762.57 16510.32 252.25 0.00
18240 13655.49 13401.93 253.56 0.00
18241 13964.33 13698.27 266.06 0.00
18242 13776.71 13553.53 223.18 0.00
18243 12693.57 12437.35 256.22 0.00
18244 13498.67 13066.82 431.85 0.00
18245 9264.84 8940.04 324.80 0.00
18246 9394.11 9351.80 42.31 0.00
18247 10969.54 10919.54 50.00 0.00
18248 12014.15 11988.14 26.01 0.00

18249 rows × 4 columns

In [58]:
pd.plotting.scatter_matrix(
    avo.select_dtypes(include='number').iloc[:,-5:-1],
    figsize=(12,12)
);
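For a feel of what scatter_matrix produces, here is a sketch on a small made-up frame with three numeric columns (the column names and random values are hypothetical). The function returns a grid of Axes, one per pair of columns, with a histogram of each column on the diagonal by default.

```python
import numpy as np
import pandas as pd

# Made-up data: 'total' is built from the other two columns,
# so its panels should show a visible trend.
rng = np.random.default_rng(0)
small = rng.uniform(0, 100, size=50)
large = rng.uniform(0, 20, size=50)
toy = pd.DataFrame({'small': small, 'large': large, 'total': small + large})

# One row and one column of panels per numeric column.
axes = pd.plotting.scatter_matrix(toy, figsize=(6, 6))

print(axes.shape)  # (3, 3)
```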

We can also use a very handy parameter, c, which allows us to color the dots in a scatter plot. This is extremely helpful when doing classification problems, where you will often set the color to the class label.

In [67]:
avo['type'].unique()
Out[67]:
array(['conventional', 'organic'], dtype=object)

Let's map the type field to the color of the dot in our price elasticity curve. To use the type field, we need to convert it from a string into a number. We can use pd.Series.apply() for this.

In [68]:
mapping_dict = {}
initial_class_label = 0
for type_label in avo['type'].unique():  # avoid shadowing the built-in type()
    mapping_dict[type_label] = initial_class_label
    initial_class_label += 1
mapping_dict
Out[68]:
{'conventional': 0, 'organic': 1}

We can see we have two unique type labels, conventional and organic. Although that is the case for this dataset, we build the mapping in code rather than typing it out by hand: each label is stored in a dictionary, and the number goes up by 1 for each new label. This way, if we receive an additional type label in the future, our code won't break. Always think about extensible code!
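One way to package this idea is a small helper function. The sketch below is only an illustration: the name build_label_mapping and the extra label 'hypothetical_new_type' are made up for the example. It also shows pd.factorize, a built-in Pandas function that produces the same kind of integer codes.

```python
import pandas as pd

def build_label_mapping(labels):
    """Map each distinct label to 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping

# Made-up labels, including one that is not in the avocado data.
types = pd.Series(['conventional', 'organic', 'conventional',
                   'hypothetical_new_type'])

mapping = build_label_mapping(types)
type_as_num = types.map(mapping)

# Built-in alternative: integer codes plus the array of unique labels.
codes, uniques = pd.factorize(types)

print(mapping)
print(type_as_num.tolist())
```

Because the function never hard-codes the label names, a third label simply receives the next number.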

In [ ]:
 

Now we can use this mapping_dict dictionary to map the values using .apply():

In [69]:
avo['type_as_num'] = avo['type'].apply(lambda x: mapping_dict[x])
avo[['type', 'type_as_num']].head(3)
Out[69]:
type type_as_num
0 conventional 0
1 conventional 0
2 conventional 0

Finally, we can use this binary class label as our c parameter to gain some insight:

In [76]:
avo.plot(kind='scatter', x='Total Volume', y='AveragePrice', \
        c='type_as_num', colormap='winter', figsize=(8,4))
plt.xlabel('Volume')
plt.savefig('./avo_price.png');
In [77]:
!ls
04-plotting-with-pandas.ipynb		 chocolate_ratings.csv
04-plotting-with-pandas-solutions.ipynb  international_football_results.csv
avocado.csv				 readme.md
avo_price.png

Amazing! It looks like the organic avocados (value of 1) totally occupy the lower volume, higher price bracket. Those dang kids with their toast and unicycles driving up the price of my 'cados!

Here, we can also see a 'more' continuous c parameter, year, which makes use of the gradient a little better. There are tons of gradients (colormaps) you can use; check them out in the Matplotlib colormap reference.
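A sketch of what that could look like, using a few made-up rows with the same column names as the avocado data (the numbers are invented, and 'viridis' is just one of Matplotlib's built-in colormaps):

```python
import pandas as pd

# Made-up rows that mimic the avocado columns used above.
toy = pd.DataFrame({
    'Total Volume': [64000, 55000, 118000, 79000, 51000, 95000],
    'AveragePrice': [1.33, 1.35, 0.93, 1.08, 1.28, 0.99],
    'year': [2015, 2015, 2016, 2016, 2017, 2018],
})

# c='year' colors each dot by its year; a colorbar acts as the legend.
ax = toy.plot(kind='scatter', x='Total Volume', y='AveragePrice',
              c='year', colormap='viridis', figsize=(8, 4))
```

With more than two distinct values in c, the whole gradient gets used, rather than just its two end colors.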

In [ ]:
 

Finally, we can save the plot to a file, using the plt.savefig() method:
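A minimal sketch of saving a plot, assuming a throwaway file name in the system's temporary directory (the path, dpi, and bbox_inches values are illustrative choices, not requirements):

```python
import os
import tempfile

import pandas as pd

# Any plot will do; this one uses made-up numbers.
ax = pd.Series([1, 3, 2, 4]).plot()

# Hypothetical output location, chosen so the example leaves the repo alone.
out_path = os.path.join(tempfile.gettempdir(), 'example_plot.png')

# Save the figure that owns this Axes.
ax.figure.savefig(out_path, dpi=100, bbox_inches='tight')

saved = os.path.exists(out_path)
print(saved)
```

In a notebook, call savefig in the same cell that draws the plot. With %matplotlib inline, the figure is closed once the cell finishes, so a plt.savefig() call in a later cell typically writes an empty image.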

In [ ]:
 

Summary

In this lesson, we showed examples of how to create a variety of plots using Pandas and Matplotlib. We also showed how to use each plot to effectively display data.

Do not be concerned if you do not remember everything — this will come with practice! Although there are many plot styles, many similarities exist between how each plot is drawn. For example, they have most parameters in common, and the same Matplotlib functions are used to modify the plot area.

We looked at:

  • Line plots
  • Bar plots
  • Histograms
  • Scatter plots

Additional Resources

Always read the documentation!