Python Development

🎉🎈🎂🍾🎊🍻💃

A hands-on, practical introduction to programming and Python development.

The purpose of this course is to introduce some fundamental concepts of software development. We will be using the Python programming language, which provides a readable, powerful syntax used by data scientists, web developers, and even NASA engineers! In particular, we'd like to introduce the pandas library, which is very widely used in Python for data science and visualization. Our aspiration in this workshop is to work up to a point where we can confidently level up our Python knowledge without external support from anyone.


Getting Started

Before we begin, let us explore some class tools and resources that we will be leveraging as we traverse this course. Additionally, let's take some time to set up our local dev environments so that we can run python on our machines!

Tools and Resources

Please find below important tools and resources that would be useful for class.

🎉 Introductory Slides

This will be one of the only two slide decks we ever get through in class. Use this resource to set high-level expectations for the class.

🎈Live Class Notes

Live class notes! Anything I write in my code editor will be beamed here for your convenience!

🎊 Slack

Class slack! This is how we communicate and keep in touch.

Setting Up Our Environment

Before we get into writing our code, we will have to install a few programs and tools.

Running / Testing Python Code

We will use REPL.IT as a quick, simple way to get started writing Python code. A REPL, or Read, Evaluate, Print, Loop, allows us to run Python code from our browser. You will need to create an account - but it's free!

After signing up, please visit this link and type in PYTHON to choose the correct python environment.

Download Sublime Text

Sublime Text is the code editor you'll be writing code in. It is a free tool, though it will occasionally prompt you to purchase a license; you can use the program for free as long as you'd like.

Setting up PythonAnywhere Account

Although a local Python setup will allow us to safely and happily write Python code on our own machines, it is in some ways limiting: we are not able to run long-standing processes or feed our code real-world inputs.

In order to truly achieve freedom to do anything we want with python, we must configure an environment in the cloud that is accessible via the internet.

Normally, this is an expensive and skills-intensive process. But! The Future is Now fam, and our service-based economy affords us the ability to set up a free(...mium) Python environment in the cloud for experimenting, with relative ease.

Please go to Python Anywhere and create a free account. If you find the service useful, feel free to upgrade later. For now, just create the account and verify that you can log in. We will have instructions for transferring some of our projects to the internet later on in the day.

🚗 Parking Lot

If you are interested, you may choose to download and run python locally. There are several ways to do this, an easy way is to follow the steps delineated in the next section.

Running Python Locally

Before we get into writing our code, we will have to install a few programs and tools. It may take about half an hour to pull off, but ultimately a properly established development environment will pay off in spades as we navigate the rest of our day.

Installing Python 3

Instructions vary slightly depending on what kind of machine you're using. Click the link below that applies to you:

Installation Instructions: Mac

Installation Instructions: Linux

Installation Instructions: Windows

Installation Instructions: Mac

Macs usually come with Python 2 already installed. We're going to run through some installation steps to make sure you've got the latest and greatest that Python has to offer.

1. Open up your terminal.

You can do this by pressing command+space bar and typing "terminal," or by locating the application and clicking on the icon.

2. Install XCode with the following command.

xcode-select --install

This may take a few minutes. Once it's done, you can run the following command to make sure it's installed properly.

xcode-select -p

Your output should look something like this:

/Applications/Xcode.app/Contents/Developer

3. Install Homebrew by running the following command.

Pro tip: Do not try to type this in. Copy and paste to make sure everything is correct. Do this by selecting the text with your cursor and pressing command+C. Then, go to your terminal and press command+V.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Once this command runs, type brew doctor on your terminal prompt. If you get the output Your system is ready to brew, you are ready to move on to the next step.

4. Add PATH environment variable.

This is a bit confusing, but basically we're setting the path up so your shell knows where to find the programs Homebrew installs.

open ~/.profile

The file should open up. Ask your instructor for help if it didn't. Copy and paste the following line at the bottom of this file:

export PATH=/usr/local/bin:/usr/local/sbin:$PATH

Save the changes and close the file.

5. Install Python 3 (finally!).

Homebrew, by default, gets the latest stable version of whatever you're trying to install.

brew install python

6. Create an alias for python3.

open ~/.bash_profile

Note: on macOS, Terminal opens login shells, which read ~/.bash_profile rather than ~/.bashrc.

At the bottom of that file, copy and paste the following lines:

alias python=python3
alias pip=pip3

Learn more about aliases here.

7. Restart your Terminal.

Right click (control+click on most Macs) on the Terminal icon in your application tray. Select Quit from the menu to make sure Terminal is fully stopped. Then, open it again (see Step 1).

Pro tip: Your settings won't be updated until Terminal is fully stopped and restarted. If you simply minimize the program, you will not see any updates!

8. Check version.

python --version

You will get something like this. As long as it starts with a 3, you're good to go!

Python 3.6.5

Now let's check pip, the package installer.

pip --version

pip 10.0.1 from /usr/local/lib/python3.6/site-packages/pip (python 3.6)

You want pip to be pointing to the Python 3.x version. If either python or pip are still pointing to version 2, please alert your instructor.

You are now in a development environment!

Installation Instructions: Linux

Pro tip: The instructions are for Ubuntu. If you have another version of Linux, please follow these suggested directions.

1. Open your terminal.

Either:

  • Click Ubuntu icon (upper-left corner) to open Dash. Then, type "terminal" and select Terminal from the results.

Or:

  • Hit the keyboard shortcut Ctrl - Alt + T.

2. Check to see if Python 3 exists.

Some distributions of Linux come with Python 3 already installed. How nice! To check if you have Python 3 already, run the following command:

python3 --version

If it gives you a version, you're good to go! Otherwise, move to Step 3.

3. Install Python 3.6.

sudo apt-get update
sudo apt-get install python3.6

Check again for the Python 3 version.

python3 --version

This time, things should be all good.

If you are still unable to get Python 3, please alert your instructor now.


Installation Instructions: Windows

Pro tip: Windows XP does not support Python 3.6; you will need to use Python 3.4 instead. Please ask your instructor for help if you plan on using Windows XP.

1. Download the Python installer.

Visit python.org and download the web-based installer for Windows. You'll find this under a "Files" section at the bottom of the page.

If you have 64-bit Windows, use the link that contains 64. If you have 32-bit Windows, download the one without 64. If you have no idea what you have, click here to learn how to find out.

2. Run the installer.

  • Make sure both Add Python 3.6 to PATH and Install for all users are checked.
  • Click Install Now.

3. Disable the path length limit.

After the initial installation is finished, the installer will show an additional option about the maximum path length limit. You want to disable this limit! Provide permission for this setting to be changed.

4. Open your terminal.

  • Click Start.
  • Open the Windows System menu.
  • Select Command Prompt.

5. Run the py command.

py

You should get a message telling you what version of Python you're using as well as opening an in-terminal REPL. If you did, great! Skip to the next step.

If you instead received an error message like the one below, something went wrong and Python didn't install correctly.

'py' is not recognized as an internal or external command,
operable program or batch file.

In this case, ask your instructor for assistance.

Windows 64-Bit or 32-Bit

Pro tip: These directions are for Windows 7 and Windows Vista operating systems. If you have Windows 10, you most likely have a 64-bit machine, but if you want to be extra sure, check here.

  1. Open "System" by clicking the "Start" button.

  2. Right click "Computer."

  3. Click "Properties."

  4. Under "System," you can view the system type.

This will give you a bunch of stats about your machine, including whether it is 32-bit or 64-bit.

  5. Return to Installation Instructions: Windows.

🚗 Parking Lot

Jupyter Notebooks

Jupyter is an open-source web application that allows us to run "live" Python code in "code" blocks and add explanatory text around it, describing the code and our methods.

In data science, this is of paramount importance because we are using code to tell a story - one that interprets a set of data and offers insight and/or conclusions.

Installation

Can be done locally, but we will leverage:

Colab

A google project.

Open the link above and sign in. Together, let's explore what a notebook can do!

Lectures

Please find a list of lectures here. Each lecture outlines the learning objectives and the corresponding topics that we hope to cover.

✅ Lecture 1: Installing Python

Objectives

  1. Get to know each other!
  2. Install python locally

Agenda

  1. Intros
  2. Tools
  3. Environment
  4. Install Python

✅ Lecture 2: Thinking Programmatically

Objectives

  1. Learn the essential words and concepts that are used on a daily basis by engineers and project/product managers on the job.

Agenda

  1. Essential Terminology

✅ Lecture 3: Basic Data Types

Objectives

  1. Understand what basic data types are in Python

Agenda

  1. Basic Data Types

✅ Lecture 4: Conditionals

Objectives

  1. Use comparison and equality operators to evaluate and compare statements.
  2. Use if/elif/else conditionals to achieve control flow.
  3. Create lists in Python.
  4. Print out specific elements in a list.
  5. Perform common list operations.

Agenda

  1. Conditionals
  2. Lists

Homework

Due Tuesday April 9th, 6:30PM

✅ Lecture 5: Lists

➡️ REMINDER

Homework 1 is due tonight!

Objectives

  1. Create lists in Python.
  2. Print out specific elements in a list.
  3. Perform common list operations.

Agenda

  1. Lists

✅ Lecture 6: Dicts

➡️ REMINDER

Homework 1 is due tonight!

Objectives

  1. Perform common dictionary actions.
  2. Build more complex dictionaries.

Agenda

  1. Dicts

Homework

Due Tuesday April 18th, 6:30PM

✅ Lecture 7: Loops

➡️ REMINDER

Homework 2 is due Thursday!

Objectives

  1. Understand how to write code that repeats itself
  2. Understand the different ways to create loops in python
  3. Use loops to iterate through lists and dicts

Agenda

  1. Loops

✅ Lecture 8: Loops - Practice Only

➡️ REMINDER

Homework 2 is due TODAY!

Objectives

  1. Understand how to leverage python modules
  2. Understand how to import and export modules
  3. Understand how to use virtual environments to "save" modules

Agenda

  1. Modules

✅ Lecture 9: Modules, Packages, & Functions

Objectives

  1. Understand how to leverage, import, and export python modules
  2. Understand how to use virtual environments to "save" modules
  3. Understand how to create and call functions

Agenda

  1. Modules & Packages
  2. Functions

✅ Lecture 10: Classes

🍕 Mid Course Survey 🍕

What this means

➡️ REMINDER

Homework 3 is due Tuesday April 30th!

Objectives

  1. Understand how to use classes in python
  2. Understand how inheritance works in python

Agenda

  1. Functions Review
  2. Classes

✅ Lecture 11: Classes Review

➡️ REMINDER

Homework 4 is due Tuesday May 7th!

Objectives

  1. Understand how to use classes in python
  2. Understand how inheritance works in python

Agenda

  1. Classes

✅ Lecture 12: Classes Review (Cont'd)

➡️ REMINDER

Homework 4 is due Tuesday May 7th!

Objectives

  1. Understand how to use classes in python

Agenda

  1. Classes

✅ Lecture 13: Intro to Data Science

Objectives

  1. Understand the basics of data science

Agenda

  1. Data Science

✅ Lecture 14: Pandas

Objectives

  1. Use Pandas to perform data science tasks

Agenda

  1. Pandas Basics
  2. Pandas Data Manipulation

✅ Data Analysis I

Objectives

  1. Use Pandas to perform exploratory data analysis

Agenda

  1. Exploratory Data Analysis w. 🐼

Data Analysis II

➡️ REMINDER

Homework 5 is due Tuesday May 21st!

Objectives

  1. Use Pandas to perform exploratory data analysis, II

Agenda

  1. Exploratory Data Analysis w. 🐼, II

Data Viz

➡️ FINAL PROJECTS

Project Requirements are due Tuesday June 4th!

Objectives

  1. Jupyter Notebooks
  2. Use Pandas to perform data visualizations.

Agenda

  1. Jupyter Notebooks
  2. 🐼 Data Viz

Independent Study

➡️ FINAL PROJECTS

Project Requirements are due Tuesday June 4th!

Objectives

  1. Work on final projects / ask questions.

Independent Study

➡️ FINAL PROJECTS

Project Requirements are due Tuesday June 4th!

Objectives

  1. Work on final projects / ask questions.

🎉 Fin.

🎉🎈🎂🍾🎊🍻💃

Objectives

  1. Final Presentations!
  2. 🍻🍻🍻

Topics

These are the main topics that we will explore in this course. These topics will be broken into Lectures, which is how we will organize each class.

Essential Terminology

Here are some words and concepts that will hopefully give you a more holistic view of the more technical aspects of the industry.

Define: Program

Discrete, highly logical and explicit instructions that are parsed and executed by a computer.

We call this set of human-readable instructions source code, or colloquially, a computer program.

Compilers can take this source code and transform it into machine code, a representation of the source that can be executed by the computer's central processing unit or CPU.

Not all programs are compiled, though; some are interpreted. The difference is that compiled languages require a separate step in which the source code is transformed into machine code ahead of time. With an interpreted language, that extra step is skipped: the source code is parsed and executed directly when the program is run.

How programs are written

All programs are composed of a collection of fundamental concepts that, when combined, can express a wide variety of tasks a computer can perform.

Here is a collection of the most important of these concepts:

Declarations

Typically, we can store and retrieve data in our programs by associating them with intermediary values that we call variables.

Expressions

We use expressions to evaluate things. For example, 2 + 2 is an expression that will evaluate to a value, namely 4.

  • NOTE: typically we can use expressions and declarations in tandem to perform complex tasks. For instance, we can reference a variable we declared in an expression to help us evaluate new values which can then be stored.

Statements & Control Flow

Statements will use expressions and declarations to alter a program's control flow, which is essentially the order in which declarations, expressions, and other statements are executed.

Aside from these fundamental concepts, we also talk a lot about this idea of algorithms. An algorithm is simply a series of declarations, expressions, and statements that can be used over and over again to solve well-defined problems of a certain type.

For example, we can implement an algorithm that converts temperature from fahrenheit to celsius. It would look something like this:

  1. Declare F = 32;
  2. Expression ( F - 32 ) / 1.8;
  3. Declare C = Evaluated expression from [2]

This is a form of pseudocode, where we define the steps that any computer program can take to convert Fahrenheit to Celsius.
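The same three steps, translated from pseudocode into actual Python (the variable names here are our own), might look like this:

```python
# 1. Declare the input temperature in Fahrenheit
f = 32

# 2. + 3. Evaluate the conversion expression and store the result
c = (f - 32) / 1.8

print(c)  # 0.0 - the freezing point of water
```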

The beauty of programming is that all of it revolves around the same key set of concepts and ideas. For this reason, we do not need to specify any particular programming language when discussing the functional aspects of a program.

Define: Programming languages

A programming language is a set of grammar and rules that we follow when writing source code.

Languages are effectively different approaches to communicating the same ideas in programming. We can communicate the same thing in both French and English; what mainly differs is the structure of our sentences and the actual words and sounds themselves.

The same analogy can be made with programming languages.

Examples of programming languages

There are many. Way too many.

Here are some of the most popular ones, though.

  1. JavaScript: this language is interpreted.
  2. Python: this language is interpreted.
  3. Java: this language is compiled
  4. Ruby: this language is interpreted.
  5. C/C++: this language is compiled.

These languages all build on the same concepts defined above; the main difference lies in how they are run (compiled vs interpreted) and also how they are used.

In general, anything programmable can be programmed in each of the languages defined above. However, some languages are better suited to certain tasks than others.

For example, to perform web programming on the front-end, you'll want to write JavaScript. This is because all browsers collectively support running JavaScript within their environments.

Why Learn Python

Here's a blog post from Dan Bader that outlines some data-driven reasons why learning Python right now can pay off: https://dbader.org/blog/why-learn-python

🚗 Practice: WE DO

Let's pseudocode a thermostat. User is able to:

  • Set a temperature
  • When room temp is less than set temp, turn on heat
  • Otherwise, turn off heat
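One way the steps above might translate into Python (the function name and return values here are our own invention): note that the heat should come on when the room is colder than the set temperature.

```python
def thermostat(set_temp, room_temp):
    """Return 'heat on' when the room is colder than the set point."""
    if room_temp < set_temp:
        return "heat on"
    return "heat off"

print(thermostat(70, 65))  # heat on
print(thermostat(70, 72))  # heat off
```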

🚗 Practice: YOU DO

Pseudocode Rock, Paper, Scissors!

Given two player inputs, p1 and p2 - where each selection can be one of: {"r", "p", "s"} - write a program that outputs the winner as:

  • p1, meaning player 1 has won
  • p2, meaning player 2 has won

Basic Data Types

Let's discuss data types, variables, and naming.

Variables

A data type is a unit of information that can be stored and retrieved using a programming language. We store data into, and retrieve data from, variables.

Creating a Variable

first_prime = 2

Reading a Variable

print(first_prime) # expect to see 2

PRACTICE

Naming Variables

In python, the best practice is to snake_case variables, where we delimit spaces within variable names with the _ character.

this_is_snake_cased = 1

Integers


example_int = 1
example_int_type = type(1) # <class 'int'>

Floats

Floats are numbers with a decimal point


example_float = 1.001
example_float_type = type(1.001) # <class 'float'>

Int/Float Operators

We can operate on integers/floats in the following ways

example_int = 1

another_int = example_int + 5 # addition
another_int = example_int * 5 # multiplication
another_int = example_int - 5 # subtraction
another_int = example_int / 5 # division
another_int = example_int % 5 # modulus operator
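Two more arithmetic operators worth knowing, not shown above, are floor division (//) and exponentiation (**):

```python
example_int = 7

print(example_int / 2)   # 3.5 - regular division always returns a float
print(example_int // 2)  # 3   - floor division drops the remainder
print(example_int % 2)   # 1   - modulus keeps only the remainder
print(example_int ** 2)  # 49  - exponentiation (7 squared)
```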

Strings

Sequences of characters are called "strings"

my_name = 'Taq Karim'
your_name = "John Smith" # single or double quotes are valid

string_type = type("testing") # <class 'str'>

You can also store several separate snippets of text within a single string. Let's say you're storing song lyrics, so you want to have a line break between each line of the song. To do this, you can use triple quotes i.e. ''' or """. You can use single and double quotes within the string freely, so no need to worry about that detail!

'''
'Cause if you liked it, then you should have put a ring on it
If you liked it, then you should have put a ring on it
Don't be mad once you see that he want it
If you liked it, then you should have put a ring on it
'''

String operators

We can "add" strings

print("this string" + "that string") # what does this output?

We cannot add strings to non-strings

print("this will not work" + 4) # 4 is not type str - this raises a TypeError

As a convenience, we can format strings like so:

a = 1
b = 2

formatted_string = f"{a} is {b}" # notice how a and b are formatted into the string even though they are ints

print(formatted_string) # "1 is 2"

Booleans

Booleans represent true/false


is_it_winter = True
is_it_warm_out = False

boolean_type = type(True) # <class 'bool'>

We use booleans primarily in conditional statements

Nonetype

None represents the absence of a value - for example, a variable that has not yet been given a meaningful value.

print(type(None)) # <class 'NoneType'>

Typecasting

Sometimes, we need to convert one data type to another. Typecasting allows us to convert between types.


# convert string to int
int('10') # 10 - but as type int
int('tasdfa') # throws a ValueError


# convert int to str
str(10) # '10' - but as type str


# convert int to bool
bool(10) # True
bool(0) # False

To check whether a value is an instance of a given type:


# check types
isinstance(-1, bool) # False
isinstance(False, bool) # True

# ..etc

🚗 Problems

How to use the PSETS Repo

🚗 Additional Resources

Conditionals

In order for code to be useful, it is imperative to have the ability to make decisions. In most languages, we use the conditional statement to facilitate decision making.

Before we dig deeper into conditionals, let us first examine the Boolean datatype.

Booleans

In short, a boolean represents a "yes" or "no" value. In python, booleans are written as:

True # this is a boolean, for "yes"
False # this is a boolean, for "no"

Because booleans are just datatypes, we can store them into variables.

is_it_summer = False
will_it_be_summer_soon = True

Moreover, because booleans are data types, certain operators will evaluate to booleans:

age = 13
is_eligible_to_buy_lotto = age > 13

# ^^ this will evaluate to False and then 
# that value, False, will be stored in variable
# is_eligible_to_buy_lotto

The operator above, >, is called a boolean operator. Notice how we stored the result of the > expression in a variable. Remember, booleans are just datatypes, therefore they work the same way we would expect numbers and strings to work - except that the operators look different and do different things (but in principle they are one and the same!)

Let's now explore the boolean operators available in python.

Greater Than / Greater Than or Equal To

my_money = 37.00
total = 35.00

enough_money = my_money > total # True
just_enough_money = my_money >= total # also True

Less Than / Less Than or Equal To

speed_limit = 65
my_speed = 32

under_speed_limit = my_speed < speed_limit # True
at_or_under_speed_limit = my_speed <= speed_limit # also True

Equal to / Not equal to

Because we use the = symbol for assignment (ie: to set a variable), it is not available for comparison operations. Instead, we must use the == and != symbols.

speed_limit = 65
my_speed = 32

are_they_equal = (speed_limit == my_speed) # False
are_they_not_equal = (speed_limit != my_speed) # True

Note that the parens are unnecessary here, but we add them anyway for the sake of clarity.

You may sometimes see the is keyword used in place of ==. Beware: is checks whether two names refer to the same object in memory, not whether two values are equal, so it should not be used for value comparisons. Stick with ==:

pi = 3.14

result = pi == 3.14 # True

Chaining comparison operators

x = 2
# a
1 < x < 3 # True

# b
10 < x < 20 # False

# c
3 > x <= 2 # True

# d
2 == x < 4 # True

For a, we check to see if 1 is less than x AND x is less than 3.

For b, we check to see if 10 is less than x (it is not) and stop right there

For c, we check to see if 3 is greater than x AND x is less than or equal to 2.

For d, we check to see if x is equal to 2 AND x is less than 4.
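A chained comparison is just shorthand for joining the individual comparisons with and (covered in the next section), which you can verify yourself:

```python
x = 2

chained = 1 < x < 3           # the chained form
expanded = (1 < x) and (x < 3)  # the equivalent expanded form

print(chained, expanded)  # True True
```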

Logical operators

In addition to comparison operators, python also offers support for logical operators - in the form of:

  1. not
  2. or
  3. and

not operator

The not operator simply negates. For instance,

is_it_cold = True

result = not is_it_cold # False

Likewise,

is_it_hot = False

result = not is_it_hot # True

or operator

The or operator evaluates to True if any one of the operands is true.

is_it_warm = True
is_it_cold = False
is_it_foggy = False

result = is_it_warm or is_it_cold or is_it_foggy # True

Will be True since at least one of the operands is True

and operator

The and operator evaluates to True if all of the operands are true.

is_it_warm = True
is_it_foggy = True
is_it_humid = True

result = is_it_warm and is_it_humid and is_it_foggy # True

Will be True since ALL of the operands are True

Membership operators

Membership operators are: in and not in. They are used to determine if a value is in a sequence, for instance:

line = 'a b c d e f g'

result = 'a' in line # True
result = 'z' in line # False
result = 'k' not in line # True
result = 'a' not in line # False
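Membership operators work on other sequences too, not just strings. For example, with a list (the variable name here is our own):

```python
valid_moves = ['r', 'p', 's']

print('r' in valid_moves)      # True
print('x' in valid_moves)      # False
print('x' not in valid_moves)  # True
```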

Conditional Statements

A conditional will attempt to evaluate an expression down to a boolean value - either True or False. Based on the boolean evaluation, the program will then execute or skip a block of code.

So for instance:

if True:
    print("this will always run!")

if False:
    print("this will NEVER run!")

However, since we know booleans to be datatypes, any of the operators discussed above can also be used:

temp = 43

if temp < 65:
    print("wear a jacket!")

The code above will only run if temp is less than 65.

We can also do something like:

temp = 43
is_it_raining = True

if is_it_raining and temp < 65:
    print('wear a jacket and bring an umbrella!')

In the example above, we make use of comparison operators and logical operators in a compound statement.

elses and elifs

If we have a condition that can only go two ways (ie: it will only be true or false), we can leverage the else statement:

temp = 43

if temp < 65:
    print('wear a coat!')
else:
    print('you will not need a coat!')

But what if we wanted support for multiple possibilities? That's where the elif statement comes in:

temp = 43

if temp < 30:
    print('wear a heavy jacket')
elif temp < 50:
    print('wear a light jacket')
elif temp < 60:
    print('wear a sweater')
else:
    print('you do not need any layers!')

In the example above, we print one of 4 possibilities - the elif allows us to go from 2 potential conditions to N potential conditions.

🚗 PSETS

The problems are reproduced below, but you will want to run them from the GitHub repo. First,

$ . ./update

🚗 1. Generate Traffic Light

from random import randint

randn = randint(1, 3) # generates a random number from 1 to 3
# if 1, print 'red'
# if 2, print 'yellow'
# if 3, print 'green'

🚗 2. Generate Phone Number w/Area Code

from random import randint

# generate a random phone number of the form:
# 1-718-786-2825
# This should be a string
# Valid Area Codes are: 646, 718, 212
# if phone number doesn't have this area code, pick
# one of the above at random

🚗 3. Play RPS


p1 = 'r' # or 'p' or 's'
p2 = 'r' # or 'p' or 's'

# Given a p1 and p2
# print 1 if p1 has won
# print 2 if p2 has won
# print 0 if tie
# print -1 if invalid input
# expects both p1 and p2 inputs to be either
# "r", "p", or "s"

🚗 4. Play RPS w/Computer

from random import randint

p1 = # randomly choose 'r' or 'p' or 's'
p2 = # randomly choose 'r' or 'p' or 's'

# Given a p1 and p2
# print 1 if p1 has won
# print 2 if p2 has won
# print 0 if tie
# print -1 if invalid input
# expects both p1 and p2 inputs to be either
# "r", "p", or "s"

🚗 5. Play RPS w/Input


p1 = # from user input
p2 = # from user input

# Given a p1 and p2
# print 1 if p1 has won
# print 2 if p2 has won
# print 0 if tie
# print -1 if invalid input
# expects both p1 and p2 inputs to be either
# "r", "p", or "s"

🚗 6. Play RPS w/Bad Input

This is the same as the original RPS problem, except that we cannot expect the input to be valid. While we want r or p or s, the input could be anything, like...

  • ROCK (all caps)
  • R (r but capitalized)
  • PAPrrRR (incorrectly spelled, upper/lowercased)

Implement conditional statements that will sanitize the user input or let the user know that their input is invalid.

p1 = # from user input
p2 = # from user input

# Given a p1 and p2
# print 1 if p1 has won
# print 2 if p2 has won
# print 0 if tie
# print -1 if invalid input
# expects both p1 and p2 inputs to be either
# "r", "p", or "s"

🚗 7. Play RPS against Computer


p1 = # from user input - we still want validation from above!
p2 = # randomly generated against computer

# Given a p1 and p2
# print 1 if p1 has won
# print 2 if p2 has won
# print 0 if tie
# print -1 if invalid input
# expects both p1 and p2 inputs to be either
# "r", "p", or "s"

🚗 8. Calculate Grade

grade = 15 # expect this to be a number

# write a program that will print the "letter" 
# equivalent of the grade, for example:
# when grade = 90 # -> expect A
# when grade = 80 # -> expect B
# when grade = 70 # -> expect C
# when grade = 60 # -> expect D
# when grade = 54 # -> expect F
# when grade = -10 # -> expect Error
# when grade = 10000 # -> expect Error
# when grade = "lol skool sucks" # -> expect Error

Challenge: Can you raise an error if unexpected input supplied vs just printing out Error? What's the difference?

🚗 9. Sign of Product

Given three numbers, a, b, c, without multiplying, determine the sign of their product.

EXAMPLE: a = -5, b = 6, c = -4, print 1

EXAMPLE: a = 5, b = 6, c = -4, print -1

🚗 10. Any Uppercase

Given a string str, determine if there are any uppercase characters in it. Use only conditional statements and string methods (you may have to look some up!)

EXAMPLE: str = "teSt", print True

🚗 11. IsEmptyString

Given a string that is empty or contains only spaces, of the form:

''
' '
'  '
# ...
'        ' # etc

determine whether the string is effectively empty or not (print True or False)

🚗 12. truthTableEvaluator

Given the following inputs:

P = # True or False
Q = # True or False
op = # '^' (logical AND, conjunction)
     # OR, 'v' (logical OR, disjunction)
     # OR, '->' (logical conditional, implication)
     # OR, '<->' (biconditional)

determine the correct outcome.

Info on truthtables

Lists

In order to begin to truly write dynamic programs, we need to be able to work with dynamic data, where we do not know in advance how many values of a certain type we have.

The problem, essentially, is that a variable holds only one item.

my_color = "red"
my_peer = "Brandi"

Lists hold multiple items - and lists can hold any datatype.

Creating lists

Here are some different ways to declare a list variable:

colors = ['red', 'yellow', 'green'] #strings
grades = [100, 99, 65, 54, 19] #numbers
bools = [True, False, True, True] #booleans

To create a new blank list, simply write blank_list = list() or blank_list = [].

Accessing Elements in the List

The list index means the location of something (an element) in the list.

List indexes start counting at 0!

List  | "Brandi" | "Zoe" | "Steve" | "Aleksander" | "Dasha"
Index |     0    |   1   |    2    |       3      |    4

my_class = ['Brandi', 'Zoe', 'Steve', 'Aleksander', 'Dasha']
print(my_class[0]) # Prints "Brandi"
print(my_class[1]) # Prints "Zoe"
print(my_class[4]) # Prints "Dasha"
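One handy detail: Python also accepts negative indexes, which count backwards from the end of the list; -1 is always the last element.

```python
my_class = ['Brandi', 'Zoe', 'Steve', 'Aleksander', 'Dasha']

print(my_class[-1])  # Prints "Dasha" - the last element
print(my_class[-2])  # Prints "Aleksander" - second to last
```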

Built-In Operations for Manipulating Lists

Add or Edit Items to a List

If you want to extend the content of a single list, you can use .append(), .extend(), or .insert() to add elements of any data type.

.append() & .extend(): These methods both add items to the end of the list. The difference here is that .append() will add whatever value or group of values you pass it in one chunk. In contrast, if you pass a group of values into .extend(), it will add each element of the group individually. Here are a few examples to show you the difference in outcomes.

# passing direct argument
x = ['a', 'b', 'c', 'd']
x.append(['e', 'f', 'g'])
print(x) # ['a', 'b', 'c', 'd', ['e', 'f', 'g']]

x = ['a', 'b', 'c', 'd']
x.extend(['e', 'f', 'g'])
print(x) # ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# passing argument within a var
x = ['a', 'b', 'c', 'd']
y = ['e', ('f', 'g'), ['h', 'i'], 'j']
x.append(y)
print(x) # ['a', 'b', 'c', 'd', ['e', ('f', 'g'), ['h', 'i'], 'j']]

x = ['a', 'b', 'c', 'd']
y = ['e', ('f', 'g'), ['h', 'i'], 'j']
x.extend(y)
print(x) # ['a', 'b', 'c', 'd', 'e', ('f', 'g'), ['h', 'i'], 'j']

Notice that .extend() only unpacks the top level of the list you pass it. It still added the tuple and list - ('f', 'g') and ['h', 'i'] - to our list x as their own single items.

.insert(index, value): If you want to add an item to a specific point in your list, you can pass the desired index and value into .insert() as follows.

# your_list.insert(index, item)

my_class = ['Brandi', 'Zoe', 'Steve', 'Aleksander', 'Dasha', 'Sonyl']
my_class.insert(1, 'Sanju')
print(my_class)
# => ['Brandi', 'Sanju', 'Zoe', 'Steve', 'Aleksander', 'Dasha', 'Sonyl']

l[index:index]=: To replace items in a list by their index position, you can use the same assignment syntax you'd use for a single variable. You simply reference which indices you want to replace and specify the new values.

x = ['Brandi', 'Sanju', 'Zoe', 'Steve', 'Aleksander', 'Dasha', 'Sonyl']
x[1] = 'Raju'
x[6:] = ['Chloe', 'Phoebe']
print(x) # ['Brandi', 'Raju', 'Zoe', 'Steve', 'Aleksander', 'Dasha', 'Chloe', 'Phoebe']

.join(): If you need to, you can compile your list items into a single string. Note that .join() is called on the separator string (e.g. '' or ' '), with the list passed in as the argument.

letters = ['j', 'u', 'l', 'i', 'a', 'n', 'n', 'a']
name = ''.join(letters)
print(name) # 'julianna'

words = ['this', 'is', 'fun']
sentence = ' '.join(words)
print(f'{sentence}.') # 'this is fun.'

.split('by_char'): You can also do the opposite - split values out of a string and turn each value into a list item. This doesn't work for splitting a single word into individual characters, but you can specify which character the method should split on. By default, .split() will split the string at each space.

x = 'this is fun'
sentence = x.split() # note - using default split char at space
print(sentence) # ['this', 'is', 'fun']

y = 'Sandra,hi@email.com,646-212-1234,8 Cherry Lane,Splitsville,FL,58028'
data = y.split(',')
print(data) # ['Sandra', 'hi@email.com', '646-212-1234', '8 Cherry Lane', 'Splitsville', 'FL', '58028']

Remove Items from a List

Likewise, you can use .pop() or .pop(index) to remove any type of element from a list.

.pop():

  • Removes the last item from the list and returns it.
# your_list.pop()

my_class = ['Brandi', 'Zoe', 'Steve', 'Aleksander', 'Dasha', 'Sonyl']
student_that_left = my_class.pop()
print("The student", student_that_left, "has left the class.")
# Sonyl
print(my_class)
# => ['Brandi', 'Zoe', 'Steve', 'Aleksander', 'Dasha']

.pop(index):

  • Removes an item from the list and returns it.
  • Takes an index to specify which item.
# your_list.pop(index)

my_class = ['Brandi', 'Zoe', 'Steve', 'Aleksander', 'Dasha']
student_that_left = my_class.pop(2) # Remember to count from 0!
print("The student", student_that_left, "has left the class.")
# => "Steve"
print(my_class)
# => ['Brandi', 'Zoe', 'Aleksander', 'Dasha']

Built-in Operators for Analyzing Lists

Python has some built-in operations that allow you to analyze the content of a list. Some basic ones include:

len(): This tells you how many items are in the list; can be used for lists composed of any data type (i.e. strings, numbers, booleans)

# length_variable = len(your_list)

my_class = ['Brandi', 'Zoe', 'Aleksander', 'Dasha']
num_students = len(my_class)
print("There are", num_students, "students in the class")
# => There are 4 students in the class

sum(): This returns the sum of all items in numerical lists.


# sum_variable = sum(your_numeric_list)

team_batting_avgs = [.328, .299, .208, .301, .275, .226, .253, .232, .287]
sum_avgs = sum(team_batting_avgs)
print(f"The total of all the batting averages is {sum_avgs}")
# => 2.409 (give or take a tiny floating-point rounding error)

min() & max():

These return the smallest and largest numbers in a numerical list respectively.

# max(your_numeric_list)
# min(your_numeric_list)

team_batting_avgs = [.328, .299, .208, .301, .275, .226, .253, .232, .287]
print(f"The highest batting average is {max(team_batting_avgs}")
# => 0.328
print("The lowest batting average is", min(team_batting_avgs))
# => 0.208

Sorting Lists

If you want to organize your lists better, you can sort them with the sorted() function. At the most basic level, you can sort both numerically and alphabetically.

Numbers - Ascending & Descending

numbers = [1, 3, 7, 5, 6, 4, 2]

ascending = sorted(numbers)
print(ascending) # [1, 2, 3, 4, 5, 6, 7]

To do this in descending order, simply add reverse=True as an argument in sorted() like this:

descending = sorted(numbers, reverse=True)
print(descending) # [7, 6, 5, 4, 3, 2, 1]

Letters - Alphabetically & Reverse

letters = ['b', 'e', 'c', 'a', 'd']

ascending = sorted(letters)
print(ascending) # ['a', 'b', 'c', 'd', 'e']

descending = sorted(letters, reverse=True)
print(descending) # ['e', 'd', 'c', 'b', 'a']

NOTE! You cannot sort a list that mixes data types which can't be compared with each other (e.g. strings and numbers) - doing so raises a TypeError.
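For example (a quick sketch of what happens, plus one possible workaround):

```python
mixed = ['apple', 3, True]

# sorting incomparable types raises a TypeError
try:
    sorted(mixed)
except TypeError as err:
    print('Cannot sort:', err)

# one workaround: compare everything by its string form
print(sorted(mixed, key=str))  # [3, True, 'apple']
```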

Tuples

Tuples are a special subset of lists in that they are immutable - they cannot be changed after creation.

We write tuples as:

score_1 = ('Taq', 100)

# OR

score_2 = 'Sue', 101

Tuples are usually denoted with () - though as score_2 shows, the parentheses are optional; it's the commas that make the tuple.

We read tuples just like we would read a list:

print(score_1[0]) # 'Taq'
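Tuples also support unpacking, which assigns each element to its own variable in one step - and trying to modify a tuple shows off the immutability:

```python
score_1 = ('Taq', 100)

# unpack the tuple into two named variables
name, points = score_1
print(name)   # 'Taq'
print(points) # 100

# modifying a tuple raises a TypeError
try:
    score_1[1] = 99
except TypeError:
    print('Tuples cannot be changed!')
```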

Sets

Sets are special lists in that they can only contain unique elements.

set_1 = {1,2,3,4,5} # this is a set, notice the {}
set_2 = {1,1,1,2,2,3,4,5,5,5} # this is still a set
print(set_2) # {1,2,3,4,5}

print(set_1 == set_2) # True

Sets are not indexed, so you cannot access, say, the 3rd element in a set. Instead, you can test membership:

print(2 in set_1) # True
print(9 in set_1) # False

Here's a helpful list of set operations.
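To give you a taste, here's a quick sketch of three of the most common set operations:

```python
evens = {2, 4, 6, 8}
primes = {2, 3, 5, 7}

print(evens | primes)  # union: every element in either set
print(evens & primes)  # intersection: elements in both sets
print(evens - primes)  # difference: in evens but not in primes
```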

🚗 1. Simple List operations

  1. Create a list with the names "Holly", "Juan", and "Ming".
  2. Print the third name.
  3. Create a list with the numbers 2,4, 6, and 8.
  4. Print the first number.

🚗 2. Editing & Manipulating Lists

  1. Declare a list with the names of your classmates
  2. Print out the length of that list
  3. Print the 3rd name on the list
  4. Delete the first name on the list
  5. Re-add the name you deleted to the end of the list
  6. You work for Spotify and are creating a feature for users to alphabetize their playlists by song title. Below is a list of titles from one user's playlist. Alphabetize these songs. playlist_titles = ["Rollin' Stone", "At Last", "Tiny Dancer", "Hey Jude", "Movin' Out"]
  7. Create a list with 6 numbers and sort it in descending order.

🚗 3. Math Operations

On your local computer, create a .py file named list_practice.py. In it:

  1. Save a list with the numbers 2, 4, 6, and 8 into a variable called numbers.
  2. Print the max of numbers.
  3. Pop the last element in numbers off; re-insert it at index 2.
  4. Pop the second number in numbers off.
  5. Append 3 to numbers.
  6. Print out the average number.
  7. Print numbers.

Additional Resources

Dict

In addition to lists, another more comprehensive way to store complex data is the dict, or dictionary. In the example below, we associate a key (e.g. 'taq') with a value (e.g. 'karim').

dict1 = {
  'taq': 'karim',
  'apple': 35,
  False: 87.96,
  35: 'dog',
  'tree': True,
  47: 92,
  # etc.
}

print(dict1) # {'taq': 'karim', 'apple': 35, False: 87.96, 35: 'dog', 'tree': True, 47: 92}

The values in a dict can be any valid Python data type, but there are some restrictions on what you can use as keys. Keys CAN be strings, integers, floats, booleans, and tuples. Keys CANNOT be lists or dicts. Do you see the pattern here? The data in a dict key must be immutable. Since lists and dicts are mutable, they cannot be used as keys in a dict.

NOTE! The keys in a dict must be unique as well. Be careful not to add a key to a dict a second time. If you do, the second item will override the first item. For instance, if you upload data from a .csv file into a dict, it would be better to create a new dict first, then compare the two to check for identical keys and make any adjustments necessary.

One last thing before we move past the nitty gritty -- the keys and values of a single dict don't have to be homogeneous. In other words, you can mix and match different key, value, and key-value pair data types within one dict, as seen above.

Creating Dicts

There are several ways you can create your dict, but we'll go through the most basic ones here.

1. The simplest is to create an empty dict with the dict() function.

students = dict() # this creates a new, empty dict

2. You can create a dict by passing in key value pairs directly using this syntax:

food_groups = {
    'pomegranate': 'fruit',
    'asparagus': 'vegetable',
    'goat cheese': 'dairy',
    'walnut': 'legume'
}

3. You can also convert a list of tuples into a dict using dict()...

# list of tuples
list_of_tuples = [("Hello", 7), ("hi", 10), ("there", 45), ("at", 23), ("this", 77)]

word_frequency = dict(list_of_tuples)
print(word_frequency) # {'Hello': 7, 'hi': 10, 'there': 45, 'at': 23, 'this': 77}

4. ...and even combine two lists to create a dict by using the zip() method.

The zip() function takes the two lists as arguments - the first list will become the dict's keys, and the second list will become the dict's values. NOTE! This only works if each key and its value share the same index position in their original lists (so they will match in the dict).

names = ['Taq', 'Zola', 'Valerie', 'Valerie']
scores = [[98, 89, 92, 94], [86, 45, 98, 100], [100, 100, 100, 100], [76, 79, 80, 82]]

grades = dict(zip(names, scores))
print(grades) # {'Taq': [98, 89, 92, 94], 'Zola': [86, 45, 98, 100], 'Valerie': [76, 79, 80, 82]}
# note: because keys must be unique, the second 'Valerie' pair overrode the first

Accessing Dict Data

Once you've stored data in your dict, you'll need to be able to get back in and access it! Take a look at this dict holding state capitals.

state_capitals = {
    'NY': 'Albany',
    'NJ': 'Trenton',
    'CT': 'Hartford',
    'MA': 'Boston'
}

We can access each value in the dict by referencing its key like so:

MAcap = state_capitals['MA']
print('The capital of MA is {}.'.format(MAcap)) # 'The capital of MA is Boston.'

Attempting to look up a key that does not exist raises a KeyError. You also can't access dict items with index numbers like you do with lists! If you try, you will get a KeyError - because an index number is not a key.

print(state_capitals['PA']) # KeyError from missing key
print(state_capitals[2]) # KeyError from index reference

Instead, it's safer to look up a key in a dict using .get(key, default). The .get() method takes the key argument just as above EXCEPT it allows you to enter some default value it should return if the key you enter does not exist (without one, it returns None). Usually, we use [] as that value.

print(state_capitals.get('PA', []))
# PA is not in our dict, so .get() returns []

Now, this dict has 4 keys, but what if it had hundreds? We can retrieve data from large dicts using .keys(), .values(), or .items().

pets_owned = {
  'Taq': ['teacup pig','cat','cat'],
  'Francesca': ['llama','horse','dog'],
  'Walter': ['ferret','iguana'],
  'Caleb': ['dog','rabbit','parakeet']
}

pets_owned.keys() # ['Taq', 'Francesca', 'Walter', 'Caleb']

pets_owned.values() # [['teacup pig','cat','cat'], ['llama','horse','dog'], etc. ]

pets_owned.items() # [('Taq', ['teacup pig','cat','cat']), ('Francesca', ['llama','horse','dog']), etc.]

Built-in Operators for Manipulating Dicts

Just like lists, you can edit, analyze, and format your dicts. Some work the same for dicts and lists such as len(). However, adding, deleting, and updating data requires a little more detail for dicts than for lists.

Add or Edit Dict Items

We can add a single item to a dict...

state_capitals = {
    'NY': 'Albany',
    'NJ': 'Trenton',
    'CT': 'Hartford',
    'MA': 'Boston'
}

state_capitals['CA'] = 'Sacramento'

print(state_capitals) # {'NY': 'Albany', 'NJ': 'Trenton', 'CT': 'Hartford', 'MA': 'Boston', 'CA': 'Sacramento'}

...but more likely you'll want to make bulk updates to save yourself time. To do so, you can use the .update() method to add one or more items to the dict.

state_capitals = {
    'NY': 'Albany',
    'NJ': 'Trenton',
    'CT': 'Hartford',
    'MA': 'Boston',
    'CA': 'Sacramento'
}
more_states = {
    'WA': 'Olympia',
    'OR': 'Salem',
    'TX': 'Austin',
    'NJ': 'Hoboken',
    'AZ': 'Phoenix',
    'GA': 'Atlanta'
}

state_capitals.update(more_states)

print(state_capitals)
""" # prints out...
{'NY': 'Albany',
'NJ': 'Hoboken',
'CT': 'Hartford',
'MA': 'Boston',
'CA': 'Sacramento',
'WA': 'Olympia',
'OR': 'Salem',
'TX': 'Austin',
'AZ': 'Phoenix',
'GA': 'Atlanta'} """

Notice something? It's easy to accidentally override items when you're merging datasets. Oops, we just changed the capital of NJ to Hoboken! Don't worry though - we'll learn an easy way to check for duplicate keys in the next section.

Remove Items from a Dict

.clear() simply empties the dict of all items.

.pop(): This removes an item, which you must specify by key. There are two things to note here -

First, you cannot delete a dict item by specifying a value. Since values do not have to be unique the way keys are, trying to delete items by referencing values could cause issues.

Second, just like we saw earlier with .get(key, value), .pop(key) will raise a KeyError if you try to remove a key that does not exist. We avoid this in the same way, by passing a default value - typically [] - for the method to return in case of a missing key.

Unfortunately, you can't use the same method as we did for .update() to delete larger portions of data. We'll learn a way to do that in the next section.

state_capitals.pop('AZ', [])
# removes 'AZ': 'Phoenix' from our dict

.popitem(): This one removes the most recently added key-value pair from the dict and returns it as a tuple. (Before Python 3.7, it removed an arbitrary pair.)

seceded1 = state_capitals.popitem()
# ^ removes the last item and returns it as a tuple
print(seceded1) # ('GA', 'Atlanta')

Loops

Iterating with Loops

In programming, we define iteration to be the act of running the same block of code over and over again a certain number of times. For example, say you want to print out every item within a list. You could certainly do it this way -

visible_colors = ["red", "orange", "yellow", "green", "blue", "violet"]
print(visible_colors[0])
print(visible_colors[1])
print(visible_colors[2])
print(visible_colors[3])
print(visible_colors[4])
print(visible_colors[5])

Attempting to print each item in this list - while redundant - isn't so bad. But what if there were over 1000 items in that list? Or, worse still, what if that list changed based on user input (ie: either 10 items or 10000 items)?

To solve such problems, we can create a loop that will iterate through each item on our list and run the print() function. This way, we only have to write the print() one time to print out the whole list!

When you can iterate through an object (e.g. a string, list, dict, tuple, set, etc.), we say that the object is iterable. Python has many built-in iterables, and the itertools module provides a collection of tools for working with them efficiently (read more about itertools here).

You can also define your own Python iterables using the principles of OOP (object-oriented programming). In fact, Python features a construct called a generator to simplify this process for you.
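To give you a taste, here's a minimal generator sketch - the yield keyword hands back one value at a time, pausing the function between values:

```python
def countdown(n):
    # each yield pauses here until the loop asks for the next value
    while n > 0:
        yield n
        n -= 1

for num in countdown(3):
    print(num)  # prints 3, then 2, then 1
```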

the while loop

This is the simplest loop and has two primary use cases.

Counting

i = 0
while i < 10:
    print(i) # prints the numbers 0 through 9
    i += 1
print(i) # prints 10, the value of i once the loop has stopped

What is happening here is we are running the code block within the while loop 10 times. We know to stop because the boolean comparison will evaluate to False once i reaches 10, which is possible only because i is being incremented each time we run i += 1.

Booleans

Here's a real-life scenario where you might apply a while loop. Let's say you've programmed your Amazon Echo or Google Home to make a pot of coffee whenever you say the trigger word "tired". Once you say tired, here's a simplified pseudo-code version of what happens behind the scenes:

tired = True
while tired:
  print('I\'ll make some coffee!') # this might be a "say" command
  # code to turn on coffee maker
  tired = False

Whenever a pot of coffee is made, the smart device sets tired back to False. Next time you say "tired", it will reset tired to True.

the for loop

Let's go back to that list of colors we wanted to print out and use a for loop. The most important part of the for loop is the statement for item in obj. This means the code considers each item in the iterable one at a time when executing the code below.

# Syntax:
# for <item> in <iterable>:
#     <statement(s)>


visible_colors = ["red", "orange", "yellow", "green", "blue", "violet"]
for color in visible_colors:
  print(color)

Loops with Ranges

range() vs. the enumerate() Object

If you want to iterate a specific number of times, or keep track of each item's position as you go, the range() and enumerate() functions can facilitate this.

range():

With while loops, we saw one way to iterate while counting. Using range() with a for loop allows us to be more concise and more specific. The range() function uses this syntax: range(<begin>, <end>, <stride>). It returns an iterable that yields integers starting with <begin>, up to but NOT including <end>. The <stride> argument isn't required, but if specified, it indicates an amount to skip between values. For example, range(5, 20, 3) would iterate through 5, 8, 11, 14, and 17. If <stride> is omitted, it defaults to incrementing by 1.

Consider the differences in the loops below:

# numeric range with a while loop
i = 0
while i < 5:
  print(i) # prints numbers 0, 1, 2, 3, 4
  i += 1


# numeric range with a for loop & range()
x = range(0, 5)
for i in x:
    print(i) # prints numbers 0, 1, 2, 3, 4

enumerate():

When you iterate through an object, enumerate() allows you to keep track of each item's index position. It yields the index and the item together as a tuple.

test_scores = [100, 68, 95, 84, 79, 99]
for idx, score in enumerate(test_scores):
  print(idx, score)
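enumerate() also accepts an optional start argument if you'd rather count from something other than 0:

```python
test_scores = [100, 68, 95]
# start=1 makes the first index 1 instead of 0
for position, score in enumerate(test_scores, start=1):
    print(position, score)
```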

Control Flow with break, continue, & else:

Something very important to watch out for here is falling into an infinite loop. This is one of the most common traps and can make your code go crazy running the loop over and over without moving through the rest of the program!

The break keyword, the continue keyword, and the else: statement are three core ways to help control the flow and logic within your loops.

The break Keyword

In a Python loop, the break keyword escapes the loop, regardless of the iteration number and regardless of how much of the loop code it has completed on its current iteration. Once a break executes, the program will continue to execute after the loop.

We might use a break statement if we only want the loop to iterate under a certain condition. For example:

a = ['foo', 'bar', 'baz', 'qux', 'corge']
while a:
    if len(a) < 3:
        break
    print(a.pop())
print('Done.')

## This loop will output...
"""
corge
qux
baz
Done.
"""

Let's walk through the logic of how we got that outcome:

a = ['foo', 'bar', 'baz', 'qux', 'corge']
while a:
  • ^^^ This tells us that as long as a is truthy - essentially, as long as it still contains items - go ahead with the next loop iteration.
if len(a) < 3:
        break
    print(a.pop())
  • ^^^ This says that, if the length of a is less than 3, break out of the loop. In the first iteration, a has 5 items. Given this, the break is not executed. Instead, the code removes the last item from a and prints it. Once the loop gets to the 4th iteration, len(a) is 2. This triggers the break.

After that, the program goes to the next line of code after the break, in this case print('Done.').

This works the same with a for loop as in the example below. Can you think through why we get the outcome foo here?

for i in ['foo', 'bar', 'baz', 'qux']:
  if 'b' in i:
    break
  print(i) # foo

The continue Keyword

You can also use the continue keyword to interrupt the loop code. The difference is that the continue keyword escapes only the current iteration. A break escapes the loop entirely and goes on to execute the code immediately following the loop. A continue tells the program to stop where it is within the current iteration and skip to the next iteration of the loop.

Here's an example using a while loop. Notice that the continue applies to the outer while loop, whereas the break applies only to the inner while loop.

# Builds up the string '3foo3bar1foo1bar'
s = ''

n = 5
while n > 0:
    n -= 1
    if (n % 2) == 0:
        continue

    a = ['foo', 'bar', 'baz']
    while a:
        s += str(n) + a.pop(0)
        if len(a) < 2:
            break

print(s) # '3foo3bar1foo1bar'

As the program iterates through the decreasing values of n, it determines whether each value is even. The continue executes only for these even-number iterations. Then the loop continues to the next iteration. Thus, the inner while loop only initiates when n is 3 and 1.

Inside the inner while loop, a.pop(0) removes the first item of a. Once this has occurred twice, yielding 'foo' and 'bar', a has fewer than two items, and the break terminates the inner loop. Thus, the values concatenated onto s are, in turn, 3foo, 3bar, 1foo, and 1bar.

Again, this works the same with for loops like so:

for i in ['foo', 'bar', 'baz', 'qux']:
  if 'b' in i:
    continue
  print(i) # foo, qux

The else Statement

The else clause is triggered once the loop has finished all of its iterations. Now, you might wonder why you'd use this, because a statement placed right after the loop will also execute once the loop has finished.

Here's the difference:

Statements after the loop will always execute. But if you place additional statements in an else clause, the program will only execute them if the loop terminates by exhaustion. In other words, it only executes if the loop fully completes each iteration until the controlling condition becomes false. If a break terminates the loop before that, for example, the else clause won't be executed.

a = ['foo', 'bar', 'baz', 'qux', 'corge']
while a:
    print(a.pop())
else:
    print('Done.') # corge, qux, baz, bar, foo, Done.

And again, here are for loop examples where the else statement will and will NOT execute:

# else DOES execute
for i in ['foo', 'bar', 'baz', 'qux']:
  print(i)
else:
  print('Done.') # foo, bar, baz, qux, Done.

# else DOES NOT execute
for i in ['foo', 'bar', 'baz', 'qux']:
  if i == 'bar':
    break
  print(i)
else:
  print('Done.') # foo

Here, i == 'bar' evaluates to True during the second iteration. Even though the third and fourth iterations could have printed when evaluated by the conditional, the break executed before the loop got there. Therefore, the loop did not exhaust all viable iterations and it does not trigger the else statement.

Infinite Loops

Infinite loops can occur when there is not proper control flow in the loop's code. See if you can figure out why this loop is infinite.

a = ['foo', 'bar', 'baz', 'qux', 'corge']
while a:
    if len(a) < 3:
        continue
    print(a.pop())
print('Done.')

Got it? After the first three iterations, a shrinks to fewer than three items and executes a continue statement. It then returns to the beginning of the loop, where it will find that a still has fewer than three items. So it goes back to the beginning again... and again and again and again...

Your program will get stuck here, so you want to make sure you pay special attention to the control flow when you write loops!
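One possible fix (a sketch): make sure a shrinks on every iteration, so the continue can never starve the loop of progress.

```python
a = ['foo', 'bar', 'baz', 'qux', 'corge']
while a:
    item = a.pop()  # a gets shorter on EVERY iteration now
    if len(a) < 3:
        continue    # skips the print, but the loop still makes progress
    print(item)
print('Done.')  # prints corge, qux, then Done.
```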

Iterating Through Dicts

Iterating over dicts is slightly more complicated than other iterables because each item consists of two elements, specifically mapped to each other. That said, you can do some really cool stuff with your dicts using loops!

Iterate Through Dict Items

Let's start with a few simple examples. This first one iterates over the dict by each item, i.e. each key-value pair.

transaction = {
  "amount": 10.00,
  "payee": "Joe Bloggs",
  "account": 1234
}

for key, value in transaction.items():
    print("{}: {}".format(key, value))

# Output:
amount: 10.0
payee: Joe Bloggs
account: 1234

Iterate Through Dict Keys

Looping over a dict directly iterates through its keys, so you can get at everything with just the keys. Notice the loop below results in the same output as the one above that iterated through items.

for key in transaction:
    print("{}: {}".format(key, transaction[key]))

# Output:
amount: 10.0
payee: Joe Bloggs
account: 1234

Sorting Dicts with Loops

You can also sort a dict by iterating through its keys.

for key in sorted(transaction): # this is the only difference
    print("{}: {}".format(key, transaction[key]))

# Output:
account: 1234
amount: 10.0
payee: Joe Bloggs

Sort the Values of Each Key in a Dict

Note that this does not sort the dict itself by the values in each item. Because the keys are the unique element of a dict, what you can do is sort the values stored within each key.

dict1 ={ 
  "L1":[87, 34, 56, 12], 
  "L2":[23, 00, 30, 10], 
  "L3":[1, 6, 2, 9], 
  "L4":[40, 34, 21, 67] 
}

for i, j in dict1.items(): 
  sorted_dict = {i:sorted(j)} # here is sorting!
  dict1.update(sorted_dict)

print(dict1)
""" # prints out...
{'L1': [12, 34, 56, 87],
'L2': [0, 10, 23, 30],
'L3': [1, 2, 6, 9],
'L4': [21, 34, 40, 67]
} """

Modules & Packages

In Python, a module is a Python source file that contains pre-defined objects like variables, functions, classes, and other items we'll talk about soon. A Python package, sometimes used synonymously with the term library, is simply a collection of Python modules. The diagram below shows this hierarchy visually.

[Diagram: a package is a collection of modules; each module contains variables, functions, and classes]

Essentially, packages and modules are a means of modularizing code by grouping functions and objects into specific areas of focus. For instance, the statsmodels module (here) contains code useful to a data scientist. The Pyglet library (here) contains code useful to game developers needing shortcuts for 3D game animation. But neither would be of much use to the other audience.

Modular programming allows us to break out modules and packages dealing with specific topics in order to make the standard library more efficient for the general public. It's sort of like "a la carte" code. This becomes especially valuable once you scale your programs. Who needs that extra baggage?

Global vs. Local Scope

One of the reasons Python leverages modular programming is because it helps avoid conflicts between local and global variables by creating separate namespaces. Namespaces are the places where variables are stored, and they exist on several independent levels, including local, global, built-in, and nested namespaces. For instance, the functions builtins.open() and os.open() are distinguished by their namespaces. Namespaces also aid readability and maintainability by making it clear which module implements a function.

At a high level, a variable declared outside a function has global scope, meaning you can access it inside or outside of functions. A variable declared within a function has local scope, which means you can only access it within the function where you created it. If you try to access it outside that function, you will get a NameError telling you that the variable is not defined.
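A quick sketch of the difference (the variable names here are just for illustration):

```python
greeting = 'hello'  # global scope: visible everywhere in this file

def make_message():
    suffix = ' world'         # local scope: exists only inside this function
    return greeting + suffix  # globals ARE visible inside functions

print(make_message())  # hello world

try:
    print(suffix)  # locals are NOT visible out here
except NameError:
    print('suffix is not defined outside the function')
```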

We'll get more into how to use and interpret local and global scope as we dive into modules and functions...

Importing Modules & Packages

Importing modules and packages is very easy and saves you a lot of time you'd otherwise spend reinventing the wheel. Modules can even import other modules! The best practice is to place all import statements at the top of your script file so you can easily see everything you've imported at a glance.

Importing Modules

Let's look at a few different ways to import modules and their contents. The simplest way to import a module is to simply write import module_name. This will allow you to access all the contents within that module.

If you want to easily find out exactly what is in your newly imported module, you can call the built-in function dir() on it. This will list all types of names: variables, modules, functions, etc.

import math
dir(math)
# prints ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', ... etc.]

You can also import one specific object from a module like this:

from math import sqrt
sqrt(25) # 5.0

Notice that we did NOT include math. when we called the sqrt function. Because we imported sqrt directly into our namespace with the from syntax, we can call it on its own. With a plain import math, sqrt keeps its scope within the math module, and you'd have to write math.sqrt(25).

You can also rename modules and their items on import to keep your code concise:

from math import sqrt as s
s(25) # 5.0

By importing sqrt as s, you can call the function as s() instead of sqrt(). The same works for modules. Note the difference in how we reference the square root function though...

import math as m
m.sqrt(25) # 5.0

...we only renamed the module in this import and not the function. So we have to go back to the module_name.function() syntax. However, because we renamed the module on import, we can reference it in function calls by its shortened name, i.e. m.sqrt.

Managing Dependencies

In addition to "built-in" modules, we have the ability in python to create, distribute and most importantly consume community defined python modules.

This is powerful because anyone who builds something useful has the ability to share it with the larger Python community. Creating and distributing Python modules is outside the scope of this class, but we can consume any module we'd like by running:

pip install [module_name]

Modules can be found in PyPI, the Python Package Index. Any module registered in PyPI is installable via pip.

However, in order to safely install modules across projects (ie: perhaps project A requires module 1 v1, but project B, started a year later, needs module 1 v2), we need to create what are called virtual environments: isolated Python environments where we can safely install our pip modules and rest assured that they don't interfere with other projects or the system at large.

In order to create a virtual environment:

python3 -m venv .env
source .env/bin/activate

The .env folder contains everything needed for this "virtualenv". We go inside the env by running the source .env/bin/activate command. To deactivate (while in the virtualenv):

deactivate

The best part about this is not only can we install our pip modules safely, we can also do this:

pip freeze > requirements.txt

This will collect all the pip modules installed in the virtual env and store them in a file (which we are calling requirements.txt). This is useful because if we ever wanted to run this software from a different computer, all we would have to do is pull down the python files, create a new virtualenv and then:

pip install -r requirements.txt

and this would effectively "copy" our installed modules into the new virtualenv.

Common & Featured Modules & Packages

Functions

In Python, functions are your best friends! Let's say you need to perform some action or calculation multiple times for multiple values. For example, you might want to convert temperatures in Celsius to Fahrenheit like you did in the last chapter's exercises. It would be inefficient and messy to copy that code every time you need it. Instead, you can define a function to contain that code. Every time you call that function, it runs the whole block of code inside and saves you lots of time. Sweet!

Python includes lots of built-in functions and methods in its main library. We've seen lots of these already like len(), sum(), .append(), .popitem(), etc. You can extend the range of functions available to you by importing modules. We'll talk about those next!

Elements of a Function

For now, let's start with the basics. Here's the skeleton of a function and a breakdown of each part.

def function_name(parameters):
    """docstring"""
    # statement(s)
  • def shows you are "defining" a new function
  • A unique function name; it follows the same naming rules as variables
  • Optional parameters, or arguments, to be passed into the function when it is called
  • : ends the function header
  • An optional docstring, i.e. a comment with documentation describing the function
  • At least one statement makes up the "function body"; this code achieves the purpose for calling the function
  • An optional return statement, which exits the function and passes out some value from the body code

NOTE! It is a best practice to always create notes and documentation. Other potential users of your functions - and maybe future YOU - will thank you for the extra info.
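Putting those elements together, here's a minimal sketch (greet() is a made-up example, not from the exercises). Note that the docstring isn't just a comment: Python stores it on the function object, which is exactly why future users will thank you for it.

```python
def greet(name):
    """Return a friendly greeting for the given name."""
    message = f'Hello, {name}!'  # function body
    return message               # return statement

print(greet('Ada'))     # Hello, Ada!
print(greet.__doc__)    # Return a friendly greeting for the given name.
```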

Input/Output: Function Arguments & The return Statement

When you create a function, you might need to feed it some input and have it give back some output. We call function input arguments and function output return values. Remember - both arguments and return values are optional depending on the purpose of your function.

Let's say we want to create a function to get the square of a number. At the most basic level, there are three parts:

  1. Input the number we want to square
  2. Calculate the square of that number
  3. Output the square of that number

Let's implement this in a function called num_squared().

def num_squared(num):
    """Find the square of some number passed in"""
    square = num*num # code to find the square
    return square
  1. Input the number we want to square: We create a parameter called num to represent the number we will pass into our function as an argument. (P.S. Parameters are the names used when defining a function.) Remember that arguments should always be passed in the correct format and positional order, or the function will not be able to recognize them.
  2. Calculate the square of that number: Using the value of num, we write the formula for calculating a square and assign it to the variable square.
  3. Output the square of that number: We return square to pass out the numeric value we calculated. The return statement exits the function so the program can move on to the next block of code you've written. If you don't specify a value to return, the function will return None by default in order to exit.

Once we've written this logic, we can call num_squared() every time we want to use it. Let's say we want to find the value of 12 squared...

sq12 = num_squared(12)
print(sq12) # 144

NOTE! You should store the function call within a var so that the return value gets stored in the var. If you don't, how will you access the output you wanted??

One last thing - you should know that the return statement can return multiple values by using tuples. Once you return the tuple from the function, you can unpack its values by simultaneously assigning each one to a new var as follows...

def mixed_bag():
    # some function...
    return 3, 'a', True

x, y, z = mixed_bag()
print(x, type(x)) # 3 <class 'int'>
print(y, type(y)) # a <class 'str'>
print(z, type(z)) # True <class 'bool'>

Argument Types

Required Arguments

If your function won't work without specific arguments, you can define the function with required arguments. In order for anyone to call the function, that user must always pass values for the required arguments in the correct positional order with the correct syntax you defined in advance. For example...

def plus(a,b):
  return a + b

c = plus(8,12)
print(c) # 20

Keyword Arguments

Now switch perspectives. You're using a function that your colleague defined. If you want to make sure that you call all the required arguments in the right order, you can use the keyword arguments in your function call. Essentially, this means that you mention each argument's parameter name when you assign it a value during the function call. It works like this...

def plus(a,b):
  return a + b

c = plus(a=8,b=12)
print(c) # 20
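One handy consequence of naming the arguments: their positional order no longer matters. A quick sketch reusing the same plus() example:

```python
def plus(a, b):
    return a + b

# Keyword arguments can be passed in any order:
c = plus(b=12, a=8)
print(c)  # 20
```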

Default Arguments

Back to writing our own functions! If you want, you can give your function a default argument. Functions with default arguments take some pre-defined default value if no argument value is passed when you call the function. When defining your own function, you can assign this default value like this:

def plus(a,b = 12):
  return a + b
  
# Only passing a value for `a`...
c = plus(a=8)
print(c) # 20

# ...vs. passing values for `a` and `b`
c = plus(8, 17)
print(c) # 25

Variable Number of Arguments

Even if you're not sure how many arguments you will need to pass to your function, you can still define it. To do this, you use the parameter *args as a stand-in. This signals to the function that it should expect any variety of arguments. Let's take a look at a few different ways to implement this.

Using integers (as we did in the earlier examples)

def plus(*args):
  return sum(args)

c = plus(8,12,17)
print(c) # 37

Using different data types

def length(*args):
  list1 = [*args]
  return len(list1)

c = length(8,'a',True)
print(c) # 3

Using a variable

var1 = 'h' + 'i'
def print_all(*args):
  list1 = [*args]
  return list1

c = print_all(8,'a',True,var1)
print(c) # [8, 'a', True, 'hi']

NOTE! If you use *args, your function will be more flexible, but only if you write it that way. If you expect different types of arguments, you will have to write the function such that it can handle every use case you expect could occur.
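To illustrate that note, here's a hypothetical sketch (total_length is a made-up function) that inspects each argument's type before deciding how to handle it:

```python
def total_length(*args):
    """Sum the lengths of sized arguments; count everything else as 1."""
    total = 0
    for arg in args:
        if isinstance(arg, (str, list, tuple)):
            total += len(arg)   # sized types contribute their length
        else:
            total += 1          # numbers, booleans, etc. count as one item
    return total

print(total_length('abc', [1, 2], 42))  # 3 + 2 + 1 = 6
```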

Variable Scope Recap

  • global variable: a variable declared outside a function; any function in your script can access this
  • local variable: a variable declared within a function's code block; you can only access this variable within the function where it is declared, otherwise you will get a NameError telling you that variable is not defined.
x = 'I\'m a global variable.'

def foo():
    x = 'I\'m a local variable.'
    print(x) # I'm a local variable.
    return x

y = foo()

print(x) # I'm a global variable.
print(y) # I'm a local variable.

Notice that even though the function foo() above says return x, it only returns the value of the local variable x. We assign this value to the variable y when we call foo().

Look at the nuanced difference in this example though:

def foo():
    x = 'I\'m a local variable.'
    print(x) # I'm a local variable.
    return x

foo()

print(x) # NameError: name 'x' is not defined

Even though we called the function foo(), its local variable x exists only inside the function, and this time we did not capture the return value in a variable outside it. Therefore, trying to print x will output NameError: name 'x' is not defined.
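One more nuance worth a sketch: a function can freely read a global variable, but assigning to that same name inside the function creates a new local variable rather than changing the global. (The function names here are made up for illustration.)

```python
x = 10

def read_global():
    return x + 1   # reading the global x is fine

def shadow_global():
    x = 99         # this assignment creates a LOCAL x
    return x

print(read_global())    # 11
print(shadow_global())  # 99
print(x)                # 10 -- the global x was never touched
```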

Practice Problems

List Comprehensions

At their core, list comprehensions are a short-cut for transforming lists into other lists. Essentially, you can iterate through a list using a condensed for-loop syntax. Till now, we've been fine using for loops to transform lists, but as your code gets more complicated, you'll be thankful for any short-cut!

Here's a one-to-one comparison of the general syntax for creating a list with a for loop versus a list comprehension. We'll use pseudo-code here for better initial context. These are the key elements to note in the list comprehension:

  • The square brackets, a signature of Python lists;
  • The for keyword, followed by an arbitrary variable to represent the list items
  • The in keyword, followed by a list variable
# for loop
<variable_for_values> = []
for <item> in <iterable>:
    <variable_for_values>.append(<expression>)

# list comprehension
<variable_for_values> = [<expression> for <item> in <iterable>]

The examples below also achieve the same outcome, but with actual code...

# for loop
squares = []
for x in range(8):
    squares.append(x*x)
print(squares) # [0, 1, 4, 9, 16, 25, 36, 49]

# list comprehension
squares = [x*x for x in range(8)]
print(squares) # [0, 1, 4, 9, 16, 25, 36, 49]

Incorporating Conditionals

Just like iterating through list items with a for loop, you might want to access only items adhering to one or more specific conditions. Let's walk through these use cases.

Modify a List's Existing Items

grades = [100, 33, 98, 76, 54, 98, 89, 49]
curved_grades = []

# for loop
for grade in grades:
  curved_grades.append(grade + 10)

print(curved_grades) # [110, 43, 108, 86, 64, 108, 99, 59]

# list comprehension
curved_grades2 = [(grade + 10) for grade in grades]

print(curved_grades2) # [110, 43, 108, 86, 64, 108, 99, 59]

Create a New List w. a Specific Subset of the Original List Items

grades = [100, 33, 98, 76, 54, 98, 89, 49]

# for loop
failing_grades = []
for grade in grades:
  if grade < 65:
    failing_grades.append(grade)
  
print(failing_grades) # [33, 54, 49]

# list comprehension
failing_grades = [grade for grade in grades if grade < 65]

print(failing_grades) # [33, 54, 49]
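You can also combine both patterns with a conditional expression, which transforms every item based on a condition instead of filtering items out. A sketch using the same grades list:

```python
grades = [100, 33, 98, 76, 54, 98, 89, 49]

# 'pass' for 65 and above, 'fail' otherwise. Note the if/else comes
# BEFORE the for when it modifies values rather than filtering them.
results = ['pass' if grade >= 65 else 'fail' for grade in grades]

print(results)  # ['pass', 'fail', 'pass', 'pass', 'fail', 'pass', 'pass', 'fail']
```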

Classes & Inheritance

We already know that Python is based on the concept of OOP, or Object-Oriented Programming. Almost everything in Python is an object -- even functions are objects! Classes, and their facilitation of inheritance, are one of the most important and valuable Python objects. In this section, we'll cover:

  • Class structure
  • Class attributes
  • Class methods
  • The __init__() method
  • The self keyword
  • Class vs. instance variables
  • Class instantiation
  • Inheritance and child classes

High-Level Overview

Creating & Structuring Classes

A class is essentially a data structure that serves as a blueprint for categorizing other objects and storing metadata about them. Once you have your "blueprint", you can create new instances of that class, which store unique metadata values.

Creating a class is similar to defining a function. You start with the class keyword and then specify a name for the class. Note that class names are generally the only objects that use the CamelCase naming convention. For example, if you were a zoologist, you might create a class called Animal. Each instance might represent a type of animal at your zoo.

# Define a class called Animal
class Animal:
    # attributes
    # methods
    # etc ...
    pass # placeholder so this empty skeleton is valid Python

# Create the most basic instance
chameleons = Animal()

Before we go into the details of thoroughly defining a class, let's isolate some basic elements and concepts to get a general understanding of them.

Attributes & Methods

Each piece of a class's metadata is called an attribute. Once you have your "blueprint", you can create new instances of that class, each of which stores unique attribute values. As a zoologist, you would want to define your Animal class so that it could store attributes of each type of animal at your zoo, such as species, natural habitat, etc.

class Animal:
    kingdom = 'Animalia' # attribute
    
    # some other code...

In addition to attributes, classes also contain custom methods. Methods are essentially functions that belong to the class. You can call a function without referencing any other object, but to call a method, you need to reference its class. Thus, all methods are functions, but not all functions are methods. We've already used some List methods like my_list.pop(), my_list.append(), my_list.insert(index, item), etc. When you create a class, you can define methods to serve as shortcuts for actions you might want to call frequently on instances of your class.

class Animal:
    # some other code...

    def method1(self): # method
        pass # some action would go here

Once you've defined attributes and methods, here's how you call them on your class instance:

chameleons = Animal() # Create the instance.

print(chameleons.kingdom) # 'Animalia'

chameleons.method1() # This completes the defined method operations.

Inheritance Basics

Classes can inherit attributes and methods from other classes according to a parent-child class hierarchy. Naturally, a child class inherits from a parent class. When you define a brand new class, Python 3 implicitly uses the generic, built-in object as the parent class. That means, whether we explicitly see it or not, every parent class is also the child class of its own parent class!

In the context of our zoo example, the different instances of Animal each store general information about a certain type of animal. Imagine you want to expand on an instance of Animal called elephants. In order to document information about each elephant at the zoo, you might create an Elephant class that inherits from your Animal class. To do so, you use this general syntax:

class Elephant(Animal):
    # attributes
    # methods
    # etc ...
    pass # placeholder so this skeleton is valid Python

Although the child class has access to everything defined for its parent class, the child class can also override or extend the parent class's traits and behavior. Note that this does NOT redefine the parent class. The new attributes and methods the child class declares apply only to instances of the child class. Parent class instances still adhere to the original parent class specs. For example:

class Animal:
    category = 'Animals'
    # etc ...

class Toucan(Animal):
    category = 'Birds'
    # etc ...

If you wanted, the Toucan class could simply inherit the category class attribute from its parent class Animal. In this case, every instance of Toucan would have the same value for category -- Animals. However, it makes sense that you'd want to differentiate further for the child class Toucan. To do that, you'd simply override category when you define Toucan by setting its value to Birds.
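Here's that override in action as a runnable sketch (the Snake class is a made-up extra, included only to show the inherit-by-default case):

```python
class Animal:
    category = 'Animals'

class Toucan(Animal):
    category = 'Birds'   # overrides the inherited class attribute

class Snake(Animal):
    pass                 # no override; inherits category from Animal

print(Animal.category)   # Animals
print(Toucan.category)   # Birds
print(Snake.category)    # Animals
```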


Class vs. Instance Variables

Now we can get to the good stuff! As you define attributes and methods for your class, keep in mind their scope. If you want a certain attribute or method to be shared by ALL instances of a class, define it as a class variable. If you instead want it to be unique to each instance, define it as an instance variable. Before we see this in context, we first have to understand the two most basic elements of every Python class...

The __init__() Method & the self Keyword

When you create a new instance of your Class, you might want it to exist in some default state. For example, you might want to initially assign default values for its attributes. In Python terms - when you instantiate a new instance object, you initialize it with pre-defined default values.

The __init__() method is where you give instructions for how you want each instance to exist in its initial state. Every time you instantiate a new instance object of your Class, you automatically invoke the __init__() method. That means when you create a new Class, the first thing you want to do is create its __init__() method. In general, the syntax looks like this:

class Animal:
    def __init__(self):
        pass # initialization instructions go here

Notice we used the same notation as we did for defining functions. The __init__() method must have at least one parameter: the self variable. The self variable serves as a reference to the current instance of the class, and it must be the first parameter of any method in a class, including the __init__() method.

NOTE! Any methods defined inside the __init__() method will NOT be called upon instantiation.

Class Definition Example 1 - Basic Elements in Context

Now that we've isolated each key component of classes, let's put everything together by completing the code for our zoology scenario. At the highest level, we define a class called Animal. The annotated code below illustrates how each key structural element we covered above fits into this task.

class Animal: # A.
    def __init__(self, species = '', diet= ''): # B. 
        self.species = species # C.
        self.diet = diet # C.

    kingdom = 'Animalia' # D.

    def my_kingdom(self):
        print(self.kingdom)

    def feed_me(self): # E.
        if self.diet == 'omnivore':
            food = 'plants and meat'
        elif self.diet == 'carnivore':
            food = 'meat'
        elif self.diet == 'herbivore':
            food = 'plants'
        print(f'{self.species} eat {food}!')
        return None

A. Animal is a child class of object as well as a potential parent class.

B. Every time we instantiate a new class object, the __init__() method will automatically be called to initialize the instance's values.

C. Each instance of the Animal class will store unique values for the instance attributes species and diet. By default these will be blank or Nonetypes, but each instance can have its own unique values for them.

D. ALL instances of the Animal class will have the kingdom class attribute with the value Animalia.

E. We can call instance methods my_kingdom and feed_me on ANY instance of the Animal class. Note! In my_kingdom, we access the class variable kingdom, but still reference it using self.
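To see those pieces interact, here's a usage sketch. The class is repeated (slightly trimmed) so the snippet stands alone, and the tigers instance is a hypothetical example:

```python
class Animal:
    def __init__(self, species='', diet=''):
        self.species = species   # instance attributes (C.)
        self.diet = diet

    kingdom = 'Animalia'         # class attribute (D.)

    def feed_me(self):           # instance method (E.)
        if self.diet == 'omnivore':
            food = 'plants and meat'
        elif self.diet == 'carnivore':
            food = 'meat'
        elif self.diet == 'herbivore':
            food = 'plants'
        print(f'{self.species} eat {food}!')

tigers = Animal('Tigers', 'carnivore')
print(tigers.kingdom)   # Animalia -- shared class attribute
print(tigers.species)   # Tigers   -- unique instance attribute
tigers.feed_me()        # Tigers eat meat!
```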

Class Definition Example 2 - Child Classes & Inheritance

Let's go into some more detail with a new child class for Animal. In the Elephant class below, we define __init__() method and its parameters, class attributes, and instance methods with the same syntax used for any class we might create. There are a few key differences annotated in the comments below.

class Elephant(Animal): # A.
    def __init__(self, name, genus = '', species = '', habitat = '', age = None): # B.
        self.name = name
        self.genus = genus
        self.species = species
        self.habitat = habitat
        self.age = age
        self.taxonomy = { # C.
            'Kingdom': Animal.kingdom,
            'Class': self.common_taxonomy['Class'],
            'Family': self.common_taxonomy['Family'],
            'Genus': self.genus,
            'Species': self.species,
        }

    diet = 'Herbivore' # D.

    common_taxonomy = {
    'Class': 'Mammalia',
    'Family': 'Elephantidae',
    }

    def summary(self):
      print(f'All about {self.name} -')
      print(f'Elephant, age {self.age}\nHabitat: {self.habitat}\nDiet: {self.diet}\n\nTaxonomy:')
      for k,v in self.taxonomy.items():
        print(f'{k}: {v}')

A. Declares Elephant as a child class of Animal by passing Animal as a parameter in the class definition.

B. Notice that even though taxonomy is not a parameter for the __init__() method, we can still define it as an instance attribute upon every instantiation.

C. If you look closely, you'll see that the values for taxonomy all come from different places.

  • Some of the taxonomy attributes are inherited from Animal; while
  • some are constant class attributes across all elephants; and
  • others are instance attributes unique to each elephant at the zoo.

This is a great opportunity to dissect the syntax for referencing attributes from different sources.

D. Here's a potential "gotcha". Remember that the Animal class also had an attribute called diet? Elephant does NOT inherit the diet attribute's value from Animal. Why? Two reasons:

  • First, Elephant defines diet as a class attribute for itself. This would supersede any variable called diet from the parent class.
  • Second, for Animal, diet is an instance attribute. Even if Elephant didn't define any type of attribute called diet for itself, a child class never inherits the instance attributes from its parent.
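Here's a condensed, made-up sketch of that "gotcha", with minimal stand-ins for both classes. Because Elephant's __init__() never calls Animal's, the parent's diet instance attribute is never set, so lookups fall through to the child's class attribute:

```python
class Animal:
    def __init__(self, diet=''):
        self.diet = diet        # instance attribute, set by Animal's __init__

class Elephant(Animal):
    diet = 'Herbivore'          # class attribute defined on the child

    def __init__(self, name):   # does NOT call Animal.__init__, so the
        self.name = name        # parent's diet instance attribute is never set

e = Elephant('Felicia')
print(e.diet)   # Herbivore -- resolved from the Elephant class attribute
```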

Class Instantiation & Modification

Now we'll create the first instance of the Elephant class. To do so, you would pass arguments for the __init__() parameters defined above. This automatically invokes the __init__() method and assigns the values of the arguments you passed to your new instance attributes. Note that the name argument is required, but the rest are optional. Their values will default to empty strings (or None, for age) if no argument for them is passed.

elephant1 = Elephant('Felicia', 'Elephas', 'Elephas maximus', '', 38)
# Notice we passed the default empty string for the habitat argument.

You can access or modify any instance attribute like so:

# Access
print(elephant1.name) # Felicia


# Add value for an empty attribute
print(elephant1.habitat) # empty string by default
elephant1.habitat = 'Asian forests'


# Update an existing attribute value
print(elephant1.age) # 38
elephant1.age = 39 # Update the value of the age attribute.
print(elephant1.age) # 39

# Define a new instance attribute, which will apply only to elephant1.
elephant1.weight_pounds = 6000

Finally, here's what happens when we call the summary() instance method:

elephant1.summary()

# Here's the output
"""
All about Felicia -
Elephant, age 38
Habitat: Asian forests
Diet: Herbivore

Taxonomy:
Kingdom: Animalia
Class: Mammalia
Family: Elephantidae
Genus: Elephas
Species: Elephas maximus
"""

Checking Class Values

In case someone who is not an expert zoologist like you needs to access the zoo's database of animals, that person could use the isinstance() function to determine whether an instance is also an instance of a certain parent class. For this example, imagine you have also defined another class called Toucan with the same input variables as our Elephant class.

# Is elephant1 an instance of Animal()?
print(isinstance(elephant1, Animal)) # True

# Is toucan1 an instance of Elephant()?
print(isinstance(toucan1, Elephant)) # False

Review of Classes & Inheritance

  • A class outlines a set of attributes and methods, which will help categorize other objects.
  • To add objects to the class, you declare them as an instance of that class.
  • Class variables store values belonging to ALL instances of a class, whereas instance variables store values unique to each instance.
  • The __init__() method is where you give instructions for how you want each instance to exist in its initial state. Every time you instantiate a new instance object of your Class, you automatically invoke the __init__() method.
  • The self variable serves as a reference to the current instance of the class, and it must be the first parameter of any method in a class, including the __init__() method.
  • Child classes can inherit attributes and methods from parent classes.
  • Child classes can also override parent attributes and behaviors without redefining the parent class.

Practice Problems

Data Science

The What, Why, & Who of Data Science

Whether or not they realize it, most people have come into contact with data science in their daily lives. We've seen trending articles on digital news outlets, personalized product recommendations from online stores, and advertisements that seemingly hear our every thought and conversation. But what exactly is data science?

WHAT

  • Acquiring, organizing, and delivering complex data
  • Building and deploying machine learning models
  • Conducting statistical analyses, including ANOVA, linear models, regression analysis, and hypothesis tests
  • Visualizing data distributions, hierarchical clustering, histograms, pie and bar charts, etc.

WHY

  • Identify hidden patterns, correlations, and outliers to glean meaningful insights.
  • Based on these insights, validate assumptions, make predictions, define optimizations, and most importantly make strategic decisions.

WHO

Professionals who practice data science for businesses, government institutions, nonprofits, and other organizations might have one of these titles:

  • Machine Learning Engineer:
    • Work in production code.
    • Identify machine learning applications.
    • Manage infrastructure and data pipelines
  • Data Engineer:
    • Create an architecture that facilitates data acquisition and machine learning problems at scale.
    • Focus on the algorithm and the analysis rather than the software.
  • Research Scientist:
    • Specialized research scientist focused on driving scientific discovery rather than pursuing industrial applications.
    • Backgrounds in both data science and computer science.
    • Determines new algorithmic optimizations, especially in the realm of AI.
  • Advanced Analyst:
    • Apply descriptive and inferential exploratory data analysis and modeling.

Effective data science lives at the intersection of mathematics, computer science, and domain expertise.

That's pretty broad though. What skills in each of these areas are needed for data science specifically? A good data scientist:

  • MATHEMATICS: Understands statistical concepts and modeling; proficient in R and/or Python
  • COMPUTER SCIENCE: Has experience in data engineering (i.e. organizing data, running models, visualizing results, etc.); proficient in R and/or Python
  • DOMAIN EXPERTISE: Understands the business and social context of issue and can ask questions that lead to appropriate approaches and insights

Real Data Science Applications

  1. Safer, smarter self-driving cars

    • Data from sensors, including radars, cameras and lasers, to create a map of its surroundings.
    • Create a map of its current surroundings such as proximity to other moving or stationary objects like other vehicles, traffic light signals, sirens, pedestrian crosswalk signals, etc.
    • Decisions like when to speed up/down, stop, turn, signal, etc.
  2. Pre-emptive code alerts in the ER

    • Data from heart monitors, pulse oximeter, arterial lines, ventilators, etc. hooked to patients
    • Find commonalities in biological health indicators preceding a code
    • Identify patients at risk of imminently coding to give doctors an early warning and increase chances of patient revival
  3. Natural disaster prediction

    • Data from ships, aircrafts, radars, satellites
    • Predict occurrences of natural disasters, the areas to be affected, and (where applicable) the path of the storm
    • Earlier predictions to maximize evacuation potential

The Data Science Lifecycle

Image Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/

The image above delineates the general steps you would take when you start a data science project. Of course, they're really guidelines because you have to let your results guide you. Sometimes you might skip a step, repeat certain steps, or restart the entire cycle when trying to answer a question. Let's talk through each step using this contextual example:

1) BUSINESS UNDERSTANDING

Data Science Wearables (DSW) is a retail store. DSW is interested in improving their human resource operations. Specifically, as a cost center in the business, this company wants to reduce their expenses associated with staffing the firm's in-store associates across the United States. You have a table of DSW current retail sales associates across department stores. These are some questions you have:

  • What drives up costs of staffing?
  • Is there an underlying reason for those costs?
  • What factors affect HR costs? How could we minimize these?
  • What hypothesis can we test to reduce costs?

Let's presume the key cost driver for this HR function is twofold - employees turning over early (low total years of service) and a high time to fill (positions going unfilled, causing productivity losses). Thus, we start by pursuing the goal of minimizing turnover.

2) DATA MINING

The first three rows of data look like this. Note that "time-to-fill" indicates how long it took to fill this person's role. Typically minimizing time-to-fill is key to lower costs.

| Job Level | Current Employee | Reason for Termination | Years of Service | Candidate Source | Previous Employer | School | Time to Fill (Days) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Associate | N | New offer | 1.5 | Referral | Jake's Hawaiian Shirts | NYU | 40 |
| Associate | Y | N/A | 2.0 | Internship | N/A | UCLA | 15 |
| Associate | No | Tardiness | 0.5 | Online | Hats and Caps | Boston College | 25 |

3) DATA CLEANING

The inconsistencies and N/A missing values you see above are incredibly common. In fact, this dataset is comparatively clean and apt for the task at hand. When we start working with Pandas, we will discuss how to handle N/A missing values and other ways to ensure data integrity.

4) DATA EXPLORATION

We already looked at the columns in this dataset, but now we want to gain a deeper understanding and create some meaning to help determine our path forward. To do so, we will look at descriptive statistics, probably starting with summary statistics for the different categories in the dataset.

  • Min & max years of service and their corresponding values
  • Means of each var
  • Frequency counts of each value in a var
  • Plot the distribution of values as a histogram. A histogram uses the frequency counts for a single var, where the values themselves appear on the x-axis and the frequency of each one appears on the y-axis. This helps us gain a quick visual understanding of variance, spread, and skew.

NOTE! Based on this, our original goal of minimizing turnover might change!
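Since this course builds toward pandas, here's a hedged sketch of what those summary statistics might look like with it. The three rows mirror the sample table above, and the column names are my own shorthand, not from an actual DSW dataset:

```python
import pandas as pd

# A tiny slice of the HR table from the data mining step.
df = pd.DataFrame({
    'years_of_service':  [1.5, 2.0, 0.5],
    'time_to_fill_days': [40, 15, 25],
    'candidate_source':  ['Referral', 'Internship', 'Online'],
})

print(df['years_of_service'].min())            # 0.5
print(df['years_of_service'].max())            # 2.0
print(df.mean(numeric_only=True))              # mean of each numeric var
print(df['candidate_source'].value_counts())   # frequency counts per value
# df['time_to_fill_days'].hist()  # histogram (requires matplotlib)
```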

5) FEATURE ENGINEERING

This step is where we transition from merely describing and summarizing the data to manipulating and analyzing it. This step always starts with the same question - What else do you want to know about the dataset? The answers to this usually pertain to some pre-existing assumption, ostensible relationships (or lack thereof), unexpected values, or anomalies, which you want to investigate further. In our example with DSW employees, here are some pathways we might choose to follow:

  • We previously assumed the relationship between Time-to-Fill and Years of Service is negative. Is this true? How strong is this negative correlation? If it's significantly and consistently strong, we might choose to use this as hiring criteria going forward. To determine this, we would conduct a statistical correlation analysis.

  • We could repeat the statistical correlation analysis with any pair of variables we think show potential for significant correlation (such as school and application source). But time is money, and we need to choose where to start intelligently! To do this, we might want to visualize the relationships between pairs of variables. In statistics, we often start by creating a scatterplot with a trendline because it allows us to immediately see the spread of data points and how far they are from the trendline.

  • In more complex situations, we might conduct regression analysis to determine the potential for accurately predicting values for Years of Service based on Time-to-Fill values. We could use this to justify building a machine learning model to generate a predictive algorithm.
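As a sketch of that first pathway, a correlation analysis in pandas might look like the following. The numbers are invented for illustration, assuming we had more rows than the three shown earlier:

```python
import pandas as pd

# Hypothetical sample of employee records.
df = pd.DataFrame({
    'time_to_fill_days': [40, 15, 25, 55, 10, 35],
    'years_of_service':  [1.5, 2.0, 0.5, 0.3, 3.1, 1.0],
})

# Pearson correlation coefficient between the two variables;
# a value near -1 would support the assumed negative relationship.
r = df['time_to_fill_days'].corr(df['years_of_service'])
print(f'correlation: {r:.2f}')
```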

NOTE! It is common for this step to reinforce and revisit the prior step as we discover anomalies or intriguing relationships.

6) PREDICTIVE MODELING

This is where the magic happens. We won't get into the details of machine learning here. However, the model you create for any data science project will be the core source of insights and conclusions. Once you have results, it's time to dig in and think outside the box! Ask yourself questions like:

  • How do our results compare to our initial hypothesis?
  • How statistically significant (i.e. accurate) are our predictions?
  • Do we have enough information to draw decisive conclusions? If so, what are they?
  • Based on our conclusions, what concrete actions do we recommend?

Remember that your results might not be sufficient after only one iteration. They might point you in the right direction, but they won't necessarily answer all your questions sufficiently. You'll probably have to repeat parts of the cycle several times before you can confidently draw conclusions and make recommendations.

7) DATA VISUALIZATION

This final step is so important, we're going to give it its own section...

Visualizations & Data Storytelling

The single most important takeaway from this walk-through is this: the value of your results depends directly on how well key stakeholders understand them! Data science is valuable because of the insights we can discover using it. You can have all the mathematical evidence in the world for those insights, but your stakeholders have to understand their contextual significance and believe they can turn them into strategic, impactful business actions. Otherwise, what value do those insights have?

Now, a data scientist might not present results to clients or high-level managers, but you do need to be able to explain results to team members who are not expert data scientists.

This is where the ubiquitous buzz phrase data storytelling comes into play. The goal of data storytelling is to convey your message in a way that provokes thoughts and ideas, inspires questions, encourages conversation and brainstorming, and ultimately, ignites action. All this boils down to two core pillars:

  1. Honing a cohesive narrative that establishes a thesis
  2. Highlighting meaningful key metrics as evidence to support that thesis

Data visualization is key to this endeavor because it's the easiest way to distill heaps and mounds of numerical data into a clear message. As the saying goes, a picture is worth a thousand words!

Tips for Quality Data Viz

  • Focus the message on a central theme. Ensure your visualizations aid the progression of that message appropriately.
    • Display the visualization at the appropriate point in your story.
    • If you have more than one visualization in view at a time, position each one contextually, according to natural reading eye movement.
  • Do not use color for decorative or non-informational purposes. It should be used to highlight key metrics or data points that help support your message.
  • Most importantly, avoid visual clutter like the plague!
    • Eliminate the legend if removing it will not detract from understanding.
    • Where you have long, vertical x-axis labels, try flipping the chart if possible.
    • Remove excessive boxes or lines that separate data.
    • Don't graph too many variables in one chart. For instance, ten lines on one chart will be too convoluted to follow!

HOWEVER, there's always one exception! Generally, "less is more" surpasses everything else in importance except for "consider your audience". You always want to minimize the amount of text on your visualization, but "the minimum" differs based on how much context your audience has. Ultimately, you need to make sure every viewer has enough context to be grounded in the appropriate frame of reference.

If you want, you can browse through many more tips on Data to Viz's "Caveats" page.

DISCUSSION: Extrapolating Population in the Past & Future

How Many People Have Ever Lived on Earth?, a study from the U.S. Population Reference Bureau (PRB).

How Many People Have Ever Lived on Earth? Table 2. Snapshot of Population History

  • Number of people ever born -- 108,470,690,115
  • World population in mid-2017 -- 7,536,000,000
  • Percent of those ever born who are living in 2017 -- 6.9%
  • "Any estimate of the total number of people who have ever lived depends essentially on two factors: the length of time humans are thought to have been on Earth and the average size of the human population at different periods...Guesstimating the number of people ever born, then, requires selecting population sizes for different points from antiquity to the present and applying assumed birth rates to each period."
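The PRB's "percent living" figure can be sanity-checked directly from the other two numbers in the table:

```python
# Figures from PRB's "How Many People Have Ever Lived on Earth?" Table 2.
ever_born = 108_470_690_115
alive_2017 = 7_536_000_000

# Percent of those ever born who were living in mid-2017.
pct_living = alive_2017 / ever_born * 100
print(round(pct_living, 1))  # 6.9
```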

Population Pyramid

This project focuses on predicting future population growth. It's compiled from various sources - primarily the United Nations, Department of Economic and Social Affairs, Population Division. Their interactive population pyramid tool is a great example of an informative and compelling data visualization.

Python Tools for Data Science

Now that we understand the process we'll follow, we can jump into applying it with our Python skills. First, we have to set up our environments and ensure we have all the tools we need to conduct a thorough data science analysis. We won't use all of these in this introductory class, but these are the most common across the industry.

  • NumPy for computational operations on large multi-dimensional arrays and matrices
  • Pandas for data structuring, manipulation, and analysis
  • Matplotlib & Seaborn for data visualization
  • Scikit-learn for machine learning
  • Scrapy for data wrangling via web scraping
  • Jupyter Notebooks & Jupyter Lab for data science integrated development environments (IDEs)

Intro to Pandas Objects

Pandas is an open-source Python library of data structures and tools for exploratory data analysis (EDA). Pandas primarily facilitates the acquisition, cleaning, formatting, and manipulation of data. Used in tandem with NumPy, Matplotlib, SciPy, and other Python libraries, Pandas is an integral part of practicing data science.

To gain some baseline familiarity with Pandas features and prerequisites, in this lesson you'll learn about:

Capabilities of Pandas

  • Robust IO tools for reading from flat files (CSV and TXT), JSON, XML, Excel files, SQL tables, and other databases.
  • Inserting and deleting columns in DataFrame and higher dimensional objects
  • Handling missing data in both floating point and non-floating point data sets
  • Merging & joining datasets
  • Reshaping and pivoting datasets
  • Conditional data sorting and filtering
  • Iterating through data sets
  • Aggregating and transforming data sets with split-apply-combine operations from the group by engine
  • Automatic and explicit aligning and manipulating of high-dimensional data structures via hierarchical labeling and axis indexing
  • Subsetting, fancy indexing, and label-based slicing large data sets
  • Time-series functionality such as date range generation, date shifting, lagging, frequency conversions, moving window statistics, and moving window linear regressions.
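As one concrete taste of the split-apply-combine capability mentioned above, here's a minimal sketch. The dataset is made up for illustration:

```python
import pandas as pd

# A small illustrative dataset (values invented for this example).
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'units':  [10, 20, 30, 40],
})

# Split rows by region, apply a sum to each group, combine into one result.
totals = sales.groupby('region')['units'].sum()
print(totals)  # East: 40, West: 60
```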

NumPy ndarray Objects

Because Pandas is built on top of NumPy, new users should first understand one NumPy data object that often appears within Pandas objects - the ndarray.

An ndarray, or N-dimensional array, is a data type from the NumPy library. Ndarrays are deceptively similar to the more general Python list type we've been working with. An ndarray type is a group of elements, which can be accessed and updated using a zero-based index. Sounds exactly like a list, right? You can create and print an ndarray exactly like a list. You can even create an ndarray from a list like this:

import numpy as np

listA = [1, 2, 3]
arrayA = np.array([1, 2, 3])
print(listA) # [1, 2, 3]
print(arrayA) # [1 2 3]

listB = ['a', 'b', 'c']
arrayB = np.array(listB)
print(listB) # ['a', 'b', 'c']
print(arrayB) # ['a' 'b' 'c']

However, there are several important differences to remember:

First, all ndarrays are homogeneous. All elements in an ndarray must be the same data type (e.g. integers, floats, strings, booleans). If you pass in data that is not homogeneous, the .array() function will coerce every element to a common data type. Side note - notice that ndarrays print without commas.

import numpy as np

arrayC = np.array([1, 'b', True])
print(arrayC) # ['1' 'b' 'True']

arrayD = np.array([1, False])
print(arrayD) # [1 0]

Second, the .array() method has a parameter called ndmin, which allows you to specify the number of dimensions you want for your array when you create it. Here are three key takeaways from the examples below.

  • Notice how each dimension prints on its own line, so the ndarray looks more like a grid than a single list.
  • arrayE1 and arrayE2 below are identical. This illustrates that the ndmin parameter is optional. In other words, you can directly pass in multi-dimensional data without having to enter an argument for it.
  • arrayF throws an error because it's missing one vital piece of syntax that arrayE1 has. Do you see it? The first parameter in the .array() method is the object (i.e. the values you want contained in the array). When you pass values for multiple dimensions of the array object into the .array() method, you separate them with commas. You have to make sure you group the dimensions and their values into a single group by adding () around them. If you don't, the .array() method might mistake the second dimension and its values for the second parameter of the .array() method.
import numpy as np

arrayE1 = np.array(([1, 2, 3], [4, 5, 6]))
print(arrayE1)
"""
[[1 2 3]
 [4 5 6]]
"""

arrayE2 = np.array(([1, 2, 3], [4, 5, 6]), ndmin = 2)
print(arrayE2)
"""
[[1 2 3]
 [4 5 6]]
"""

arrayF = np.array([1, 2, 3], [4, 5, 6])
print(arrayF) # Error

The third, and most important, difference between an array and a list is that ndarrays are designed to handle vectorized operations, while a Python list is not. In other words, if you apply a function to an ndarray object, the program will perform that function on each item in the array individually. If you apply a function to a list, it operates on the list object as a whole. As a bonus, these vectorization capabilities also allow ndarrays to take up less memory space and run faster.

import numpy as np

listG = [1, 2, 3]
arrayG = np.array(listG)

print(arrayG + 2) # [3 4 5]
print(listG + 2) # Error

Creating Random & Range ndarrays

There are a handful of other ways to create ndarrays, including random generation...

import numpy as np

# Create an array of 5 random integers between 50 (inclusive) and 100 (exclusive), drawn from a uniform distribution.
rand_array1 = np.random.randint(50,  100,  5)
print(rand_array1) # [54 86 91 61 90]

# Create a matrix of 2 rows and 3 columns, with all values between 0 and 1 (uniformly distributed).
rand_array2 = np.random.rand(2, 3)
print(rand_array2)
"""
[[0.11298458 0.49065597 0.14219546]
 [0.27545168 0.87526704 0.93213146]]
"""

# Create a matrix of 2 rows and 3 columns, with values drawn from a standard normal distribution (mean 0, standard deviation 1).
rand_array3 = np.random.randn(2, 3)
print(rand_array3)
"""
[[-0.24525306  1.9082735   0.55667231]
 [-1.17418436  0.12749887 -1.47157527]]
"""

...and via the .arange() method. This method takes the start point of the array, the end point, and (optionally) the step size. Remember that the end point is exclusive, so the array stops before reaching it.

import numpy as np

range_array = np.arange(2, 8, 2)
print(range_array) # [2 4 6]

Basic Pandas Objects: Index

We know about the concept of an index from basic Python lists. Well, Pandas considers Index to be its own class of objects because you can customize an index in Pandas. As formally defined in the Pandas docs, an index object is an "immutable ndarray implementing an ordered, sliceable set" which is the default object for "storing axis labels for all pandas objects".

Basic Pandas Objects: Series

A Series is a 1-D array of data just like the Python list datatype we've been working with, but it's a bit more flexible. Some notable characteristics include:

  • A Series is like a dict in that you can get and set values by index label.
  • A Pandas Series acts very similarly to a NumPy ndarray:
    • Just like ndarrays, looping through a Series value-by-value is usually not necessary because of its capability to handle vectorized operations.
  • The Pandas Series does have some distinct differences from an ndarray:
    • A Series can only have one dimension.
    • Operations between Series automatically align the data based on index label.
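The automatic index alignment described above is easy to see with a small example. The labels and values here are invented for illustration:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by index label, not by position.
# Labels that appear in only one Series produce NaN.
print(s1 + s2)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```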

Here's the general syntax for creating a Series:

import numpy as np
import pandas as pd

s = pd.Series(data, index=index, dtype=dtype)
  • The data parameter can intake a Python dict, an ndarray, or a scalar value (like 5, 7.5, True, or 'a').
  • By default, the index parameter assigns a zero-based index to each element in data, much like a regular Python list. Again though, you can pass custom index values to a Series to serve as axis labels for your data. Note that Pandas DOES support non-unique index values.
  • dtype specifies the type of data you're passing into your Series. If you leave this blank, the program will infer the dtype from the contents of the data parameter.

Using this syntax, you can instantiate a Series from a single scalar value, a list, an ndarray, or a dict. Note: If data is an ndarray, index must be the same length as data.

import numpy as np
import pandas as pd
import random

# From a single scalar value
s_scalar = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
"""
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
"""

# From a list
s_list = pd.Series(['red','orange','yellow','green','blue','purple'])
"""
0       red
1    orange
2    yellow
3     green
4      blue
5    purple
"""

# From an ndarray
s_ndarray = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s_ndarray)
"""
a   -0.901847
b    0.503150
c    2.060891
d   -0.367695
e    1.040442
"""

# From a dict
d = {'b': 1, 'a': 0, 'c': 2}
s_dict = pd.Series(d)
"""
b    1
a    0
c    2
"""

Basic Pandas Objects: DataFrames

A DataFrame is a two-dimensional data matrix that stores data much like a spreadsheet does. It has labeled columns and rows with values for each column. Basically, it's a virtual spreadsheet. It accepts many different data types as values, including strings, arrays (lists), dicts, Series, and even other DataFrames. The general syntax for creating a DataFrame is identical to that of a Series, except it includes a second index-like parameter, columns, for labeling the second dimension:

import numpy as np
import pandas as pd

df = pd.DataFrame(data, index, columns)

Creating a DataFrame is a little more complex than creating a Series because you have to consider both rows and columns. Aside from creating a dataframe indirectly by importing an existing data structure, you can create a DataFrame by:

  • Specifying column names (i.e. column index values) directly within the data parameter
  • Specifying column names separately in the columns parameter
import numpy as np
import pandas as pd

# Specify values for each column.
df = pd.DataFrame(
{"a" : [4 ,5, 6],
"b" : [7, 8, 9],
"c" : [10, 11, 12]},
index = [1, 2, 3])

# Specify values for each row.
df = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[1, 2, 3],
columns=['a', 'b', 'c'])


# Both of these methods create a DataFrame with these values:
"""
   a   b   c
1  4   7   10
2  5   8   11
3  6   9   12
"""

Here are a few other examples:

import numpy as np
import pandas as pd

# From dict of Series or dicts
data1 = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(data1, index=['d', 'b', 'a'], columns=['two', 'three'])
"""
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
"""

# From dict of ndarrays / lists
data2 = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
df2 = pd.DataFrame(data2, index=['a', 'b', 'c', 'd'])
"""
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
"""

# From a list of dicts
data3 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df3 = pd.DataFrame(data3, index=['first', 'second'], columns=['a', 'b', 'c'])
"""
        a   b     c
first   1   2   NaN
second  5  10  20.0
"""

Setting Up Your First Data Science Project

Before we dive into analysis, we have to make sure we set up a stable, organized environment. For our lesson on Pandas we'll be using this dataset:

Wine Reviews | Kaggle -- 130k wine reviews with variety, location, winery, price, & description

Instead of complicating things with a specialized Data Science IDE, we're going to start simple -- working locally, straight in the terminal. We'll walk through how to spin this up together, step by step:

1) On your Desktop, create a new folder called "WineReviews". In here, we want to split up our code files from our raw data files to keep things organized.

2) Within this parent directory, create an empty "main.py" file.

3) Now, create another folder called "raw_data". Drag the winemag-data-130k.csv file into it.

4) Go back to the main.py file. In practice, when we go to run the main.py file in terminal, the code we'll write here will open the csv file and give the program access to its full contents.

import numpy as np 
import pandas as pd

# Read the csv file
wine_reviews = pd.read_csv('raw_data/winemag-data-130k.csv')

First, notice that the standard is to import numpy and pandas into your python program as np and pd. Second, when you write the command to open the file, make sure you put the file name in quotes and reference the path to its location in the project directory.

5) Open up your terminal and cd into our project's parent directory.

cd ~/Desktop/WineReviews

6) Create your virtual environment

python3 -m venv .env

7) Activate the virtual environment.

source .env/bin/activate

8) Install Pandas.

pip install pandas

There are a couple salient points to mention here:

  • Remember that we installed Python 3 in our high-level system environment, but you don't want to do that with project-specific libraries. Installing them globally can cause version conflicts if different projects depend on different iterations of the same library.
  • For the WineReviews project, you will only have to install Pandas once. Every time you reactivate this project's virtual environment, it will have it there.
  • Having NumPy installed is a prerequisite for using Pandas. However, installing Pandas automatically installs NumPy. That's why we don't have to call pip install numpy explicitly.

9) Run the main.py file to make sure the code works!

python3 main.py

NOTE! Reading Files

We've just finished preparing our first dataset for analysis. This one was in .CSV format, but we also learned above that Pandas can handle many different file types. To open each of these in pandas we use a slightly customized version of the general method pd.read_<filetype>(<file_name>). Look here for a quick summary of commands for handling different file types in Pandas.
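To see the pd.read_<filetype> pattern in action across formats, here's a small sketch. It reads from in-memory strings (via io.StringIO) instead of real files so that it runs standalone; with actual data you would pass a file path. The sample contents are invented:

```python
import io
import pandas as pd

# The same pd.read_<filetype> pattern works across file formats.
# In-memory strings stand in for files here; the data is made up.
csv_data = io.StringIO("name,score\nalice,90\nbob,85")
df_csv = pd.read_csv(csv_data)

json_data = io.StringIO('[{"name": "alice", "score": 90}, {"name": "bob", "score": 85}]')
df_json = pd.read_json(json_data)

print(df_csv.shape)   # (2, 2)
print(df_json.shape)  # (2, 2)
```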

Exploratory Data Analysis w/🐼🐼

For today's lesson, we will leverage Pandas for exploratory data analysis (EDA). We will use Pandas to investigate, wrangle, munge, and clean our data.

In particular, we will examine how Pandas can be used to:

  • Investigate a dataset's integrity
  • Filter, sort, and manipulate a DataFrame's series

Additionally, the end portion of this section contains a glossary of methods and attributes provided by Pandas to handle data wrangling, selection, cleaning and organizing.

Data sets

Adventureworks Cycles

Our core focus will be using a dataset developed by Microsoft for training purposes in SQL Server, known as the AdventureWorks Cycles 2014 OLTP database.

  • It is based on a fictitious company called Adventure Works Cycles (AWC), a multinational manufacturer and seller of bicycles and accessories.
  • The company is based in Bothell, Washington, USA and has regional sales offices in several countries.
  • We will be looking at a single table from this database, the Production.Product table, which outlines some of the products this company sells.

Loading the Data

We can load our data as follows:

import pandas as pd
import numpy as np

prod = pd.read_csv('raw_data/production.product.tsv', sep='\t')

Note the sep='\t'; this is because we are pulling in a tsv (tab-separated values) file, which is basically a csv file but with tabs as delimiters instead of commas.

YOU DO: Download the tsv file to your local machine, create a Python virtual environment, and run the code above on your machine.

Data dictionary

Every good dataset has a data dictionary. Essentially, it lists each field in the data and provides a contextual description. It serves as a good frame of reference for anyone not diving directly into the data.

cols = prod.columns
for idx, col in enumerate(cols):
  print(idx+1, col)
    1 ProductID
    2 Name
    3 ProductNumber
    4 MakeFlag
    5 FinishedGoodsFlag
    6 Color
    7 SafetyStockLevel
    8 ReorderPoint
    9 StandardCost
    10 ListPrice
    11 Size
    12 SizeUnitMeasureCode
    13 WeightUnitMeasureCode
    14 Weight
    15 DaysToManufacture
    16 ProductLine
    17 Class
    18 Style
    19 ProductSubcategoryID
    20 ProductModelID
    21 SellStartDate
    22 SellEndDate
    23 DiscontinuedDate
    24 rowguid
    25 ModifiedDate

Reading data

prod.head(1)

The head method lets us read in the first n rows of a dataset. Run this on your machine; you should expect to see:

   ProductID             Name ProductNumber  MakeFlag  ...  SellEndDate DiscontinuedDate                                 rowguid                   ModifiedDate
0          1  Adjustable Race       AR-5381         0  ...          NaN              NaN  {694215B7-08F7-4C0D-ACB1-D734BA44C0C8}  2014-02-08 10:01:36.827000000

[1 rows x 25 columns]
  • YOU DO: Run the above code in your machine, but with n=5. What do you see?
  • YOU DO: What kind of object is prod? Run type(prod) and report back your findings.
  • YOU DO: What is the shape of this dataframe? Run prod.shape to find out.

DataFrame subsets

This dataset is comprehensive! Let's see how we might be able to select a subset of this data for easier analysis.

Let's start with only 3 rows for now:

prod_subset = prod.head(3)
   ProductID             Name ProductNumber  MakeFlag  ...  SellEndDate DiscontinuedDate                                 rowguid                   ModifiedDate
0          1  Adjustable Race       AR-5381         0  ...          NaN              NaN  {694215B7-08F7-4C0D-ACB1-D734BA44C0C8}  2014-02-08 10:01:36.827000000
1          2     Bearing Ball       BA-8327         0  ...          NaN              NaN  {58AE3C20-4F3A-4749-A7D4-D568806CC537}  2014-02-08 10:01:36.827000000
2          3  BB Ball Bearing       BE-2349         1  ...          NaN              NaN  {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E}  2014-02-08 10:01:36.827000000

[3 rows x 25 columns]

If we wanted to only pull in a few columns, we could do something like:

two_cols = prod_subset[['ProductID', 'Name']]
print(two_cols)
   ProductID             Name
0          1  Adjustable Race
1          2     Bearing Ball
2          3  BB Ball Bearing
  • YOU DO: Grab the first 5 rows of the dataset and save a subset df with the following columns: ProductID, Name, Color, and ListPrice.

Column headers and datatypes

We can leverage pandas to explore the column header names and associated datatypes of the headers as well.

print(prod.columns)
Index(['ProductID', 'Name', 'ProductNumber', 'MakeFlag', 'FinishedGoodsFlag',
       'Color', 'SafetyStockLevel', 'ReorderPoint', 'StandardCost',
       'ListPrice', 'Size', 'SizeUnitMeasureCode', 'WeightUnitMeasureCode',
       'Weight', 'DaysToManufacture', 'ProductLine', 'Class', 'Style',
       'ProductSubcategoryID', 'ProductModelID', 'SellStartDate',
       'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate'],
      dtype='object')

If we wanted to view the columns and their types, we can do:

prod.dtypes
ProductID                  int64
Name                      object
ProductNumber             object
MakeFlag                   int64
FinishedGoodsFlag          int64
Color                     object
SafetyStockLevel           int64
ReorderPoint               int64
StandardCost             float64
ListPrice                float64
Size                      object
SizeUnitMeasureCode       object
WeightUnitMeasureCode     object
Weight                   float64
DaysToManufacture          int64
ProductLine               object
Class                     object
Style                     object
ProductSubcategoryID     float64
ProductModelID           float64
SellStartDate             object
SellEndDate               object
DiscontinuedDate         float64
rowguid                   object
ModifiedDate              object
  • YOU DO: What kind of python object is the prod.dtypes? How do you know?
  • YOU DO: How does pandas know the col datatypes? Don't code this, but how might you implement this feature in pure python?

Column Selection

IMPORTANT: depending on the number of square brackets used, selecting a column may return a Series object or a DataFrame object. Depending on your use case, you may want one or the other!

Consider the following:

prod['Name'].head(3)
type(prod['Name'].head(3))
0    Adjustable Race
1       Bearing Ball
2    BB Ball Bearing
Name: Name, dtype: object
<class 'pandas.core.series.Series'>

vs

prod[["Name"]].head(3)
type(prod[['Name']].head(3))
              Name
0  Adjustable Race
1     Bearing Ball
2  BB Ball Bearing
<class 'pandas.core.frame.DataFrame'>
  • YOU DO: Select Name and ProductID columns from our Dataframe. Is this possible to do as a Series? Why or why not?

Renaming Columns

We can rename columns as needed, like so:

new_prod = prod.rename(columns={'Name': 'ProductName', 'ProductNumber':'Number'}, inplace=False).head(3)

A few things to note here:

  • inplace: a boolean; True mutates the original dataframe, while False (the default) returns a new one
  • {'Name': 'ProductName'}: we may use this as a way to map a new col name to an existing one
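The inplace distinction is easy to see with a tiny example. The DataFrame below is a one-row stand-in for the product data:

```python
import pandas as pd

# A one-row stand-in for the product table, for illustration only.
df = pd.DataFrame({'Name': ['Adjustable Race'], 'ProductNumber': ['AR-5381']})

# inplace=False (the default) returns a NEW DataFrame; the original is untouched.
renamed = df.rename(columns={'Name': 'ProductName'})
print(df.columns.tolist())       # ['Name', 'ProductNumber']
print(renamed.columns.tolist())  # ['ProductName', 'ProductNumber']

# inplace=True mutates the original and returns None.
df.rename(columns={'Name': 'ProductName'}, inplace=True)
print(df.columns.tolist())       # ['ProductName', 'ProductNumber']
```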

REMEMBER: we can view all the columns of a dataframe with:

prod.columns

What is the datatype of this attribute?

type(prod.columns)
<class 'pandas.core.indexes.base.Index'>

The Index is an immutable ndarray implementing an ordered, sliceable set. It is the basic object storing axis labels for all pandas objects. Think of it as a 'row address' for your data frame (table). We can cast this Index to another type - say, a list.

list(prod.columns)

Now, we can do something like:

cols_list = list(prod.columns)
cols_list[0] = 'New Col'
prod.columns = cols_list
  • YOU DO: What will the code above do? Run it and report back.
  • YOU DO: Select the first three rows under New Col and return it as a dataframe.
  • YOU DO: First, copy prod to prod_cpy (look at references below to see how to copy a dataframe). Then, rename the columns above, but inplace meaning prod_cpy itself must be mutated.

Basic Stats on Columns

Five Number Summary (assumes numeric data):

  • Min: The smallest value in the column
  • Max: The largest value in the column
  • Quartile: A quartile is one fourth of our data
    • First quartile: This is the 25th percentile - the bottom 25 percent of values fall at or below it
    • Median: The middle value. (Line up all values from smallest to largest - the median is the middle one!) Also the 50th percentile
    • Third quartile: This is the 75th percentile of our data

The describe method allows us to achieve this with pandas:

# note - by default, describe only includes numeric datatypes
prod[['MakeFlag', 'SafetyStockLevel', 'StandardCost']].describe()

If we were to select cols as series, we could run additional Series object methods:

# show the most popular product colors (aggregated by count, descending by default)
prod['Color'].value_counts()
Black           93
Silver          43
Red             38
Yellow          36
Blue            26
Multi            8
Silver/Black     7
White            4
Grey             1
Name: Color, dtype: int64
  • YOU DO: Leveraging the unique Series method, print out the unique colors for this product.
  • YOU DO: Leveraging the nunique Series method, print out how many distinct colors are available.
  • YOU DO: Leveraging the dropna keyword arg of the nunique Series method, print out how many distinct colors are available including NULL values.

Filtering

Filtering and sorting are key processes that allow us to drill into the 'nitty gritty' and cross sections of our dataset.

To filter, we use a process called Boolean Filtering, wherein we define a Boolean condition, and use that Boolean condition to filter our DataFrame.

Recall: our given dataset has a column Color. Let's see if we can find all products that are Black. Let's take a look at the first 10 rows of the dataframe to see how it looks as-is:

colors = prod['Color'].head(10)
ProductID
1         NaN
2         NaN
3         NaN
4         NaN
316       NaN
317     Black
318     Black
319     Black
320    Silver
321    Silver
Name: Color, dtype: object

To find only the "Black" colored items, we can:

prod['Color'].head(10) == 'Black'
ProductID
1      False
2      False
3      False
4      False
316    False
317     True
318     True
319     True
320    False
321    False
Name: Color, dtype: bool
  • YOU DO: Without using the unique/nunique methods from above, can you apply an additional filter to the series above to determine how many Black colored products exist?

We can apply this filtering to our Dataframes as well, in a more interesting manner:

prod[prod['Color'] == 'Black'].head(3)
   ProductID         Name ProductNumber  MakeFlag  ...  SellEndDate DiscontinuedDate                                 rowguid                   ModifiedDate
5        317  LL Crankarm       CA-5965         0  ...          NaN              NaN  {3C9D10B7-A6B2-4774-9963-C19DCEE72FEA}  2014-02-08 10:01:36.827000000
6        318  ML Crankarm       CA-6738         0  ...          NaN              NaN  {EABB9A92-FA07-4EAB-8955-F0517B4A4CA7}  2014-02-08 10:01:36.827000000
7        319  HL Crankarm       CA-7457         0  ...          NaN              NaN  {7D3FD384-4F29-484B-86FA-4206E276FE58}  2014-02-08 10:01:36.827000000

[3 rows x 25 columns]
  • YOU DO: Slice the dataframe above and select only the Color column - is there any non black color items?
  • YOU DO: calculate the average ListPrice for the salable products (hint: use the FinishedGoodsFlag column to determine "salability") using the Series.mean() method
  • YOU DO: calculate the above again, but this time use describe and pull the mean from there.

Compound Filtering

Let's filter on multiple conditions. Before, we filtered on rows where Color was Black. We also filtered where FinishedGoodsFlag was equal to 1. Let's see what happens when we filter on both simultaneously.

The format for multiple conditions is:

df[ (df['col1'] == value1) & (df['col2'] == value2) ]

Or, more simply:

df[ (CONDITION 1) & (CONDITION 2) ]

Which eventually may evaluate to something like:

df[ True & False ]

...on a row-by-row basis. If the end result is False, the row is omitted.

Don't forget parentheses in your conditions!! This is a common mistake.

prod[ (prod['Color'] == 'Black') & (prod['FinishedGoodsFlag'] == 1) ].head(3)
     ProductID                       Name ProductNumber  MakeFlag  ...  SellEndDate DiscontinuedDate                                 rowguid                   ModifiedDate
209        680  HL Road Frame - Black, 58    FR-R92B-58         1  ...          NaN              NaN  {43DD68D6-14A4-461F-9069-55309D90EA7E}  2014-02-08 10:01:36.827000000
212        708    Sport-100 Helmet, Black       HL-U509         0  ...          NaN              NaN  {A25A44FB-C2DE-4268-958F-110B8D7621E2}  2014-02-08 10:01:36.827000000
226        722  LL Road Frame - Black, 58    FR-R38B-58         1  ...          NaN              NaN  {2140F256-F705-4D67-975D-32DE03265838}  2014-02-08 10:01:36.827000000

[3 rows x 25 columns]

Another example:

# Here we have an example of a list price of greater than 50, 
# OR a product size that is not equal to 'XL'

prod[ (prod['ListPrice'] > 50) | (prod['Size'] != 'XL') ].head(3)
   ProductID             Name ProductNumber  MakeFlag  ...  SellEndDate DiscontinuedDate                                 rowguid                   ModifiedDate
0          1  Adjustable Race       AR-5381         0  ...          NaN              NaN  {694215B7-08F7-4C0D-ACB1-D734BA44C0C8}  2014-02-08 10:01:36.827000000
1          2     Bearing Ball       BA-8327         0  ...          NaN              NaN  {58AE3C20-4F3A-4749-A7D4-D568806CC537}  2014-02-08 10:01:36.827000000
2          3  BB Ball Bearing       BE-2349         1  ...          NaN              NaN  {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E}  2014-02-08 10:01:36.827000000

[3 rows x 25 columns]
  • YOU DO: Find all rows that have a NULL Color and are NOT finished goods. HINT: use pd.isna

Sorting

Here's how we can sort a dataframe

prod.sort_values(by='StandardCost', ascending=False).head(3)
     ProductID              Name ProductNumber  MakeFlag  ...          SellEndDate DiscontinuedDate                                 rowguid                   ModifiedDate
253        749  Road-150 Red, 62    BK-R93R-62         1  ...  2012-05-29 00:00:00              NaN  {BC621E1F-2553-4FDC-B22E-5E44A9003569}  2014-02-08 10:01:36.827000000
254        750  Road-150 Red, 44    BK-R93R-44         1  ...  2012-05-29 00:00:00              NaN  {C19E1136-5DA4-4B40-8758-54A85D7EA494}  2014-02-08 10:01:36.827000000
255        751  Road-150 Red, 48    BK-R93R-48         1  ...  2012-05-29 00:00:00              NaN  {D10B7CC1-455E-435B-A08F-EC5B1C5776E9}  2014-02-08 10:01:36.827000000

[3 rows x 25 columns]

This one is a little more advanced, but it demonstrates a few things:

  • Conversion of a numpy.ndarray object (return type of pd.Series.unique()) into a pd.Series object
  • pd.Series.sort_values with the by= kwarg omitted (when sorting a single Series, by= doesn't need to be specified)
  • Alphabetical sort of a string field, ascending=True means A->Z
  • Inclusion of nulls, NaN in a string field (versus omission with a float/int as prior example)
pd.Series(prod['Color'].unique()).sort_values(ascending=True)
1           Black
5            Blue
8            Grey
6           Multi
3             Red
2          Silver
9    Silver/Black
4           White
7          Yellow
0             NaN
dtype: object
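If you'd rather have the nulls on top, sort_values also takes a na_position kwarg. A quick sketch on a throwaway Series (hypothetical colors):

```python
import pandas as pd
import numpy as np

# throwaway Series standing in for the unique colors
colors = pd.Series(['Black', np.nan, 'Blue', 'Red'])

# by default, sort_values pushes nulls to the end;
# na_position='first' puts them on top instead
first_nulls = colors.sort_values(ascending=True, na_position='first')
```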

A few final YOU DOs

  • YOU DO: Create a variable called rows and a variable called cols. Store the num rows and cols in dataframe into these variables, respectively
  • YOU DO: Print out the number of unique product lines that exist in this data set
  • YOU DO: Print out the values of these product lines, DROP NULLS
  • YOU DO: Using shape and a dataframe filter, print out how many R productlines exist.
  • Challenge: What are the top 3 most expensive list price product that are either in the Women's Mountain category, OR Silver in Color? Return your answer as a DataFrame object, with NewName relabeled as Name, and ListPrice columns. Perform the statement in one execution, and do not mutate the source DataFrame.

Recap


# basic DataFrame operations
df.head()
df.tail()
df.shape
df.columns
df.index

# selecting columns
df.column_name
df['column_name']

# renaming columns
df.rename(columns={'old_name':'new_name'}, inplace=True)
df.columns = ['new_column_a', 'new_column_b']

# notable columns operations
df.describe() # five number summary
df['col1'].nunique() # number of unique values
df['col1'].value_counts() # number of occurrences of each value in column

# filtering
df[ df['col1'] < 50 ] # filter column to be less than 50
df[ (df['col1'] == value1) & (df['col2'] > value2) ] # filter column where col1 is equal to value1 AND col2 is greater than value2

# sorting
df.sort_values(by='column_name', ascending = False) # sort biggest to smallest

🐼 🐼 🐼

DataFrame Reference

Please find below a list of useful dataframe properties and methods for use in your exploratory data analysis practice.

Wrangling Data

Given the following dataset:

wine_reviews = pd.read_csv('raw_data/winemag-data-130k.csv')

After your initial import of some dataset, you'll want to do a gut check to make sure everything is in place. Here are the kind of very basic properties you might want to check:

  • df.info() -- returns index, datatype and memory information
  • df.shape -- returns the number of rows and columns in a data frame
  • len(obj) -- returns # of rows in the object data (*S & df)
  • obj.size -- returns # of elements in the object (*S & df)
  • df.index -- returns index of the rows specifically (*S & df)
  • df.columns -- returns the column labels of the DataFrame.
  • df.head(n) -- returns first n rows of a data frame
  • df.tail(n) -- returns last n rows of a data frame
  • obj.copy() -- create a deep copy of the object (*S & df)
  • obj.empty -- returns booleans for whether object is empty or not
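A quick gut-check sketch of these properties on a toy frame (hypothetical data standing in for wine_reviews; the same checks apply to any freshly loaded frame):

```python
import pandas as pd

# toy stand-in for a freshly imported dataset
toy = pd.DataFrame({'points': [87, 90, 85],
                    'country': ['Italy', 'US', 'Spain']})

shape = toy.shape          # (rows, columns)
n_rows = len(toy)          # number of rows
n_cells = toy.size         # rows * columns
col_names = list(toy.columns)
is_empty = toy.empty       # False, since we have rows
```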

Selecting Data

Single Values

  • df.loc[row_label, col_label] -- select a single item in a DataFrame by its row and column labels
  • df.loc[start_row_label : end_row_label, start_col_label : end_col_label] -- select a slice of a DataFrame by starting and ending row/column labels
  • df.iloc[row_index,:] -- select a row in a DataFrame by index position
  • s.iloc[index] -- select a single item by its position
  • s.loc[label] -- select an item (or labeled slice) from a Series by label

Subsetting & Slicing

  • obj.get(key) -- returns an item from an object (e.g. a column from a DataFrame, a value from a Series, etc.)
  • df[col] -- select and name a column and return it as a Series
  • df.loc[[label1, label2, ...]] -- select one or more rows or columns in a DataFrame by label
  • df[[col1, col2]] -- select and name multiple columns and return them as a new data frame
  • df.nlargest(n, key) -- Select and order top n entries.
  • df.nsmallest(n, key) -- Select and order bottom n entries
  • obj.where(cond, other = NaN, inplace = False, axis = None) -- replace values in the object where the condition is False (S or df)
  • df.iloc[row_index, col_index] -- select a single item in a DataFrame by the index position of its row and col
  • df.iloc[start_index : end_index, start_index : end_index] -- select a slice of a DataFrame by starting and ending index row/column positions; (ending index stop at index before it)
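A sketch of the label/position distinction on a toy frame with string row labels (hypothetical data):

```python
import pandas as pd

# string row labels make the label-vs-position split visible
df = pd.DataFrame({'price': [10, 20, 30], 'qty': [1, 2, 3]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b', 'price']               # single item by labels
by_pos = df.iloc[1, 0]                        # same item by integer positions
label_slice = df.loc['a':'b', 'price':'qty']  # label slices include the endpoint
pos_slice = df.iloc[0:2, 0:2]                 # position slices stop before it
```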

Cleaning & Organizing Data

Editing Existing Data

  • obj.truncate(before, after, axis) -- truncate an object before and after some index value (S & df)
  • df.drop(columns=[col1, col2, ...]) -- drops specified columns from the dataframe
  • s.replace(1,'one') -- replace all values equal to 1 with 'one'
  • s.replace([1,3],['one','three']) -- replace all values equal to 1 with 'one' and all values equal to 3 with 'three'
  • df.rename(columns={'old_name': 'new_ name'}) -- rename specific columns
  • df.set_index(keys) -- change the index of the data frame
  • df.reset_index() -- reset index of DataFrame to row numbers, moving the index to columns.
  • shift([periods, freq, axis, fill_value]) -- Shift index by desired number of periods with an optional time freq.
  • df.set_axis(labels)

Null Values

  • pd.isnull() -- checks for null (NaN) values in the data and returns an array of booleans, where "True" means missing and "False" means present
  • pd.notnull() -- returns all values that are NOT null
  • pd.isnull().sum() -- returns a count of null (NaN)
  • df.dropna() -- Drops all rows that contain null values and returns a new df
  • df.dropna(axis=1) -- Drops all columns that contain null values and returns a new df
  • df.dropna(subset=[col1]) -- Drops all rows that contain null values in one or more specific columns and returns a new df
  • df.fillna(value=x) —- replace all missing values with some value x (S & df)
  • s.fillna(s.mean()) -- Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
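A sketch of the null-handling methods above on a throwaway frame (hypothetical values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1.0, np.nan, 3.0],
                   'col2': ['x', 'y', None]})

null_counts = df.isnull().sum()            # NaN count per column
no_null_rows = df.dropna()                 # drop rows with any NaN
col1_only = df.dropna(subset=['col1'])     # drop rows where col1 is NaN
filled = df.fillna(value={'col1': df['col1'].mean()})  # impute with the mean
```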

Duplicate Values

  • df.duplicated([subset, keep]) -- return boolean Series denoting duplicate rows; can choose to consider a subset of columns
  • drop_duplicates([subset, keep, inplace]) -- returns DataFrame with duplicate rows removed, optionally only considering certain columns.
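A sketch of the duplicate helpers on a throwaway frame:

```python
import pandas as pd

# the first two rows are identical copies
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

dupe_mask = df.duplicated()      # True only for the second copy of a row
deduped = df.drop_duplicates()   # keeps the first copy by default
```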

Sorting

  • df.sort_values(col1) -- sort values in a certain column in ascending order
  • df.sort_index(axis=1) -- sort columns by their labels in ascending order
  • df.sort_values(col2,ascending=False) -- sort values in a certain column in descending order
  • df.sort_index(axis=1, ascending=False) -- sort columns by their labels in descending order
  • df.sort_values([col1,col2],ascending=[True,False]) -- sort values in col1 in ascending order, then sort values in col2 in descending order
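A sketch of the sorting variants on a throwaway frame (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 2], 'col2': [5, 9, 7]})

asc = df.sort_values('col1')                    # smallest first
desc = df.sort_values('col2', ascending=False)  # biggest first
mixed = df.sort_values(['col1', 'col2'],
                       ascending=[True, False]) # per-column direction
```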

Pandas Analysis II

In this lesson, we'll continue exploring Pandas for EDA. Specifically:

  • Identify and handle missing values with Pandas.
  • Implement groupby statements for specific segmented analysis.
  • Use apply functions to clean data with Pandas.

Data sets

  • Adventureworks Cycles | Local

    • You can download a version of the Adventureworks Cycles dataset directly from this Github Repo
  • OMDB Movies | Local

    • You can download a version of the OMDB Movies dataset directly from this Github Repo

Let's continue with the AdventureWorks Cycles Dataset

Here's the Production.Product table data dictionary, which is a description of the fields (columns) in the table (the .csv file we will import below):

  • ProductID - Primary key for Product records.
  • Name - Name of the product.
  • ProductNumber - Unique product identification number.
  • MakeFlag - 0 = Product is purchased, 1 = Product is manufactured in-house.
  • FinishedGoodsFlag - 0 = Product is not a salable item. 1 = Product is salable.
  • Color - Product color.
  • SafetyStockLevel - Minimum inventory quantity.
  • ReorderPoint - Inventory level that triggers a purchase order or work order.
  • StandardCost - Standard cost of the product.
  • ListPrice - Selling price.
  • Size - Product size.
  • SizeUnitMeasureCode - Unit of measure for the Size column.
  • WeightUnitMeasureCode - Unit of measure for the Weight column.
  • DaysToManufacture - Number of days required to manufacture the product.
  • ProductLine - R = Road, M = Mountain, T = Touring, S = Standard
  • Class - H = High, M = Medium, L = Low
  • Style - W = Womens, M = Mens, U = Universal
  • ProductSubcategoryID - Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID.
  • ProductModelID - Product is a member of this product model. Foreign key to ProductModel.ProductModelID.
  • SellStartDate - Date the product was available for sale.
  • SellEndDate - Date the product was no longer available for sale.
  • DiscontinuedDate - Date the product was discontinued.
  • rowguid - ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample.
  • ModifiedDate - Date and time the record was last updated.

Loading the Data

We can load our data as follows:

import pandas as pd
import numpy as np

prod = pd.read_csv('raw_data/production.product.tsv', sep='\t')

Note the sep='\t'; this is because we are pulling in a tsv file, which is basically a csv file but with tabs as delimiters vs commas.
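If you don't have the file handy yet, you can sketch the same idea with a tiny tab-separated table built in memory (hypothetical rows, not the real product table):

```python
import pandas as pd
from io import StringIO

# a tiny tab-separated "file" in memory, standing in for the real .tsv
tsv_text = "ProductID\tName\tColor\n1\tAdjustable Race\t\n2\tHL Road Frame\tBlack\n"

# sep='\t' tells read_csv the delimiter is a tab, not a comma
prod_demo = pd.read_csv(StringIO(tsv_text), sep='\t')
```

Note the empty Color field in the first row comes through as NaN.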

YOU DO: Download the tsv file into your local machine, create a python virtualenv and run the code above, but on your machine.

Handling missing data

Recall missing data is a systemic, challenging problem for data scientists. Imagine conducting a poll, but some of the data gets lost, or you run out of budget and can't complete it! 😮

"Handling missing data" is itself a broad topic. We'll focus on two components:

  • Using Pandas to identify that we have missing data
  • Strategies for filling in missing data with Pandas (known in the business as imputing)

Identifying missing data

Before handling missing data, we must identify that we're missing data at all!

We have a few ways to explore missing data, and they are reminiscent of our Boolean filters.

# True when data isn't missing
prod.notnull().head(3)
# True when data is missing
prod.isnull().head(3)

OUTPUT: notnull

   ProductID  Name  ProductNumber  MakeFlag  FinishedGoodsFlag  Color  ...  ProductModelID  SellStartDate  SellEndDate  DiscontinuedDate  rowguid  ModifiedDate
0       True  True           True      True               True  False  ...           False           True        False             False     True          True
1       True  True           True      True               True  False  ...           False           True        False             False     True          True
2       True  True           True      True               True  False  ...           False           True        False             False     True          True

[3 rows x 25 columns]

OUTPUT: isnull

   ProductID   Name  ProductNumber  MakeFlag  FinishedGoodsFlag  Color  ...  ProductModelID  SellStartDate  SellEndDate  DiscontinuedDate  rowguid  ModifiedDate
0      False  False          False     False              False   True  ...            True          False         True              True    False         False
1      False  False          False     False              False   True  ...            True          False         True              True    False         False
2      False  False          False     False              False   True  ...            True          False         True              True    False         False

[3 rows x 25 columns]
  • YOU DO: count the number of nulls in Name column
  • YOU DO: count the number of notnulls in Name column

We can also access missing data in aggregate, as follows:

# here is a quick and dirty way to do it
prod.isnull().sum()
Name                       0
ProductNumber              0
MakeFlag                   0
FinishedGoodsFlag          0
Color                    248
SafetyStockLevel           0
ReorderPoint               0
StandardCost               0
ListPrice                  0
Size                     293
SizeUnitMeasureCode      328
WeightUnitMeasureCode    299
Weight                   299
DaysToManufacture          0
ProductLine              226
Class                    257
Style                    293
ProductSubcategoryID     209
ProductModelID           209
SellStartDate              0
SellEndDate              406
DiscontinuedDate         504
rowguid                    0
ModifiedDate               0
dtype: int64
  • YOU DO: Wrap the result from above in a dataframe. Sort it so the column with the most missing data is on top and the column with the least missing data is on the bottom.

Filling in missing data

How we fill in data depends largely on why it is missing (types of missingness) and what sampling we have available to us.

We may:

  • Delete missing data altogether
  • Fill in missing data with:
    • The average of the column
    • The median of the column
    • A predicted amount based on other factors
  • Collect more data:
    • Resample the population
    • Followup with the authority providing data that is missing
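The mean and median fills can be sketched on a throwaway numeric Series (hypothetical values):

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 40.0, np.nan, 10.0])

mean_filled = s.fillna(s.mean())      # fill gaps with the average (20.0)
median_filled = s.fillna(s.median())  # fill gaps with the median (10.0)
```

Notice the two strategies can give quite different results when the data is skewed.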

In our case, let's focus on handling missing values in Color. Let's get a count of the unique values in that column. We will need to use the dropna=False kwarg, otherwise the pd.Series.value_counts() method will not count NaN (null) values.

prod['Color'].value_counts(dropna=False)
NaN             248
Black            93
Silver           43
Red              38
Yellow           36
Blue             26
Multi             8
Silver/Black      7
White             4
Grey              1
Name: Color, dtype: int64

We have 248 null values for Colors!

Deleting missing data

To delete the null values, we can:

prod.dropna(subset=['Color']).head(3)

This will remove all rows with a NaN value in the Color column.

Filling in missing data

We can fill in the missing data with a sensible default, for instance:

prod.fillna(value={'Color': 'NoColor'})

This will swap all NaN values in Color column with NoColor.

We can swap the Color column's null values with essentially anything we want - for instance:

prod.fillna(value={'Color': prod['ListPrice'].mean() })
  • YOU DO: Run the code above. What will it do? Does it make sense for this column? Why or why not?

Breather / Practice

  • YOU DO: Copy the prod dataframe, call it prod_productline_sanitized
  • YOU DO: In prod_productline_sanitized drop all NA values from the ProductLine column, inplace
  • YOU DO: Copy the prod dataframe, call it prod_productline_sanitized2
  • YOU DO: In prod_productline_sanitized2, fill all NA values with boolean False

Groupby Statements

In Pandas, groupby statements are similar to pivot tables in that they allow us to segment our population to a specific subset.

For example, if we want to know the average number of bottles sold and pack sizes per city, a groupby statement would make this task much more straightforward.

To think about how a groupby statement works, think of it like this:

  • Split: separate our DataFrame into groups by a specific attribute, for example, group by Color
  • Apply: compute some aggregated metric on each group, such as the sum, count, or max
  • Combine: put our DataFrame back together and return the result
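Here's that idea in miniature, on a throwaway frame (hypothetical colors and prices, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({'Color': ['Black', 'Red', 'Black'],
                    'ListPrice': [100, 50, 200]})

# split rows by Color, apply a sum to each group, combine into one result
per_color = toy.groupby('Color')['ListPrice'].sum()
```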

Let's group by Color, and get a count of products for each color.

prod.groupby('Color')

Notice how this doesn't actually do anything - or at least, does not print anything.

Things get more interesting when we start using methods such as count:

prod.groupby('Color').count().head(5)

It is worth noting that count only counts non-null values; the only way to make groupby().count() acknowledge null values is to fill them first with fillna or something to that effect.

Let's do something a tad more interesting:

prod[['Color', 'ListPrice']].groupby('Color').max().sort_values('ListPrice', ascending=False)
  • YOU DO: Run this code in your machine. What does it do?
  • YOU DO: instead of max, find the min ListPrice by Color
  • YOU DO: instead of min, find the mean ListPrice by Color
  • YOU DO: instead of mean, find the count of ListPrice by Color

We can also do multi-level groupbys. This is referred to as a MultiIndex dataframe. Here, we can see the following fields in a nested group by, with a count of Name (with nulls filled!), effectively giving us a count of the number of products for every unique Class/Style combination:

  • Class - H = High, M = Medium, L = Low
  • Style - W = Womens, M = Mens, U = Universal
prod.fillna(value={'Name': 'x'}).groupby(by=['Class', 'Style']).count()[['Name']]
             Name
Class Style
H     U        64
L     U        68
M     U        22
      W        22
  • YOU DO: groupby MakeFlag and FinishedGoodsFlag and return counts of ListPrice

We can also use the .agg() method with multiple arguments, to simulate a .describe() method like we used before:

prod.groupby('Color')['ListPrice'].agg(['count', 'mean', 'min', 'max'])
              count         mean     min      max
Color
Black            93   725.121075    0.00  3374.99
Blue             26   923.679231   34.99  2384.07
Grey              1   125.000000  125.00   125.00
Multi             8    59.865000    8.99    89.99
Red              38  1401.950000   34.99  3578.27
Silver           43   850.305349    0.00  3399.99
Silver/Black      7    64.018571   40.49    80.99
White             4     9.245000    8.99     9.50
Yellow           36   959.091389   53.99  2384.07
  • YOU DO: groupby MakeFlag and FinishedGoodsFlag and return agg of ListPrice by ['count', 'mean', 'min', 'max'].
  • YOU DO: do the results from above make sense? print out the dataframe of MakeFlag, FinishedGoodsFlag and ListPrice to see if they do or not.

Apply Functions

Apply functions allow us to perform a complex operation across an entire column or row highly efficiently.

For example, let's say we want to change our colors from a word, to just a single letter. How would we do that?

The first step is writing a function whose argument is the value we receive from each cell in the column. This function transforms the input and returns the result, which can then be assigned back to the source dataframe (if desired).

def color_to_letter(col):
    if  pd.isna(col['Color']):
        return 'N'

    return col['Color'][0].upper()

prod[['Color']].apply(color_to_letter, axis=1).head(10)
0    N
1    N
2    N
3    N
4    N
5    B
6    B
7    B
8    S
9    S
Name: Color, dtype: object

The axis=1 refers to a row operation. Consider the following:

df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
   A  B
0  4  9
1  4  9
2  4  9

Using apply functions, we can do:

df.apply(np.sqrt)

which would give us:

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

We can also apply to either axis, 1 for rows and 0 for columns.

  • YOU DO: using np.sum as apply function, run along rows of df above.
  • YOU DO: using np.sum as apply function, run along columns of df above.

Wrap up

We've covered even more useful information! Here are the key takeaways:

  • Missing data comes in many shapes and sizes. Before deciding how to handle it, we identify it exists. We then derive how the missingness is affecting our dataset, and make a determination about how to fill in values.
# pro tip for identifying missing data
df.isnull().sum()
  • Groupby statements are particularly useful for a subsection-of-interest analysis. Specifically, zooming in on one condition, and determining relevant statistics.
# group by
df.groupby('column').agg(['count', 'mean', 'max', 'min'])
  • Apply functions help us clean values across an entire DataFrame column. They are like a for loop for cleaning, but many times more efficient. They follow a common pattern:
  1. Write a function that works on a single value
  2. Test that function on a single value
  3. Apply that function to a whole column
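The three-step pattern above in miniature, using a hypothetical currency-string column (not from our datasets):

```python
import pandas as pd

# 1. write a function that works on a single value
def strip_currency(value):
    return float(value.replace('$', '').replace(',', ''))

# 2. test that function on a single value
assert strip_currency('$1,234.50') == 1234.5

# 3. apply it to a whole (hypothetical) column
prices = pd.Series(['$1,234.50', '$99.00'])
clean = prices.apply(strip_currency)
```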

OMDB Movies

  1. Import the data CSV as dataframe (See above for link to dataset)
  2. Print first 5 rows
  3. Print out the num rows and cols in the dataset
  4. Print out column names
  5. Print out the column data types
  6. How many unique genres are available in the dataset?
  7. How many movies are available per genre?
  8. What are the top 5 R-rated movies? (hint: Boolean filters needed! Then sorting!)
  9. What is the average Rotten Tomatoes score for all available films?
  10. Same question as above, but for the top 5 films
  11. What is the Five Number Summary like for top rated films as per IMDB?
  12. Find the ratio between Rotten Tomato rating vs IMDB rating for all films. Update the dataframe to include a Ratings Ratio column (inplace).
  13. Find the top 3 ratings ratio movies (rated higher on IMDB compared to Rotten Tomatoes)

Pandas Reference

At a high level, this section will cover:

Joining & Concatenating

  • df1.append(df2) -- add the rows of df2 to the end of df1 (columns should be identical)
  • pd.concat([df1, df2], axis=1) -- add the columns of df2 to the end of df1 (rows should be identical)
  • df1.join(df2, on=col1, how='inner') -- SQL-style join of the columns in df1 with the columns of df2 where the rows for col1 have identical values. how can be one of: 'left', 'right', 'outer', 'inner'
  • df1.merge(df2) -- merge two datasets into one by aligning the rows from each based on common attributes or columns. how can be one of: 'left', 'right', 'outer', 'inner'
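A sketch of merge on two tiny hypothetical frames sharing a ProductID column:

```python
import pandas as pd

# two tiny hypothetical frames sharing a ProductID key
products = pd.DataFrame({'ProductID': [1, 2], 'Name': ['Race', 'Frame']})
prices = pd.DataFrame({'ProductID': [1, 2], 'ListPrice': [9.99, 1431.50]})

# inner merge aligns rows on the shared key
merged = products.merge(prices, on='ProductID', how='inner')
```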

Reshaping

  • df.transform(func[, axis]) -- return DataFrame with transformed values
  • df.transpose(*args, **kwargs) -- transpose rows and columns
  • df.rank() -- rank every variable according to its value
  • pd.melt(df) -- gathers columns into rows
  • df.pivot(columns='var', values='val') -- spreads rows into columns
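A sketch of melt and pivot round-tripping a tiny hypothetical wide table (one column per year):

```python
import pandas as pd

wide = pd.DataFrame({'Name': ['Race', 'Frame'],
                     '2013': [10, 20],
                     '2014': [30, 40]})

# melt gathers the year columns into rows...
long = pd.melt(wide, id_vars='Name', var_name='year', value_name='sales')

# ...and pivot spreads them back out into columns
back = long.pivot(index='Name', columns='year', values='sales')
```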

Grouping w. GroupBy Objects

  • df.groupby(col) -- returns groupby object for values from a single, specific column
  • df.groupby([col1,col2]) -- returns a groupby object for values from multiple columns, which you can specify

Filtering

Descriptive Statistics

  • df[col1].unique() -- returns an ndarray of the distinct values within a given series
  • df[col1].nunique() -- return # of unique values within a column
  • .value_counts() -- returns count of each unique value
  • df.sample(frac = 0.5) - randomly select a fraction of rows of a DataFrame
  • df.sample(n=10) - randomly select n rows of a DataFrame
  • mean() -- mean
  • median() -- median
  • min() -- minimum
  • max() -- maximum
  • quantile(x) -- quantile
  • var() -- variance
  • std() -- standard deviation
  • mad() -- mean absolute deviation
  • skew() -- skewness of distribution
  • sem() -- unbiased standard error of the mean
  • kurt() -- kurtosis
  • cov() -- covariance
  • corr() -- Pearson Correlation coefficent
  • autocorr() -- autocorrelation
  • diff() -- first discrete difference
  • cumsum() -- cumulative sum
  • cumprod() -- cumulative product
  • cummin() -- cumulative minimum

Data Visualization with Pandas & Matplotlib

In this section, we'll go over example code for different types of common visualizations.

Objectives

  • Describe why data visualization is important for communicating results.
  • Identify how to select the correct visualization to use based on the data being presented.
  • Identify characteristics to clearly communicate through data visualizations.

How Do we Make Sense of a Data Set?

We're only looking at 1/3 of this data set! While all the data we need is here, it is difficult to make sense of and draw any meaning from.

So What Is Data Visualization?

  • A quick, easy way to convey concepts that come from large data sets.
  • We can use these charts, graphs, or illustrations to visualize large amounts of complex data.

Criteria for Crafting a Good Visualization

Visualizations should follow three (plus one) rules. They should be:

  1. Simplified
  2. Easy to Interpret
  3. Clearly Labeled
  4. (Bonus) Interactive

How Do you Choose the Right Chart Type?

With so many chart types, it can be difficult to know how best to display your data.

When creating a visualization, first think about the variables you are showing (words, categories, numbers, etc.), the volume of data, and the central point you are hoping to communicate through your visualization.

When to Use a Bar Chart

Bar charts are one of the simplest and most frequently used chart types. They are useful for illustrating either one string or one numeric variable, quickly comparing information, or showing exact values.

When thinking about using a bar chart consider:

  • Will you use vertical or horizontal bars?
  • How will you number your axis (it is always best to start at zero)?
  • How will you order your bars?

The Pie Chart in Action

As you can see from this example pie charts can be effective for proportions or percentages.

When to Use the Pie Chart Type

Pie charts are commonly misused. They show a part-to-whole relationship when the total amount is one of your variables and you'd like to show its subdivisions.

When thinking about using a pie chart consider:

  • The more variables you have, as in the more slices of your pie you'll have, the harder it is to read.
  • Area is very difficult for the eye to read, so if any of your wedges are similarly sized think about a different chart type.
  • If you want to compare data, leave it to bars or stacked bars. If your viewer has to work to translate pie wedges into relevant data or compare pie charts to one another, the key points you're trying to convey might go unnoticed.

The Scatter Plot in Action

This scatter plot uses a combination of text, coloring, and labelling to describe the data. What is clear or unclear from this chart about the data set?

When to Use a Scatter Plot

Scatterplots are great for data dense visualizations and clusters. They are most effective for trends, concentrations, and outliers. They can be especially useful to see what you want to investigate further.

When thinking about using a scatter plot consider:

  • This chart type is not as common, so it can be more difficult for an audience to read.
  • If dots are covering up each other, consider a different chart type.
  • A bubble chart is one variation on the scatter plot.
  • Scatter plots are a great way to give you a sense of trends, concentrations, and outliers, and are great to use while exploring your data. This will provide a clear idea of what you may want to investigate further.

Knowledge Check: Choosing a Chart

Annual sales in each state for a grocery store chain?

  • Bar chart.
  • Pie chart.
  • Scatterplot.

When to Use a Histogram

  • Effective for distribution across groups.

  • Histograms are useful when you want to see how your data are distributed across groups. Important: histograms are not the same as bar charts! Histograms look similar to bar charts, but with bar charts, each column represents a group defined by a categorical variable; with histograms, each column represents a group defined by a continuous, quantitative variable.
  • One implication of this distinction: with a histogram, it can be appropriate to talk about the tendency of the observations to fall more on the low end or the high end of the X axis.
  • With bar charts, however, the X axis does not have a low end or a high end; because the labels on the X axis are categorical - not quantitative.

Bar Chart vs Histogram

The main difference between a bar chart and histogram is that histograms are used to show distributions of variables while bar charts are used to compare variables.

Which type of chart?

Relationship of average income to education level?

  • Bar chart.
  • Pie chart.
  • Scatterplot.
  • Histogram.

A Line Chart in Action

Line graphs are an excellent way to show change over time. While bar charts can also show time, they don't show it in a continuous way like a line chart.

When to Use a Line Chart

Line charts are particularly good at showing how a variable changes over time. They work best if you have one date variable and one numeric variable.

When thinking about using a line chart consider:

  • How many lines you'll need on your graph, the more overlapping lines there are, the harder your chart will be to read.
  • Consider how many colors you need to use for your lines. Giving each line its own color forces the viewer to scan back and forth from the key to the graph.
  • Individual data points can be hard to read, but line charts are good for showing overall trends.
  • Similar to bar charts, try to start your y axis at 0.

Knowledge Check: Which type of chart?

Change in average income since 1960 for American adults?

  • Bar chart.
  • Pie chart.
  • Scatterplot.
  • Line chart.
  • Histogram.

Returning to How to Choose the Right Chart

Check out this series of charts: https://i.redd.it/e7alp8yrnb711.png

  • Which makes the data easiest to view?

It's subjective! There are pros and cons to each. Choosing a chart type depends firstly on the data you have. Secondly, it depends on the clearest way to convey your message. The alignment of these two aspects will help you decide what type of visualization to use.

Charts & Code

There is an increasing array of libraries and tools that allow us to use code to visualize data in compelling and approachable ways.

Check out this complex chart that was made using Python!

Source: u/dx034 on Reddit

Group Activity: Exploring Good Visualizations

Get in small groups of 2-3.

Go to https://www.reddit.com/r/dataisbeautiful/top/. These are all data visualizations created by people like you!

Pick one that you think is particularly good and one that is particularly bad. Why? What are the characteristics of each?

Visual Attributes of Good Data Visualization

Some attributes affect our brain more strongly.

In order of focus:

  • Position
  • Color
  • Size

Summary

  • The chart type you select should accurately represent the variables you are pulling from data in a way that is clearly readable for your audience.
  • Visual considerations include: position, color, order, size. What else?
  • With data visualizations becoming increasingly popular, a clean and clear chart goes a long way in conveying a message from a data set.

Lab

Solution

Resources

Course Review

Data Structures

Lists

colors = ['red', 'yellow', 'green'] #strings
grades = [100, 99, 65, 54, 19] #numbers
bools = [True, False, True, True] #booleans
grades = [100, 99, 65, 54, 19]
grades[0] # 100
len(grades) # 5
sum(grades) # 337

ascending = sorted(grades) # [19, 54, 65, 99, 100]
descending = sorted(grades, reverse=True) # [100, 99, 65, 54, 19]
# UPDATE
my_class = ['Brandi', 'Zoe', 'Steve', 'Dayton', 'Dasha', 'Sonyl']
my_class[3] = "Aleksander"
# changes in place to ['Brandi', 'Zoe', 'Steve', 'Aleksander', 'Dasha', 'Sonyl']

# REMOVE
student_that_left = my_class.pop() # Sonyl
    # or
student_that_left = my_class.pop(3) # Steve
print(my_class) # ['Brandi', 'Zoe', 'Aleksander', 'Dasha']


# ADD
new_students = ["Raju", "Chloe"]
my_class.extend(new_students)
# changes in place to ['Brandi', 'Zoe', 'Aleksander', 'Dasha', 'Raju', 'Chloe']

my_class.insert(1, "Phoebe")
# changes in place to ['Brandi', 'Phoebe', 'Zoe', 'Aleksander', 'Dasha', 'Raju', 'Chloe']
# JOIN
words = ['this', 'is', 'fun']
sentence = ' '.join(words)
print(sentence) # 'this is fun'

words = ['this', 'is', 'fun']
sentence = ' '.join(words)
print(f'{sentence}.') # 'this is fun.'

# SPLIT
person = 'Sandra,hi@email.com,646-212-1234,8 Cherry Lane,Splitsville,FL,58028'
contact_info = person.split(',')
print(contact_info) # ['Sandra', 'hi@email.com', '646-212-1234', '8 Cherry Lane', 'Splitsville', 'FL', '58028']

Dicts

Creating Dicts:

names = ['Taq', 'Zola', 'Valerie', 'Valerie']
scores = [[98, 89, 92, 94], [86, 45, 98, 100], [100, 100, 100, 100], [76, 79, 80, 82]]

grades = dict(zip(names, scores))
# Dict keys must be unique, so the second 'Valerie' overwrites the first
print(grades) # {'Taq': [98, 89, 92, 94], 'Zola': [86, 45, 98, 100], 'Valerie': [76, 79, 80, 82]}

Accessing Dict Data:

state_capitals = {
    'NY': 'Albany',
    'NJ': 'Trenton',
    'CT': 'Hartford',
    'MA': 'Boston',
    'CA': 'Sacramento'
}

MAcap = state_capitals['MA'] # Boston
print(state_capitals.get('PA', []))
# PA is not in our dict, so .get() returns []

state_capitals.keys()
# dict_keys(['NY', 'NJ', 'CT', 'MA', 'CA'])

state_capitals.values()
# dict_values(['Albany', 'Trenton', 'Hartford', 'Boston', 'Sacramento'])

state_capitals.items()
# dict_items([('NY', 'Albany'), ('NJ', 'Trenton'), ('CT', 'Hartford'), ('MA', 'Boston'), ('CA', 'Sacramento')])

more_states = {
    'WA': 'Olympia',
    'OR': 'Salem',
    'AZ': 'Phoenix',
    'GA': 'Atlanta'
}

# Add or update group of key/value pairs
state_capitals.update(more_states)

# Remove item by key
state_capitals.pop('AZ', [])

Control Flow

Conditionals

speed_limit = 65
my_speed = 32

my_speed < speed_limit # True
my_speed > speed_limit # False
my_speed <= speed_limit # True
my_speed >= speed_limit # False
(speed_limit == my_speed) # False
(speed_limit != my_speed) # True
# assuming temp (a number) and is_it_raining (a boolean) have been set:
if temp < 65 and is_it_raining:
    print('wear a raincoat and bring an umbrella!')
elif temp > 65 and is_it_raining:
    print('bring an umbrella!')
elif temp < 65:
    print('wear a jacket!')
else:
    print('the weather is beautiful!')

temp = 41
is_it_raining = True
# wear a raincoat and bring an umbrella!

temp = 73
is_it_raining = True
# bring an umbrella!

temp = 56
is_it_raining = False
# wear a jacket!

temp = 80
is_it_raining = False
# the weather is beautiful!

Loops

While Loops:

s = ''
n = 5

while n > 0:
    n -= 1
    if (n % 2) == 0:
        continue

    a = ['foo', 'bar', 'baz']
    while a:
        s += str(n) + a.pop(0)
        if len(a) < 2:
            break

print(s) # '3foo3bar1foo1bar'

###############################

a = ['foo', 'bar', 'baz', 'qux', 'corge']
while a:
    if len(a) < 3:
        break
    print(a.pop())
print('Done.')

## This loop will output...
"""
corge
qux
baz
Done.
"""

For Loops:

transaction = {
  "amount": 10.00,
  "payee": "Joe Bloggs",
  "account": 1234
}

for key, value in transaction.items():
    print("{}: {}".format(key, value))

# Output (dicts preserve insertion order in Python 3.7+):
# amount: 10.0
# payee: Joe Bloggs
# account: 1234

###############################

# else DOES execute
for i in ['foo', 'bar', 'baz', 'qux']:
  print(i)
else:
  print('Done.') # foo, bar, baz, qux, Done.

# else DOES NOT execute
for i in ['foo', 'bar', 'baz', 'qux']:
  if i == 'bar':
    break
  print(i)
else:
  print('Done.') # foo

Infinite Loops (Yikes!)

# Infinite Loop
a = ['foo', 'bar', 'baz', 'qux', 'corge']
while a:
    if len(a) < 3:
        continue
    print(a.pop())
print('Done.')

# Fixing the Infinite Loop
a = ['foo', 'bar', 'baz', 'qux', 'corge']  # start from a fresh list
while a:
    if len(a) < 3:
        break
    print(a.pop())
print('Done.', a) # Done. ['foo', 'bar']

OOP (Object-Oriented Programming)

Functions

def function_name(parameters):
    """docstring"""
    # statement(s)

def num_squared(num):
    """Find the square of some number passed in"""
    square = num*num # code to find the square
    return square

sq12 = num_squared(12)
print(sq12) # 144
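
Functions can also take keyword parameters with default values. A minimal sketch (greet is a hypothetical name, not from the course material):

```python
def greet(name, greeting='Hello'):
    """Return a greeting; 'greeting' is a keyword parameter with a default."""
    return f'{greeting}, {name}!'

print(greet('Ada'))                      # Hello, Ada!
print(greet('Ada', greeting='Welcome'))  # Welcome, Ada!
```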

Classes

Parent class:

class Animal:
    def __init__(self, species = '', diet= ''):
        self.species = species
        self.diet = diet

    kingdom = 'Animalia'

    def my_kingdom(self):
        print(self.kingdom)

    def feed_me(self):
        if self.diet == 'omnivore':
            food = 'plants and meat'
        elif self.diet == 'carnivore':
            food = 'meat'
        elif self.diet == 'herbivore':
            food = 'plants'
        else:
            food = 'something unknown'  # avoids a NameError for unrecognized diets
        print(f'{self.species} eat {food}!')
        return None

Child class w. inheritance:

class Elephant(Animal):
    def __init__(self, name, genus = '', species = '', habitat = '', age = None):
        self.name = name
        self.genus = genus
        self.species = species
        self.habitat = habitat
        self.age = age
        self.taxonomy = {
            'Kingdom': Animal.kingdom,
            'Class': self.common_taxonomy['Class'],
            'Family': self.common_taxonomy['Family'],
            'Genus': self.genus,
            'Species': self.species,
        }

    diet = 'herbivore'  # lowercase so it matches the checks in Animal.feed_me()

    common_taxonomy = {
        'Class': 'Mammalia',
        'Family': 'Elephantidae',
    }

    def summary(self):
        print(f'All about {self.name} -')
        print(f'Elephant, age {self.age}\nHabitat: {self.habitat}\nDiet: {self.diet}\n\nTaxonomy:')
        for k, v in self.taxonomy.items():
            print(f'{k}: {v}')
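
The Elephant class above sets its own attributes directly; another common pattern is to call the parent's __init__ via super(). A minimal, self-contained sketch (the class and attribute names here are illustrative, not the ones above):

```python
class Animal:
    def __init__(self, species=''):
        self.species = species

class Elephant(Animal):
    def __init__(self, name, species='Loxodonta africana'):
        super().__init__(species=species)  # let the parent set shared attributes
        self.name = name

dumbo = Elephant('Dumbo')
print(dumbo.name, dumbo.species)  # Dumbo Loxodonta africana
```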

Data Science Strategy

More coming soon...

Pandas

Basic Objects: ndarrays, Series & DataFrames
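
A brief sketch of the two labeled pandas objects named above (assuming pandas is installed; the data values are made up):

```python
import pandas as pd

# A Series is a labeled 1-D array
grades = pd.Series([98, 89, 92], index=['quiz1', 'quiz2', 'quiz3'])

# A DataFrame is a 2-D table whose columns are Series
df = pd.DataFrame({'name': ['Taq', 'Zola'], 'score': [94, 100]})

print(grades['quiz1'])      # 98
print(df.shape)             # (2, 2)
print(type(df.to_numpy()))  # <class 'numpy.ndarray'>
```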

Data Visualization

Python Project Ideas

Overview

Your final project should address a data-related problem in a professional field that interests you. Pick any subject that you're passionate about! Your project should reflect significant original work in applying data science techniques to an interesting problem. Although final projects are individual assignments, peer code review is strongly encouraged.

To help spark ideas, we put together a smorgasbord of cool public data sources. Using public data is the most common choice. If you have access to private data, that's also an option, though you'll have to be careful about what results you can release.

Project Deliverables

You are responsible for creating a project paper and a project presentation. The paper should be written with a technical audience in mind, while the presentation should target a more general audience. You will deliver your presentation (including slides) during the final week of class.

Here are the components you should aim to cover in your paper:

  • Problem statement and hypothesis
  • Data dictionary
  • Description of your data set and how it was obtained
  • Description of any pre-processing steps you took (i.e. wrangling & cleaning)
  • What you learned from exploring the data, including visualizations
  • How you chose which features to use in your analysis
  • Your challenges and successes
  • Conclusions and key learnings
  • Possible extensions or business applications of your project

Your presentation should summarize the above components while focusing on an engaging, clear, and informative story about your project.

Submission & Presentation

Deliver your project presentation and submit all required deliverables (paper, slides, code, data, and data dictionary).

Your project paper, presentation slides, and code should be included in a GitHub repository, along with all of your data and a data dictionary. If it's not possible or practical to include your data, you should link to your data source and provide a sample of the data (anonymized if necessary).

Example Project Outline

Question and Data Set(s)

What is the question you hope to answer? What data are you planning to use to answer that question? What do you know about the data so far? Why did you choose this topic?

Example:

  • I'm planning to predict passenger survival on the Titanic.
  • I have Kaggle's Titanic dataset with 10 passenger characteristics.
  • I know that many of the fields have missing values, that some of the text fields are messy and will require cleaning, and that about 38% of the passengers in the training set survived.
  • I chose this topic because I'm fascinated by the history of the Titanic.

Data Exploration and Analysis Plan

What data have you gathered, and how did you gather it? What steps have you taken to explore the data? Which areas of the data have you cleaned, and which areas still need cleaning? What insights have you gained from your exploration? Will you be able to answer your question with this data, or do you need to gather more data (or adjust your question)? How might you use modeling to answer your question?

Example:

  • I've created visualizations and numeric summaries to explore how survivability differs by passenger characteristic, and it appears that gender and class have a large role in determining survivability.
  • I estimated missing values for age using the titles provided in the Name column.
  • I created features to represent "spouse on board" and "child on board" by further analyzing names.
  • I think that the fare and ticket columns might be useful for predicting survival, but I still need to clean those columns.
  • I analyzed the differences between the training and testing sets, and found that the average fare was slightly higher in the testing set.
  • Since I'm predicting a binary outcome, I plan to use a classification method such as logistic regression to make my predictions.

Homework

Please find homework details here.

Homework 1

From the Class PSETs, solve:

  1. RPS
  2. Logic Problems

How to Submit

Please zip up the files and DM your IA and instructor.

Homework 2

From the Class PSETs, solve:

  1. Lists
  2. Dicts

How to Submit

Please zip up the files and DM your IA and instructor.

Homework 3

From the Class PSETs, solve:

  1. Functions & Modules

How to Submit

Please zip up the files and DM your IA and instructor.

Homework 4

From the Class PSETs, solve:

  1. Classes

How to Submit

Please zip up the files and DM your IA and instructor.

Homework 5

Data sets

  • OMDB Movies | Local
    • You can download a version of the dataset directly from this Github Repo

Problems

  1. Import the data CSV as dataframe (See above for link to dataset)
  2. Print first 5 rows
  3. Print out the num rows and cols in the dataset
  4. Print out column names
  5. Print out the column data types
  6. How many unique genres are available in the dataset?
  7. How many movies are available per genre?
  8. What are the top 5 R-rated movies? (hint: Boolean filters needed! Then sorting!)
  9. What is the average Rotten Tomatoes score for all available films?
  10. Same question as above, but for the top 5 films
  11. What is the Five Number Summary like for top rated films as per IMDB?
  12. Find the ratio between Rotten Tomato rating vs IMDB rating for all films. Update the dataframe to include a Ratings Ratio column (inplace).
  13. Find the top 3 ratings ratio movies (rated higher on IMDB compared to Rotten Tomatoes)

How to Submit

Please zip up the files and DM your IA and instructor.

Final Project Requirements

First off, let's take a second to congratulate you for making it this far! We know we've packed a lot of knowledge into a relatively short time! Kudos for rocking it!

Prompt

We'd like you to have something tangible to show for having taken this course with us, so let's use your newly acquired Pandas skills to make a data-pulling app!

Got Ideas?

You are free to make a website about anything you'd like, as long as it meets all the requirements listed below. If you're having trouble coming up with a topic, consider:

  • Finding a fun data set and basing it on that.
  • Making an app that contrasts data from your hobbies.
  • Making something you think would improve your life in some way.

Feel free to share resources and inspiration with your classmates!

Deliverables

You must have a Pandas app in Jupyter Notebooks. You will work individually on this project, but feel free to share inspiration, resources, or cool data sets that you find with your classmates!

Requirements

Your assignment must include:

  1. Data pulled from at least one data set.
    • Get creative! Tons of free data sets exist! Ask your instructor or classmates for ideas.
    • Free Datasets
  2. Data displayed in a minimum of two different visualizations.
    • Take care that they're the best choice of visualizations for the data and are easy to comprehend.
  3. Cleaning the data — handling of NULL values or other potential errors in the data.
  4. Core Python topics. At minimum:
    • Dictionaries or sets or tuples.
    • *args or **kwargs.
    • Basic debugging, such as a try-except block (only if necessary).
    • A class.
    • User input or reading from a file.
  5. Comments, so another developer can easily see what your app does.
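
Requirement 4 mentions *args and **kwargs; here is a minimal sketch of a function that accepts arbitrary positional and keyword arguments (the name describe is hypothetical):

```python
def describe(*args, **kwargs):
    """Accept any number of positional and keyword arguments."""
    return f'{len(args)} positional, {len(kwargs)} keyword'

print(describe(1, 2, 3, color='red'))  # 3 positional, 1 keyword
```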

Resources

Suggested Ways to Get Started

  • Begin with the end in mind. Know where you want to go by planning ahead, so you don't waste time building things you don't need.
  • Read the docs for whatever technologies or data sets you use. Most of the time, there is a tutorial that you can follow! This isn't always the case, however, learning to read documentation is crucial to your success as a developer.
  • Write pseudocode before you write actual code. Thinking through the logic of something helps.

Additional Resources

Evaluation

Your project will be evaluated based on the rubric below.

Rubric

Score Expectations
0 Incomplete.
1 Does not meet expectations.
2 Meets expectations, good job!
3 Exceeds expectations, you wonderful creature, you!

A zero-to-three grading scale may not intuitively make sense, so here is an example using the criteria as if your assignment were to cook a pizza:

Crust
  0 Incomplete: No crust present. Submission is just cheese and sauce on a plate.
  1 Does not meet expectations: Pizza has a crust, but it is raw.
  2 Meets expectations: Crust is cooked thoroughly.
  3 Exceeds expectations: Crust is golden brown and just thin enough without being too thick.

Cheese
  0 Incomplete: No cheese present.
  1 Does not meet expectations: Cheese is made of soy.
  2 Meets expectations: Cheese covers the pizza from edge to edge.
  3 Exceeds expectations: Cheese is delicious, plentiful, and melted to perfection.

Submission

The Jupyter Notebook artifact must be uploaded to Github. Share the URL to your github repo so that you can present it to class on June 4th.

Resources

Always adding more! :D

BONUS! Python Communities

Python Glossary

Source: https://docs.python.org/2/glossary.html

>>>

The default Python prompt of the interactive shell. Often seen for code examples which can be executed interactively in the interpreter.

...

The default Python prompt of the interactive shell when entering code for an indented code block, when within a pair of matching left and right delimiters (parentheses, square brackets, curly braces or triple quotes), or after specifying a decorator.

2to3

A tool that tries to convert Python 2.x code to Python 3.x code by handling most of the incompatibilities which can be detected by parsing the source and traversing the parse tree.

2to3 is available in the standard library as lib2to3; a standalone entry point is provided as Tools/scripts/2to3. See 2to3 - Automated Python 2 to 3 code translation.

abstract base class

Abstract base classes complement duck-typing by providing a way to define interfaces when other techniques like hasattr() would be clumsy or subtly wrong (for example with magic methods). ABCs introduce virtual subclasses, which are classes that don’t inherit from a class but are still recognized by isinstance() and issubclass(); see the abc module documentation. Python comes with many built-in ABCs for data structures (in the collections module), numbers (in the numbers module), and streams (in the io module). You can create your own ABCs with the abc module.

argument

A value passed to a function (or method) when calling the function. There are two types of arguments:

keyword argument: an argument preceded by an identifier (e.g. name=) in a function call or passed as a value in a dictionary preceded by **. For example, 3 and 5 are both keyword arguments in the following calls to complex():

complex(real=3, imag=5)
complex(**{'real': 3, 'imag': 5})

positional argument

an argument that is not a keyword argument. Positional arguments can appear at the beginning of an argument list and/or be passed as elements of an iterable preceded by *. For example, 3 and 5 are both positional arguments in the following calls:

complex(3, 5)
complex(*(3, 5))

Arguments are assigned to the named local variables in a function body. See the Calls section for the rules governing this assignment. Syntactically, any expression can be used to represent an argument; the evaluated value is assigned to the local variable.

attribute

A value associated with an object which is referenced by name using dotted expressions. For example, if an object o has an attribute a it would be referenced as o.a.

BDFL

Benevolent Dictator For Life, a.k.a. Guido van Rossum, Python’s creator.

bytes-like object

An object that supports the buffer protocol, like str, bytearray or memoryview. Bytes-like objects can be used for various operations that expect binary data, such as compression, saving to a binary file or sending over a socket. Some operations need the binary data to be mutable, in which case not all bytes-like objects can apply.

bytecode

Python source code is compiled into bytecode, the internal representation of a Python program in the CPython interpreter. The bytecode is also cached in .pyc and .pyo files so that executing the same file is faster the second time (recompilation from source to bytecode can be avoided). This “intermediate language” is said to run on a virtual machine that executes the machine code corresponding to each bytecode. Do note that bytecodes are not expected to work between different Python virtual machines, nor to be stable between Python releases.

A list of bytecode instructions can be found in the documentation for the dis module.

class

A template for creating user-defined objects. Class definitions normally contain method definitions which operate on instances of the class.

classic class

Any class which does not inherit from object. See new-style class. Classic classes have been removed in Python 3.

coercion

The implicit conversion of an instance of one type to another during an operation which involves two arguments of the same type. For example, int(3.15) converts the floating point number to the integer 3, but in 3+4.5, each argument is of a different type (one int, one float), and both must be converted to the same type before they can be added or it will raise a TypeError. Coercion between two operands can be performed with the coerce built-in function; thus, 3+4.5 is equivalent to calling operator.add(*coerce(3, 4.5)) and results in operator.add(3.0, 4.5). Without coercion, all arguments of even compatible types would have to be normalized to the same value by the programmer, e.g., float(3)+4.5 rather than just 3+4.5.

complex number

An extension of the familiar real number system in which all numbers are expressed as a sum of a real part and an imaginary part. Imaginary numbers are real multiples of the imaginary unit (the square root of -1), often written i in mathematics or j in engineering. Python has built-in support for complex numbers, which are written with this latter notation; the imaginary part is written with a j suffix, e.g., 3+1j. To get access to complex equivalents of the math module, use cmath. Use of complex numbers is a fairly advanced mathematical feature. If you’re not aware of a need for them, it’s almost certain you can safely ignore them.
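
For illustration, a few complex-number operations in Python:

```python
z = 3 + 1j             # the j suffix marks the imaginary part
print(z.real, z.imag)  # 3.0 1.0
print(abs(3 + 4j))     # 5.0 (the magnitude)
```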

context manager

An object which controls the environment seen in a with statement by defining __enter__() and __exit__() methods.

CPython

The canonical implementation of the Python programming language, as distributed on python.org. The term “CPython” is used when necessary to distinguish this implementation from others such as Jython or IronPython.

decorator

A function returning another function, usually applied as a function transformation using the @wrapper syntax. Common examples for decorators are classmethod() and staticmethod().

The decorator syntax is merely syntactic sugar, the following two function definitions are semantically equivalent:

def f(...):
    ...
f = staticmethod(f)

@staticmethod
def f(...):
    ...

The same concept exists for classes, but is less commonly used there.

descriptor

Any new-style object which defines the methods __get__(), __set__(), or __delete__(). When a class attribute is a descriptor, its special binding behavior is triggered upon attribute lookup. Normally, using a.b to get, set or delete an attribute looks up the object named b in the class dictionary for a, but if b is a descriptor, the respective descriptor method gets called. Understanding descriptors is a key to a deep understanding of Python because they are the basis for many features including functions, methods, properties, class methods, static methods, and reference to super classes.

dictionary

An associative array, where arbitrary keys are mapped to values. The keys can be any object with __hash__() and __eq__() methods. Called a hash in Perl.

dictionary view

The objects returned from dict.viewkeys(), dict.viewvalues(), and dict.viewitems() are called dictionary views. They provide a dynamic view on the dictionary’s entries, which means that when the dictionary changes, the view reflects these changes. To force the dictionary view to become a full list use list(dictview).
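
In Python 3 the view methods are simply dict.keys(), dict.values(), and dict.items(); a small sketch of the dynamic behavior described above:

```python
d = {'a': 1}
keys = d.keys()    # a dynamic view, not a copy
d['b'] = 2
print(list(keys))  # ['a', 'b'] (the view reflects the change)
```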

docstring

A string literal which appears as the first expression in a class, function or module. While ignored when the suite is executed, it is recognized by the compiler and put into the __doc__ attribute of the enclosing class, function or module. Since it is available via introspection, it is the canonical place for documentation of the object.

duck-typing

A programming style which does not look at an object’s type to determine if it has the right interface; instead, the method or attribute is simply called or used (“If it looks like a duck and quacks like a duck, it must be a duck.”) By emphasizing interfaces rather than specific types, well-designed code improves its flexibility by allowing polymorphic substitution. Duck-typing avoids tests using type() or isinstance(). (Note, however, that duck-typing can be complemented with abstract base classes.) Instead, it typically employs hasattr() tests or EAFP programming.

EAFP

Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
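
A minimal EAFP sketch (the inventory dict is made up for illustration):

```python
inventory = {'apples': 3}

# EAFP: assume the key exists and catch the exception if it doesn't
try:
    count = inventory['pears']
except KeyError:
    count = 0
print(count)  # 0
```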

expression

A piece of syntax which can be evaluated to some value. In other words, an expression is an accumulation of expression elements like literals, names, attribute access, operators or function calls which all return a value. In contrast to many other languages, not all language constructs are expressions. There are also statements which cannot be used as expressions, such as print or if. Assignments are also statements, not expressions.

extension module

A module written in C or C++, using Python’s C API to interact with the core and with user code.

file object

An object exposing a file-oriented API (with methods such as read() or write()) to an underlying resource. Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.). File objects are also called file-like objects or streams.

There are actually three categories of file objects: raw binary files, buffered binary files and text files. Their interfaces are defined in the io module. The canonical way to create a file object is by using the open() function.

file-like object

A synonym for file object.

finder

An object that tries to find the loader for a module. It must implement a method named find_module().

floor division

Mathematical division that rounds down to nearest integer. The floor division operator is //. For example, the expression 11 // 4 evaluates to 2 in contrast to the 2.75 returned by float true division. Note that (-11) // 4 is -3 because that is -2.75 rounded downward. See PEP 238.

function

A series of statements which returns some value to a caller. It can also be passed zero or more arguments which may be used in the execution of the body. See also parameter, method, and the Function definitions section.

future

A pseudo-module which programmers can use to enable new language features which are not compatible with the current interpreter. For example, the expression 11/4 currently evaluates to 2. If the module in which it is executed had enabled true division by executing:

from __future__ import division

the expression 11/4 would evaluate to 2.75. By importing the __future__ module and evaluating its variables, you can see when a new feature was first added to the language and when it will become the default:

>>> import __future__
>>> __future__.division
_Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192)

garbage collection

The process of freeing memory when it is not used anymore. Python performs garbage collection via reference counting and a cyclic garbage collector that is able to detect and break reference cycles.

generator

A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator resumes, it picks up where it left off (in contrast to functions which start fresh on every invocation).
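
A minimal generator sketch (countdown is a hypothetical name):

```python
def countdown(n):
    """Yield n, n-1, ..., 1, one value at a time."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))  # 3 (execution pauses at each yield)
print(list(gen))  # [2, 1] (resumes where it left off)
```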

generator expression

An expression that returns an iterator. It looks like a normal expression followed by a for expression defining a loop variable, range, and an optional if expression. The combined expression generates values for an enclosing function:

>>> sum(i*i for i in range(10))  # sum of squares 0, 1, 4, ... 81
285

GIL (global interpreter lock)

The mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines.

However, some extension modules, either standard or third-party, are designed so as to release the GIL when doing computationally-intensive tasks such as compression or hashing. Also, the GIL is always released when doing I/O.

Past efforts to create a “free-threaded” interpreter (one which locks shared data at a much finer granularity) have not been successful because performance suffered in the common single-processor case. It is believed that overcoming this performance issue would make the implementation much more complicated and therefore costlier to maintain.

hashable

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.

Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.

All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().

IDLE

An Integrated Development Environment for Python. IDLE is a basic editor and interpreter environment which ships with the standard distribution of Python.

immutable

An object with a fixed value. Immutable objects include numbers, strings and tuples. Such an object cannot be altered. A new object has to be created if a different value has to be stored. They play an important role in places where a constant hash value is needed, for example as a key in a dictionary.
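
A quick illustration: attempting to change a tuple raises a TypeError:

```python
point = (3, 4)
try:
    point[0] = 99      # tuples don't support item assignment
except TypeError:
    mutated = False
print(mutated)  # False
```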

integer division

Mathematical division discarding any remainder. For example, the expression 11/4 currently evaluates to 2 in contrast to the 2.75 returned by float division. Also called floor division. When dividing two integers the outcome will always be another integer (having the floor function applied to it). However, if one of the operands is another numeric type (such as a float), the result will be coerced (see coercion) to a common type. For example, an integer divided by a float will result in a float value, possibly with a decimal fraction. Integer division can be forced by using the // operator instead of the / operator.

importing

The process by which Python code in one module is made available to Python code in another module.

importer

An object that both finds and loads a module; both a finder and loader object.

interactive

Python has an interactive interpreter which means you can enter statements and expressions at the interpreter prompt, immediately execute them and see their results. Just launch python with no arguments (possibly by selecting it from your computer’s main menu). It is a very powerful way to test out new ideas or inspect modules and packages (remember help(x)).

interpreted

Python is an interpreted language, as opposed to a compiled one, though the distinction can be blurry because of the presence of the bytecode compiler. This means that source files can be run directly without explicitly creating an executable which is then run. Interpreted languages typically have a shorter development/debug cycle than compiled ones, though their programs generally also run more slowly. See also interactive.

iterable

An object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict and file and objects of any classes you define with an __iter__() or __getitem__() method. Iterables can be used in a for loop and in many other places where a sequence is needed (zip(), map(), …). When an iterable object is passed as an argument to the built-in function iter(), it returns an iterator for the object. This iterator is good for one pass over the set of values. When using iterables, it is usually not necessary to call iter() or deal with iterator objects yourself. The for statement does that automatically for you, creating a temporary unnamed variable to hold the iterator for the duration of the loop.

iterator

An object representing a stream of data. Repeated calls to the iterator’s next() method return successive items in the stream. When no more data are available a StopIteration exception is raised instead. At this point, the iterator object is exhausted and any further calls to its next() method just raise StopIteration again. Iterators are required to have an __iter__() method that returns the iterator object itself so every iterator is also iterable and may be used in most places where other iterables are accepted. One notable exception is code which attempts multiple iteration passes. A container object (such as a list) produces a fresh new iterator each time you pass it to the iter() function or use it in a for loop. Attempting this with an iterator will just return the same exhausted iterator object used in the previous iteration pass, making it appear like an empty container.
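
A short sketch of the protocol described above, using the built-ins iter() and next() (in Python 3 the iterator method is spelled __next__(), called via next()):

```python
colors = ['red', 'green']
it = iter(colors)   # the container hands out a fresh iterator
print(next(it))     # red
print(next(it))     # green
try:
    next(it)        # stream exhausted: StopIteration is raised
except StopIteration:
    print('exhausted')
```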

key function

A key function or collation function is a callable that returns a value used for sorting or ordering. For example, locale.strxfrm() is used to produce a sort key that is aware of locale specific sort conventions.

A number of tools in Python accept key functions to control how elements are ordered or grouped. They include min(), max(), sorted(), list.sort(), heapq.nsmallest(), heapq.nlargest(), and itertools.groupby().

There are several ways to create a key function. For example, the str.lower() method can serve as a key function for case-insensitive sorts. Alternatively, an ad-hoc key function can be built from a lambda expression such as lambda r: (r[0], r[2]). Also, the operator module provides three key function constructors: attrgetter(), itemgetter(), and methodcaller(). See the Sorting HOW TO for examples of how to create and use key functions.
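A minimal example of the str.lower() trick — without a key, uppercase letters sort before lowercase ones:

```python
words = ["Banana", "apple", "Cherry"]
default_order = sorted(words)                # uppercase sorts before lowercase
folded_order = sorted(words, key=str.lower)  # case-insensitive ordering
print(default_order)   # ['Banana', 'Cherry', 'apple']
print(folded_order)    # ['apple', 'Banana', 'Cherry']
```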

keyword argument

See argument.

lambda

An anonymous inline function consisting of a single expression which is evaluated when the function is called. The syntax to create a lambda function is lambda [parameters]: expression
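For example (the variable names here are just illustrative):

```python
square = lambda x: x * x          # single expression, no return statement
print(square(5))                  # 25

# lambdas are handy as throwaway key functions:
pairs = [(1, "b"), (2, "a")]
by_letter = sorted(pairs, key=lambda p: p[1])
print(by_letter)                  # [(2, 'a'), (1, 'b')]
```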

LBYL

Look before you leap. This coding style explicitly tests for pre-conditions before making calls or lookups. This style contrasts with the EAFP approach and is characterized by the presence of many if statements.

In a multi-threaded environment, the LBYL approach can risk introducing a race condition between “the looking” and “the leaping”. For example, the code, if key in mapping: return mapping[key] can fail if another thread removes key from mapping after the test, but before the lookup. This issue can be solved with locks or by using the EAFP approach.
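The two styles side by side, using a made-up settings dict:

```python
mapping = {"host": "localhost"}   # hypothetical settings dict

# LBYL: test the pre-condition first, then act.
if "host" in mapping:
    host = mapping["host"]

# EAFP: just act, and handle the failure if it comes.
try:
    port = mapping["port"]
except KeyError:
    port = 8080                   # fall back to a default
print(host, port)                 # localhost 8080
```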

list

A built-in Python sequence. Despite its name it is more akin to an array in other languages than to a linked list since access to elements is O(1).

list comprehension

A compact way to process all or part of the elements in a sequence and return a list with the results.

result = ["0x%02x" % x for x in range(256) if x % 2 == 0]
# generates a list of strings containing even hex numbers (0x..) in the
# range from 0 to 255. The if clause is optional; if omitted, all elements
# in range(256) are processed.

loader

An object that loads a module. It must define a method named load_module(). A loader is typically returned by a finder.

magic method

An informal synonym for special method.

mapping

A container object that supports arbitrary key lookups and implements the methods specified in the Mapping or MutableMapping abstract base classes. Examples include dict, collections.defaultdict, collections.OrderedDict and collections.Counter.

metaclass

The class of a class. Class definitions create a class name, a class dictionary, and a list of base classes. The metaclass is responsible for taking those three arguments and creating the class. Most object oriented programming languages provide a default implementation. What makes Python special is that it is possible to create custom metaclasses. Most users never need this tool, but when the need arises, metaclasses can provide powerful, elegant solutions. They have been used for logging attribute access, adding thread-safety, tracking object creation, implementing singletons, and many other tasks.

method

A function which is defined inside a class body. If called as an attribute of an instance of that class, the method will get the instance object as its first argument (which is usually called self). See function and nested scope.

module

An object that serves as an organizational unit of Python code. Modules have a namespace containing arbitrary Python objects. Modules are loaded into Python by the process of importing.

MRO (method resolution order)

Method Resolution Order is the order in which base classes are searched for a member during lookup.

mutable

Mutable objects can change their value but keep their id(). See also immutable.
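A quick demonstration — a list changes its value in place while id() stays constant:

```python
nums = [1, 2]
identity = id(nums)
nums.append(3)                 # the list's value changes in place...
print(nums)                    # [1, 2, 3]
print(id(nums) == identity)    # ...but its identity stays the same: True
```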

named tuple

Any tuple-like class whose indexable elements are also accessible using named attributes (for example, time.localtime() returns a tuple-like object where the year is accessible either with an index such as t[0] or with a named attribute like t.tm_year).

A named tuple can be a built-in type such as time.struct_time, or it can be created with a regular class definition. A full featured named tuple can also be created with the factory function collections.namedtuple(). The latter approach automatically provides extra features such as a self-documenting representation like Employee(name='jones', title='programmer').
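Here is the Employee example from above built with the factory function (a hypothetical record type — the field names become attributes):

```python
from collections import namedtuple

Employee = namedtuple("Employee", ["name", "title"])
e = Employee(name="jones", title="programmer")
print(e.name)    # attribute access: jones
print(e[1])      # index access still works: programmer
print(e)         # self-documenting repr: Employee(name='jones', title='programmer')
```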

namespace

The place where a variable is stored. Namespaces are implemented as dictionaries. There are the local, global and built-in namespaces as well as nested namespaces in objects (in methods). Namespaces support modularity by preventing naming conflicts. For instance, the functions __builtin__.open() and os.open() are distinguished by their namespaces. Namespaces also aid readability and maintainability by making it clear which module implements a function. For instance, writing random.seed() or itertools.izip() makes it clear that those functions are implemented by the random and itertools modules, respectively.

nested scope

The ability to refer to a variable in an enclosing definition. For instance, a function defined inside another function can refer to variables in the outer function. Note that nested scopes work only for reference and not for assignment which will always write to the innermost scope. In contrast, local variables both read and write in the innermost scope. Likewise, global variables read and write to the global namespace.
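A sketch with made-up names (make_counter is hypothetical): the inner function can read the enclosing count, and because assignment would create a new local variable, a mutable container is used to update it in place:

```python
def make_counter():
    count = [0]                # a mutable container lets the inner function update it

    def increment():
        count[0] += 1          # reading `count` works thanks to the nested scope
        return count[0]

    return increment

counter = make_counter()
print(counter())  # 1
print(counter())  # 2
```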

new-style class

Any class which inherits from object. This includes all built-in types like list and dict. Only new-style classes can use Python’s newer, versatile features like __slots__, descriptors, properties, and __getattribute__().

object

Any data with state (attributes or value) and defined behavior (methods). Also the ultimate base class of any new-style class.

package

A Python module which can contain submodules or recursively, subpackages. Technically, a package is a Python module with a __path__ attribute.

parameter

A named entity in a function (or method) definition that specifies an argument (or in some cases, arguments) that the function can accept. There are four types of parameters:

positional-or-keyword: specifies an argument that can be passed either positionally or as a keyword argument. This is the default kind of parameter, for example foo and bar in the following:

def func(foo, bar=None): ... positional-only: specifies an argument that can be supplied only by position. Python has no syntax for defining positional-only parameters. However, some built-in functions have positional-only parameters (e.g. abs()).

var-positional: specifies that an arbitrary sequence of positional arguments can be provided (in addition to any positional arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with *, for example args in the following:

def func(*args, **kwargs): ... var-keyword: specifies that arbitrarily many keyword arguments can be provided (in addition to any keyword arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with **, for example kwargs in the example above.

Parameters can specify both optional and required arguments, as well as default values for some optional arguments.

PEP

Python Enhancement Proposal. A PEP is a design document providing information to the Python community, or describing a new feature for Python or its processes or environment. PEPs should provide a concise technical specification and a rationale for proposed features.

PEPs are intended to be the primary mechanisms for proposing major new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python. The PEP author is responsible for building consensus within the community and documenting dissenting opinions.

Python 3000

Nickname for the Python 3.x release line (coined long ago when the release of version 3 was something in the distant future.) This is also abbreviated “Py3k”.

Pythonic

An idea or piece of code which closely follows the most common idioms of the Python language, rather than implementing code using concepts common to other languages. For example, a common idiom in Python is to loop over all elements of an iterable using a for statement. Many other languages don’t have this type of construct, so people unfamiliar with Python sometimes use a numerical counter instead:

for i in range(len(food)):
    print food[i]

As opposed to the cleaner, Pythonic method:

for piece in food:
    print piece

reference count

The number of references to an object. When the reference count of an object drops to zero, it is deallocated. Reference counting is generally not visible to Python code, but it is a key element of the CPython implementation. The sys module defines a getrefcount() function that programmers can call to return the reference count for a particular object.

__slots__

A declaration inside a new-style class that saves memory by pre-declaring space for instance attributes and eliminating instance dictionaries. Though popular, the technique is somewhat tricky to get right and is best reserved for rare cases where there are large numbers of instances in a memory-critical application.

sequence

An iterable which supports efficient element access using integer indices via the __getitem__() special method and defines a __len__() method that returns the length of the sequence. Some built-in sequence types are list, str, tuple, and unicode. Note that dict also supports __getitem__() and __len__(), but is considered a mapping rather than a sequence because the lookups use arbitrary immutable keys rather than integers.

slice

An object usually containing a portion of a sequence. A slice is created using the subscript notation, [] with colons between numbers when several are given, such as in variable_name[1:3:5]. The bracket (subscript) notation uses slice objects internally (or in older versions, __getslice__() and __setslice__()).

special method

A method that is called implicitly by Python to execute a certain operation on a type, such as addition. Such methods have names starting and ending with double underscores. Special methods are documented in Special method names.
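A small sketch with a made-up Money class: defining __add__ is what lets the + operator work on instances of the class.

```python
class Money(object):
    """A toy value type; __add__ is invoked implicitly by the + operator."""
    def __init__(self, cents):
        self.cents = cents
    def __add__(self, other):
        return Money(self.cents + other.cents)

total = Money(150) + Money(75)   # same as Money(150).__add__(Money(75))
print(total.cents)               # 225
```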

statement

A statement is part of a suite (a “block” of code). A statement is either an expression or one of several constructs with a keyword, such as if, while or for.

struct sequence

A tuple with named elements. Struct sequences expose an interface similar to named tuple in that elements can be accessed either by index or as an attribute. However, they do not have any of the named tuple methods like _make() or _asdict(). Examples of struct sequences include sys.float_info and the return value of os.stat().

triple-quoted string

A string which is bound by three instances of either a quotation mark (") or an apostrophe ('). While they don't provide any functionality not available with single-quoted strings, they are useful for a number of reasons. They allow you to include unescaped single and double quotes within a string and they can span multiple lines without the use of the continuation character, making them especially useful when writing docstrings.
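For example — both kinds of quotes and a line break, no escaping required:

```python
message = """She said "hi" and he said 'hello' --
no escaping needed, and the string spans two lines."""
print(message)
```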

type

The type of a Python object determines what kind of object it is; every object has a type. An object’s type is accessible as its __class__ attribute or can be retrieved with type(obj).

universal newlines

A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.

virtual environment

A cooperatively isolated runtime environment that allows Python users and applications to install and upgrade Python distribution packages without interfering with the behaviour of other Python applications running on the same system.

virtual machine

A computer defined entirely in software. Python’s virtual machine executes the bytecode emitted by the bytecode compiler.

Zen of Python

Listing of Python design principles and philosophies that are helpful in understanding and using the language. The listing can be found by typing “import this” at the interactive prompt.

Basic Statistics

Coming soon...

Pandas Glossary

Reading & Writing Data

  • pd.read_csv(filename) -- From a CSV file
  • pd.read_table(filename) -- From a delimited text file (like TSV)
  • pd.read_excel(filename) -- From an Excel file
  • pd.read_sql(query, connection_object) -- Reads from a SQL table/database
  • pd.read_json(json_string) -- Reads from a JSON formatted string, URL or file.
  • pd.read_html(url) -- Parses an html URL, string or file and extracts tables to a list of dataframes
  • pd.read_clipboard() -- Takes the contents of your clipboard and passes it to read_table()
  • pd.DataFrame(dict) -- From a dict, keys for columns names, values for data as lists
  • df.to_csv(filename) -- Writes to a CSV file
  • df.to_excel(filename) -- Writes to an Excel file
  • df.to_sql(table_name, connection_object) -- Writes to a SQL table
  • df.to_json(filename) -- Writes to a file in JSON format
  • df.to_html(filename) -- Saves as an HTML table
  • df.to_clipboard() -- Writes to the clipboard
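A minimal round-trip sketch of pd.read_csv() and df.to_csv(). To keep it self-contained, an in-memory StringIO stands in for a real file path (the data is made up):

```python
import pandas as pd
from io import StringIO

csv_text = "name,score\nAda,95\nGrace,88\n"   # a tiny CSV held in memory
df = pd.read_csv(StringIO(csv_text))          # StringIO stands in for a filename
print(df.shape)                               # (2, 2)
print(df["name"][0])                          # Ada
print(df.to_csv(index=False))                 # writes the frame back out as CSV text
```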

Data Wrangling (Selecting)

  • obj.get(key) -- returns an item from an object (e.g. a column from a DataFrame, a value from a Series, etc.)
  • df[col] -- select and name a column and return it as a Series
  • df.loc[[label1, label2, ...]] -- select one or more rows in a DataFrame by label
  • df.loc[row_label, col_label] -- select a single item in a DataFrame by its row and column labels
  • df.loc[start_row_label : end_row_label, start_col_label : end_col_label] -- select a slice of a DataFrame by starting and ending row/column labels
  • df.iloc[row_index,:] -- select a row in a DataFrame by index position
  • df.iloc[row_index, col_index] -- select a single item in a DataFrame by the index position of its row and col
  • df.iloc[start_index : end_index, start_index : end_index] -- select a slice of a DataFrame by starting and ending row/column index positions (the ending index is exclusive)
  • s.iloc[index] -- select a single item by its position
  • s.loc[label] -- select a single item or a slice of items from a Series by label
  • df[[col1, col2]] -- select and name multiple columns and return them as a new data frame
  • df.nlargest(n, 'value') -- Select and order top n entries.
  • df.nsmallest(n, 'value') -- Select and order bottom n entries
  • obj.truncate([before, after, axis]) -- Truncate an object before and after some index value (*S & df)
  • obj.where(cond, other = NaN, inplace = False, axis = None) -- replace values in the object where the condition is False
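The loc/iloc distinction above is easiest to see side by side. A sketch with a small, made-up DataFrame — note that label slices include their endpoint while position slices exclude it:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Grace", "Alan"], "score": [95, 88, 91]},
    index=["r1", "r2", "r3"],
)
print(df.loc["r2", "score"])       # by labels -> 88
print(df.iloc[1, 1])               # by positions -> the same 88
print(df.loc["r1":"r2", "name"].tolist())  # label slice INCLUDES the endpoint
print(df.iloc[0:2, 0].tolist())            # position slice EXCLUDES the endpoint
```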

Data Cleaning

  • pd.isnull() -- checks for null values in the data and returns an array of booleans, where "True" means missing and "False" means present
  • pd.notnull() -- opposite of pd.isnull(); returns all values that are NOT null
  • df.dropna() -- drops all rows that contain null values
  • df.dropna(axis=1) -- drops all columns that contain null values
  • df.dropna(axis=1,thresh=n) -- drops all columns that have fewer than n non-null values
  • df.fillna(x) -- replaces all null values with some value "x"
  • s.fillna(s.mean()) -- replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
  • df.duplicated([subset, keep]) -- returns a boolean Series denoting duplicate rows; can choose to consider a subset of columns
  • df.drop_duplicates([subset, keep, inplace]) -- returns a DataFrame with duplicate rows removed, optionally only considering certain columns
  • s.replace(1,'one') -- replaces all values equal to 1 with 'one'
  • s.replace([1,3],['one','three']) -- replaces all values equal to 1 with 'one' and all values equal to 3 with 'three'
  • df.rename(columns={'old_name': 'new_name'}) -- renames specific columns
  • df.rename(columns=lambda x: x + 1) -- mass renaming of columns
  • df.rename(index=lambda x: x + 1) -- mass renaming of the index
  • df.columns = ['a','b','c'] -- renames all columns
  • df.set_index('column_one') -- changes the index of the data frame
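Two of the cleaning moves above sketched with made-up numbers — filling a gap with the column mean, and dropping an exact duplicate row:

```python
import pandas as pd
import numpy as np

# Filling a gap with the mean of the non-null values (2.0 and 4.0 average to 3.0):
s = pd.Series([2.0, np.nan, 4.0])
filled = s.fillna(s.mean())
print(filled.tolist())               # [2.0, 3.0, 4.0]

# Flagging and dropping an exact duplicate row:
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
print(df.duplicated().tolist())      # [False, True, False]
print(len(df.drop_duplicates()))     # 2
```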

Exploring Data

  • df.info() -- returns index, datatype and memory information
  • df.shape -- returns the number of rows and columns in a data frame
  • len(obj) -- returns # of rows in the object data (*S & df)
  • obj.size -- returns # of elements in the object (*S & df)
  • df.index -- returns index of the rows specifically (*S & df)
  • df.columns -- returns the column labels of the DataFrame.
  • df.head(n) -- returns first n rows of a data frame
  • df.tail(n) -- returns last n rows of a data frame
  • obj.copy() -- create a deep copy of the object (*S & df)
  • obj.empty -- returns booleans for whether object is empty or not
  • df.describe() -- returns basic summary statistics (e.g. count, mean, std, min, quartiles, & max)
  • df.count() -- returns number of non-null values in each data frame column
  • s.value_counts() -- returns the count of each unique value in a Series
  • df.mean() -- returns mean of all columns
  • df.median() -- returns median of each column
  • df.min() -- returns lowest value in each column
  • df.max() -- returns highest value in each column
  • df.quantile(x) -- returns the value at quantile x of each column
  • df.cumsum() -- cumulative sum
  • df.cumprod() -- cumulative product
  • df.cummin() -- cumulative minimum
  • df.var() -- returns the variance among values in each column
  • df.std() -- returns standard deviation of each column
  • df.cov() -- covariance
  • df.mad() -- mean absolute deviation
  • df.skew() -- skewness of distribution
  • df.sem() -- unbiased standard error of the mean
  • df.kurt() -- kurtosis
  • df.corr() -- returns the Pearson correlation coefficient between columns in a data frame
  • s.autocorr() -- auto-correlation
  • df.diff() -- first discrete difference
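A few of these in action on a toy one-column DataFrame (numbers made up):

```python
import pandas as pd

df = pd.DataFrame({"score": [1, 2, 3, 4]})
print(df.shape)                           # (4, 1)
print(df["score"].mean())                 # 2.5
print(df["score"].cumsum().tolist())      # [1, 3, 6, 10]
print(df.describe().loc["max", "score"])  # 4.0
```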

Organizing Data

  • df1.append(df2) -- add the rows of df2 to the end of df1 (columns should be identical)
  • pd.concat([df1, df2], axis=1) -- add the columns of df2 to the end of df1 (rows should be identical)
  • df1.join(df2, on=col1, how='inner') -- SQL-style join of the columns in df1 with the columns of df2 where the rows for col1 have identical values; how can be one of: 'left', 'right', 'outer', 'inner'
  • df.sort_values(col1) -- sort values in a certain column in ascending order
  • df.sort_values(col2,ascending=False) -- sort values in a certain column in descending order
  • df.sort_values([col1,col2],ascending=[True,False]) -- sort values in col1 in ascending order, then sort values in col2 in descending order
  • df[df[col] > 0.5] -- rows where the values in col are greater than 0.5
  • df[(df[col] > 0.5) & (df[col] < 0.7)] -- rows where 0.5 < col < 0.7
  • df.groupby(col) -- returns groupby object for values from a single, specific column
  • df.groupby([col1,col2]) -- returns a groupby object for values from multiple columns, which you can specify
  • df.groupby(col1)[col2].mean() -- returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
  • df.pivot_table(index=col1, values=[col2,col3], aggfunc=np.mean) -- creates a pivot table that groups by col1 and calculates the mean of col2 and col3
  • df.groupby(col1).agg(np.mean) -- finds the average across all columns for every unique value in col1
  • df.apply(np.<function>) -- applies a function across each column
  • df.apply(np.<function>, axis=1) -- applies a function across each row
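Grouping and sorting sketched with hypothetical sales rows — groupby("region") buckets the rows, then mean() averages within each bucket:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales":  [10, 20, 30, 40],
})
means = df.groupby("region")["sales"].mean()
print(means["east"])   # (10 + 30) / 2 -> 20.0
print(means["west"])   # (20 + 40) / 2 -> 30.0
print(df.sort_values("sales", ascending=False)["sales"].tolist())  # [40, 30, 20, 10]
```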

Sources

General Reference Guides

Libraries, Packages, & Other Tools

Cheat Sheets

Helpful Articles & Tutorials

Open Source Datasets

About

Built by your boy Taq Karim and Julianna Garreffa 😍 with this, this, ☕☕☕ and ❤️.

Find the project source on github.