February 10, 2022

Introduction to Python Data Science Tools

The goal of this introductory Python Data Science tutorial is to illustrate the most commonly used concepts and tools for beginners.

Python Collection Data Types

There are four collection data types in Python: list, tuple, set, and dictionary.

The types differ in whether they are ordered, whether they can be changed after creation, and whether they allow duplicate members.

You can check usage examples in the Python Cheatsheet.

NOTE: As of Python 3.7, dictionaries preserve insertion order.
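As a minimal sketch of the four collection types, the snippet below uses made-up fruit values. The variable names are illustrative only and do not come from the cheatsheet.

```python
# Illustrative values only; all names here are hypothetical.
fruits_list = ['apple', 'banana', 'apple']    # list: ordered, changeable, allows duplicates
fruits_tuple = ('apple', 'banana', 'apple')   # tuple: ordered, unchangeable, allows duplicates
fruits_set = {'apple', 'banana', 'apple'}     # set: unordered, duplicates are dropped
prices = {'apple': 1.2, 'banana': 0.5}        # dict: maps unique keys to values

fruits_list.append('cherry')   # a list can grow in place
prices['cherry'] = 3.0         # new keys are appended in insertion order (Python 3.7+)

try:
    fruits_tuple[0] = 'kiwi'   # a tuple cannot be modified
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

print(len(fruits_list), len(fruits_set), list(prices), tuple_is_immutable)
```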


Numpy

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and related tools for working with arrays.

Import convention:

import numpy as np

Creating Numpy Arrays from Python Lists

Let’s try to create the following arrays from lists. Axes are the dimensions of the array: in a 2D array, axis 0 runs down the rows and axis 1 runs across the columns.


# 1D array
a = np.array([2, 5, 6, 9])  

# 2D array
b = np.array([
    [3.5, 4.0, 6.5], 
    [0.4, 0.9, 4.7],
])

# 3D array
c = np.array([
    [
        [7, 1],
        [9, 4],
        [2, 3],
    ],
    [
        [4, 5],
        [0, 6],
        [8, 0],
    ],
    [
        [1, 3],
        [3, 9],
        [6, 9],
    ],
    [
        [5, 8],
        [8, 8],
        [4, 7],
    ],
])
print(a.ndim, a.size, a.shape, a.dtype)
print(b.ndim, b.size, b.shape, b.dtype)
print(c.ndim, c.size, c.shape, c.dtype)

1 4 (4,) int64
2 6 (2, 3) float64
3 24 (4, 3, 2) int64


Load from Text File

Numbers can be loaded from a text file using np.loadtxt().

For example, the 3d array above can be loaded from numbers.txt.zip.

By default, the delimiter is whitespace. The example below shows how to specify comma as the delimiter.

a = np.loadtxt('numbers.txt', delimiter=',')
a = a.reshape(4, 3, 2)
print(a)

[[[7. 1.]
  [9. 4.]
  [2. 3.]]

 [[4. 5.]
  [0. 6.]
  [8. 0.]]

 [[1. 3.]
  [3. 9.]
  [6. 9.]]

 [[5. 8.]
  [8. 8.]
  [4. 7.]]]
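To try np.loadtxt() without downloading a file, the following sketch reads from an in-memory string instead. The text content is a made-up stand-in, not the actual contents of numbers.txt.

```python
import io
import numpy as np

# Hypothetical stand-in for a text file: two comma-separated rows.
text = "7,1,9,4\n2,3,4,5\n"

arr = np.loadtxt(io.StringIO(text), delimiter=',')              # values are floats by default
ints = np.loadtxt(io.StringIO(text), delimiter=',', dtype=int)  # request integers instead

print(arr.shape, arr.dtype)
print(ints.reshape(4, 2))  # same 8 numbers, rearranged into 4 rows of 2
```

Passing dtype is optional; without it, np.loadtxt() returns floats, which is why the array printed above shows values such as 7. rather than 7.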


Numpy Array vs. Python List

# element-wise computing
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(a + 2)
print(a + b)

[3 4 5 6]
[ 6  8 10 12]

a = np.array([1, 3, 5, 9])
b = np.array([
    [3.5, 4.0, 6.5], 
    [0.4, 0.9, 4.7],
])
print(a.sum(), a.min(), a.max(), a.mean(), np.median(a), np.std(a))
print(b.sum())
# axis=0 sums down the rows (one total per column); axis=1 sums across the columns (one total per row)
print(b.sum(axis=0), b.sum(axis=1))

18 1 9 4.5 4.0 2.958039891549808
20.0
[ 3.9  4.9 11.2] [14.  6.]

For list:

# slices of a Python list are copies of the list
# changing slices DOES NOT change the original list
a = [1, 2, 3, 4, 5]
b = a[2:4]
print(a, b)

b[1] = 9  # change b
print(b)
print(a)  # a does not change

[1, 2, 3, 4, 5] [3, 4]
[3, 9]
[1, 2, 3, 4, 5]

For Numpy array:

# slices of a Numpy array are views of the array
# changing the slices will change the original array!!!
c = np.array([1, 2, 3, 4, 5])
d = c[2:4]  # d is a view of c
print(d)

d[1] = 9  # change d
print(d)
print(c)  # c changed!

[3 4]
[3 9]
[1 2 3 9 5]

Use np.copy() to make a copy of a Numpy array:

c = np.array([1, 2, 3, 4, 5])
f = np.copy(c)  # make a copy of c
f = f[2:4]
print(f)
f[1] = 20  # change f
print(f)
print(c)  # c is not changed

[3 4]
[ 3 20]
[1 2 3 4 5]
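If it is unclear whether two arrays share data, np.shares_memory() and the .base attribute can confirm it. A small sketch, with variable names chosen for illustration:

```python
import numpy as np

c = np.array([1, 2, 3, 4, 5])
view = c[2:4]              # basic slicing returns a view of c
independent = c[2:4].copy()  # an explicit copy owns its own data

print(np.shares_memory(c, view))         # the view shares memory with c
print(np.shares_memory(c, independent))  # the copy does not
print(view.base is c)                    # a view remembers the array it came from
```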


Initial Placeholder Arrays

Create evenly spaced values by step value: (start, stop, step):

>>> np.arange(1, 10, 2)
array([1, 3, 5, 7, 9])

Create evenly spaced values by number of samples: (start, stop, num-of-samples):

>>> np.linspace(1, 10, 5)
array([ 1.  ,  3.25,  5.5 ,  7.75, 10.  ])

Create arrays with random numbers:

>>> np.random.random((2, 3))
array([[0.65107703, 0.91495968, 0.85003858],
       [0.44945067, 0.09541012, 0.37081825]])

Create arrays with zeros and ones:

np.zeros((2, 3))  # 2 rows 3 columns
np.ones((2, 3, 4))  # 3D array with all ones
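One detail worth checking when choosing between np.arange() and np.linspace() is how each treats the stop value. The sketch below also shows a seeded random generator for reproducible numbers; the seed 42 is an arbitrary choice for illustration.

```python
import numpy as np

stepped = np.arange(1, 9, 2)     # stop (9) is excluded
sampled = np.linspace(1, 9, 5)   # stop (9) is included
print(stepped, sampled)

# Seeding a generator makes the "random" numbers repeatable.
r1 = np.random.default_rng(seed=42).random((2, 3))
r2 = np.random.default_rng(seed=42).random((2, 3))
print(np.array_equal(r1, r2))    # same seed, same numbers

sevens = np.full((2, 3), 7)      # like zeros/ones, but with a chosen fill value
print(sevens)
```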

See more: DataCamp Numpy Cheatsheet


Pandas

Pandas is a fast, powerful, and flexible data analysis and manipulation tool.

Use the following import convention:

import pandas as pd

Pandas Series

A pandas Series is a one-dimensional labeled array (a Numpy array is indexed by position only, not by label):

# a series is an array with labeled index, default to 0, 1, 2...
s = pd.Series([7, 9, 8])

0    7
1    9
2    8
dtype: int64

You can specify index labels:

# series is array with labeled index
s = pd.Series([7, 9, 8], index=['a', 'b', 'c'])

a    7
b    9
c    8
dtype: int64

You can use the label to access the element, e.g. s['c']
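A short sketch of the different ways to reach elements of a labeled Series, reusing the values from the example above:

```python
import pandas as pd

s = pd.Series([7, 9, 8], index=['a', 'b', 'c'])

by_label = s['c']          # access by label
by_loc = s.loc['c']        # explicit label-based access
by_position = s.iloc[2]    # explicit position-based access
several = s[['a', 'c']]    # a list of labels returns a smaller Series
large = s[s > 7]           # a boolean mask keeps matching elements

print(by_label, by_loc, by_position)
print(several.tolist(), large.index.tolist())
```

Using .loc[] and .iloc[] makes it explicit whether a label or a position is meant, which helps when the labels are themselves integers.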


Pandas DataFrame

A pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Create a pandas dataframe from a Python dictionary:

# column names by default are keys in the dict
countries = {
    'country': ['USA', 'China', 'Japan'],
    'gdp': [20.49, 13.40, 4.97],
    'population':[331, 1439, 126],
}
df = pd.DataFrame(countries)

  country    gdp  population
0     USA  20.49         331
1   China  13.40        1439
2   Japan   4.97         126

Choose certain columns from keys:

df = pd.DataFrame(countries, columns=['country', 'gdp'])

  country    gdp
0     USA  20.49
1   China  13.40
2   Japan   4.97

Read data from a CSV (comma-separated values) file (countries-2021.csv.zip):

The data file lists the top 10 countries by GDP (in trillions) in 2021, with population (in millions), area (in million square kilometers, $km^2$), capital, and continent.

df = pd.read_csv('countries-2021.csv')
df

          country    gdp  population  area          capital      continent
0   United States  20.49      331.00  9.53  WASHINGTON D.C.  North America
1           China  13.40     1439.32  9.60          Beijing           Asia
2           Japan   4.97      126.48  0.38            Tokyo           Asia
3         Germany   4.00       83.78  0.36           Berlin         Europe
4  United Kingdom   2.83       67.89  0.24           London         Europe
5          France   2.78       65.27  0.64            Paris         Europe
6           India   2.72     1380.00  3.29        New Delhi           Asia
7           Italy   2.07       60.46  0.30             Rome         Europe
8          Brazil   1.87      212.56  8.52         Brasilia  South America
9          Canada   1.71       37.74  9.98           Ottawa  North America

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     10 non-null     object 
 1   gdp         10 non-null     float64
 2   population  10 non-null     float64
 3   area        10 non-null     float64
 4   capital     10 non-null     object 
 5   continent   10 non-null     object 
dtypes: float64(3), object(3)
memory usage: 608.0+ bytes

df.describe()

             gdp   population      area
count  10.000000    10.000000  10.00000
mean    5.684000   380.450000   4.28400
std     6.243788   549.780564   4.51295
min     1.710000    37.740000   0.24000
25%     2.232500    65.925000   0.36500
50%     2.805000   105.130000   1.96500
75%     4.727500   301.390000   9.27750
max    20.490000  1439.320000   9.98000

df.head()

           country    gdp  population  area          capital      continent
0   United States  20.49      331.00  9.53  WASHINGTON D.C.  North America
1           China  13.40     1439.32  9.60          Beijing           Asia
2           Japan   4.97      126.48  0.38            Tokyo           Asia
3         Germany   4.00       83.78  0.36           Berlin         Europe
4  United Kingdom   2.83       67.89  0.24           London         Europe


Selecting data in DataFrames using []

# slice by rows
df[1:3]

  country    gdp  population  area  capital continent
1   China  13.40     1439.32  9.60  Beijing      Asia
2   Japan   4.97      126.48  0.38    Tokyo      Asia

# select one column
df['country']  # same as df.country

0     United States
1             China
2             Japan
3           Germany
4    United Kingdom
5            France
6             India
7             Italy
8            Brazil
9            Canada
Name: country, dtype: object

# select multiple columns
df[['country','gdp']]

          country    gdp
0   United States  20.49
1           China  13.40
2           Japan   4.97
3         Germany   4.00
4  United Kingdom   2.83
5          France   2.78
6           India   2.72
7           Italy   2.07
8          Brazil   1.87
9          Canada   1.71

Selecting data in DataFrames using .loc[] and .iloc[]

[] can only do simple selections by rows and columns. More complicated selections can be done with .iloc[] (by integer position) and .loc[] (by label):

# select rows with index 1, 4, and 6
df.iloc[[1, 4, 6]]

# here 1 and 4 are treated as integer positions
# position 4 is NOT included, i.e. UK
df.iloc[1:4, :2]

   country    gdp
1    China  13.40
2    Japan   4.97
3  Germany   4.00

# here 1 and 4 are treated as labels
# row 4 is included, i.e. UK
df.loc[1:4, ['country', 'gdp']]  

          country    gdp
1           China  13.40
2           Japan   4.97
3         Germany   4.00
4  United Kingdom   2.83
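The difference between .loc[] and .iloc[] is easier to see when the index is not 0, 1, 2... The sketch below builds a small DataFrame with country names as the index; df_small is a hypothetical name and the values are taken from the dictionary example earlier.

```python
import pandas as pd

# Hypothetical small frame indexed by country name instead of 0, 1, 2...
df_small = pd.DataFrame(
    {'gdp': [20.49, 13.40, 4.97], 'population': [331, 1439, 126]},
    index=['USA', 'China', 'Japan'],
)

by_label = df_small.loc['China':'Japan', 'gdp']  # label slice: both ends are included
by_position = df_small.iloc[1:2, 0]              # position slice: the end is excluded

print(by_label.tolist())
print(by_position.tolist())
```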


Filtering Rows using Masking Expressions

# returns countries with GDP greater than 3 trillion
df[df['gdp'] > 3]

         country    gdp  population  area          capital      continent
0  United States  20.49      331.00  9.53  WASHINGTON D.C.  North America
1          China  13.40     1439.32  9.60          Beijing           Asia
2          Japan   4.97      126.48  0.38            Tokyo           Asia
3        Germany   4.00       83.78  0.36           Berlin         Europe

# returns countries with GDP greater than 2 trillion and population less than 100 million
df[(df['gdp'] > 2) & (df['population'] < 100)]

          country   gdp  population  area capital continent
3         Germany  4.00       83.78  0.36  Berlin    Europe
4  United Kingdom  2.83       67.89  0.24  London    Europe
5          France  2.78       65.27  0.64   Paris    Europe
7           Italy  2.07       60.46  0.30    Rome    Europe
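Masks can also be combined with | (or), negated with ~ (not), or built with .isin(). A minimal sketch on a hypothetical four-row subset of the data (df_small is an illustrative name):

```python
import pandas as pd

df_small = pd.DataFrame({
    'country': ['USA', 'China', 'Japan', 'Germany'],
    'gdp': [20.49, 13.40, 4.97, 4.00],
    'population': [331.00, 1439.32, 126.48, 83.78],
})

# each condition needs its own parentheses because & | ~ bind tightly
big_or_small_pop = df_small[(df_small['gdp'] > 10) | (df_small['population'] < 100)]
not_big = df_small[~(df_small['gdp'] > 10)]
picked = df_small[df_small['country'].isin(['Japan', 'USA'])]

print(big_or_small_pop['country'].tolist())
print(not_big['country'].tolist())
print(picked['country'].tolist())
```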

Derived Columns

You can derive new columns from existing ones:

# calculate GDP per capita (GDP divided by population)
# a trillion has 12 zeros, a million has 6 zeros
df['gdp_per_capita'] = df.gdp * 1000000 / df.population
df

          country    gdp  population  area          capital  gdp_per_capita
0   United States  20.49      331.00  9.53  WASHINGTON D.C.    61903.323263
1           China  13.40     1439.32  9.60          Beijing     9309.951922
2           Japan   4.97      126.48  0.38            Tokyo    39294.750158
3         Germany   4.00       83.78  0.36           Berlin    47744.091669
4  United Kingdom   2.83       67.89  0.24           London    41685.078804
5          France   2.78       65.27  0.64            Paris    42592.308871
6           India   2.72     1380.00  3.29        New Delhi     1971.014493
7           Italy   2.07       60.46  0.30             Rome    34237.512405
8          Brazil   1.87      212.56  8.52         Brasilia     8797.515995
9          Canada   1.71       37.74  9.98           Ottawa    45310.015898


Pandas Functions

df.max()

country             United States
gdp                         20.49
population                1439.32
area                         9.98
capital           WASHINGTON D.C.
continent           South America

df.continent.value_counts()  # this is a Series

Europe           4
Asia             3
North America    2
South America    1
Name: continent, dtype: int64

df.continent.value_counts()['Asia']

3

df.continent.value_counts(normalize=True)

Europe           0.4
Asia             0.3
North America    0.2
South America    0.1
Name: continent, dtype: float64

df.nunique()

country           10
gdp               10
population        10
area              10
capital           10
continent          4
dtype: int64


Sorting

.sort_values() returns a new DataFrame sorted by the given column(s); the original DataFrame is unchanged.

# after the following line the df is NOT changed
# the index is NOT reset 
df.sort_values(by='population', ascending=False)

          country    gdp  population  area          capital      continent
1           China  13.40     1439.32  9.60          Beijing           Asia
6           India   2.72     1380.00  3.29        New Delhi           Asia
0   United States  20.49      331.00  9.53  WASHINGTON D.C.  North America
8          Brazil   1.87      212.56  8.52         Brasilia  South America
2           Japan   4.97      126.48  0.38            Tokyo           Asia
3         Germany   4.00       83.78  0.36           Berlin         Europe
4  United Kingdom   2.83       67.89  0.24           London         Europe
5          France   2.78       65.27  0.64            Paris         Europe
7           Italy   2.07       60.46  0.30             Rome         Europe
9          Canada   1.71       37.74  9.98           Ottawa  North America

df1 = df.sort_values(by='population', ascending=False, ignore_index=True)
df1  # the index is reset 

          country    gdp  population  area          capital      continent
0           China  13.40     1439.32  9.60          Beijing           Asia
1           India   2.72     1380.00  3.29        New Delhi           Asia
2   United States  20.49      331.00  9.53  WASHINGTON D.C.  North America
3          Brazil   1.87      212.56  8.52         Brasilia  South America
4           Japan   4.97      126.48  0.38            Tokyo           Asia
5         Germany   4.00       83.78  0.36           Berlin         Europe
6  United Kingdom   2.83       67.89  0.24           London         Europe
7          France   2.78       65.27  0.64            Paris         Europe
8           Italy   2.07       60.46  0.30             Rome         Europe
9          Canada   1.71       37.74  9.98           Ottawa  North America
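Sorting also works on several columns at once, with a separate direction for each. The sketch below uses a hypothetical four-row subset (df_small) and also shows .nlargest() as a shortcut for "top N" questions.

```python
import pandas as pd

df_small = pd.DataFrame({
    'country': ['Japan', 'Germany', 'China', 'India'],
    'continent': ['Asia', 'Europe', 'Asia', 'Asia'],
    'gdp': [4.97, 4.00, 13.40, 2.72],
})

# continent A-Z first, then GDP from high to low within each continent
df_sorted = df_small.sort_values(
    by=['continent', 'gdp'], ascending=[True, False], ignore_index=True
)
top2 = df_small.nlargest(2, 'gdp')  # the two rows with the largest GDP

print(df_sorted)
print(top2['country'].tolist())
```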


GroupBy

.groupby() can be used to group the data by different values of one or more categorical columns, then various aggregation functions can be applied to each group.

df.groupby('continent')  # this returns a DataFrameGroupBy object

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11bff9310>

Sum of GDP, Population, and Area for countries in each continent:

df.groupby('continent').sum(numeric_only=True)  # numeric_only=True skips the text columns (needed on pandas 2.0+)

                 gdp  population   area
continent                                              
Asia           21.09     2945.80  13.27
Europe         11.68      277.40   1.54
North America  22.20      368.74  19.51
South America   1.87      212.56   8.52

# this returns the max in each group for each column independently
# for strings, max() returns the value that sorts last alphabetically
# so one row can mix values from different countries: Japan's GDP is NOT 13.40
df.groupby('continent').max()

                     country    gdp  population  area          capital 
continent                                                                 
Asia                    Japan  13.40     1439.32  9.60            Tokyo   
Europe         United Kingdom   4.00       83.78  0.64             Rome   
North America   United States  20.49      331.00  9.98  WASHINGTON D.C.   
South America          Brazil   1.87      212.56  8.52         Brasilia

Iterating over df.groupby('continent') yields tuples with two elements: the group name and a DataFrame containing the rows of that group.

The following example shows how to loop over the groups:

for group, group_df in df.groupby('continent'):
    print(group)
    print('#' * 15)
    print(group_df)
    print('-' * 60)

Asia
###############
  country    gdp  population  area    capital continent
1   China  13.40     1439.32  9.60    Beijing      Asia
2   Japan   4.97      126.48  0.38      Tokyo      Asia
6   India   2.72     1380.00  3.29  New Delhi      Asia
------------------------------------------------------------
Europe
###############
          country   gdp  population  area capital continent
3         Germany  4.00       83.78  0.36  Berlin    Europe
4  United Kingdom  2.83       67.89  0.24  London    Europe
5          France  2.78       65.27  0.64   Paris    Europe
7           Italy  2.07       60.46  0.30    Rome    Europe
------------------------------------------------------------
North America
###############
         country    gdp  population  area          capital      continent
0  United States  20.49      331.00  9.53  WASHINGTON D.C.  North America
9         Canada   1.71       37.74  9.98           Ottawa  North America
------------------------------------------------------------
South America
###############
  country   gdp  population  area   capital      continent
8  Brazil  1.87      212.56  8.52  Brasilia  South America
------------------------------------------------------------
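Two further groupby patterns may be useful here: .agg() with named aggregations to compute different statistics per column, and .idxmax() to fetch the complete row with the largest value in each group (which avoids the mixed-row issue of .max()). This is a sketch on a hypothetical four-row subset; the output column names are made up for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({
    'country': ['China', 'Japan', 'Germany', 'France'],
    'continent': ['Asia', 'Asia', 'Europe', 'Europe'],
    'gdp': [13.40, 4.97, 4.00, 2.78],
    'population': [1439.32, 126.48, 83.78, 65.27],
})

# named aggregation: new_column=(source_column, function)
summary = df_small.groupby('continent').agg(
    total_gdp=('gdp', 'sum'),
    max_population=('population', 'max'),
    n_countries=('country', 'count'),
)

# idxmax() gives the row label of the largest GDP in each group;
# .loc[] then returns those complete rows
richest = df_small.loc[df_small.groupby('continent')['gdp'].idxmax()]

print(summary)
print(richest)
```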

Deletion

.drop() can be used to delete rows or columns (use inplace=True if you want the df to be altered):

df.drop([1, 3, 4])  # delete discrete rows with index 1, 3, 4, default axis=0
df.drop(df.index[3:5])  # delete middle rows
df.drop(['gdp', 'area'], axis=1)  # delete 'gdp' and 'area' columns
df.drop(['population'], axis=1, inplace=True)  # delete 'population' column in place

.reset_index(drop=True) can be used to reset the index from 0 after deletion if needed. drop=True will delete the old index.
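A small sketch of the effect on the index, using a hypothetical four-row frame (df_small):

```python
import pandas as pd

df_small = pd.DataFrame({
    'country': ['USA', 'China', 'Japan', 'Germany'],
    'gdp': [20.49, 13.40, 4.97, 4.00],
})

dropped = df_small.drop([1, 2])                # new DataFrame; the old labels 0 and 3 remain
renumbered = dropped.reset_index(drop=True)    # labels restart at 0

print(dropped.index.tolist(), renumbered.index.tolist())
print(len(df_small))  # the original is untouched because inplace was not used
```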


Matplotlib

Matplotlib is a comprehensive library for creating visualizations in Python. Version 3.5.1 is used for this tutorial.

Check the version:

import matplotlib
matplotlib.__version__

Basic Matplotlib Concepts

The following picture shows the basic Matplotlib concepts:

matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB.

import matplotlib.pyplot as plt

Create a figure with one axes:

fig, ax = plt.subplots()  # Create a figure containing a single axes.
x = [1, 2, 3, 4]
y = [1, 4, 2, 3]
ax.plot(x, y);  # Plot some data on the axes.

The next example illustrates style sheets, line styles, markers, colors, LaTeX legend labels, a title, and axis labels:

import numpy as np
plt.style.use('seaborn')  # seaborn style, which looks better than the default one
# on Matplotlib 3.6+ this style is named 'seaborn-v0_8'
# plt.style.use('default')  # reset to default style

x = np.linspace(1, 10, 20)  # return 20 evenly spaced numbers between 1 and 10
y1 = 2 * x + 3  # a linear function
y2 = x**2 - 5 * x + 3  # a quadratic function

fig, ax = plt.subplots(figsize=(5, 5))  # one axes in a 5 x 5 inch figure
ax.plot(x, y1, label='y = 2x + 3')
ax.plot(x, 
        y2, 
        label='$y = x^2 - 5x + 3$',  # legend label using Latex
        linestyle='dashed',  # line style
        marker='s',  # marker style
        color='red', # line color
        )

ax.set_title('Two Functions')  # plot title
ax.set_xlabel('x')  # x label
ax.set_ylabel('y')  # y label
# ax.set_xticks([1, 3, 7, 8], ['a', 'b', 'c', 'd'])  # change the x ticks and labels
ax.legend()  # show legend


Multiple Axes

A figure can have multiple axes:

fig, ax = plt.subplots(2)  # two rows
ax[0].plot(x, y1)
ax[1].plot(x, y2)

fig, ax = plt.subplots(1, 2)  # one row, two columns
ax[0].plot(x, y1)
ax[1].plot(x, y2)

fig, ax = plt.subplots(2, 3)  # 2 rows and 3 columns
ax[0, 1].plot(x, y1)
ax[1, 2].plot(x, y2)


Pandas Charting

Pandas charting is built on top of Matplotlib.

I will use the Country GDP dataset (countries-2021.csv.zip) and the California housing dataset (housing.csv.zip) to illustrate the charts. You can learn more about the datasets from my Kaggle dataset page.

We introduce the following plots: line plot, bar plot, histogram, boxplot, and scatter plot.

Line Plot

.plot() (same as .plot.line()) creates a line plot of all numerical columns, with the row index on the x-axis:

import pandas as pd
df_country = pd.read_csv('countries-2021.csv')
df_country.plot()

df_country.plot() is the same as the following (the figure and axes are created implicitly):

fig, ax = plt.subplots()
df_country.plot(ax=ax)

Change the style sheet and only plot one column (country area) with custom ticks:

import matplotlib.pyplot as plt
plt.style.use('seaborn')

df_country.area.plot().set_xticks(df_country.index, df_country.country, rotation=60)

Create a new sorted DataFrame and plot the area with tick labels (ignore_index=True resets the index to start from 0):

df_sorted_area = df_country.sort_values(by='area', ignore_index=True)
df_sorted_area.area.plot().set_xticks(df_sorted_area.index, df_sorted_area.country, rotation=60)

Read the California housing dataset:

import pandas as pd
df_housing = pd.read_csv('housing.csv')
df_housing.info()

RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)

Create a line plot of the median house value. What patterns can you see in this line chart?

df_housing.median_house_value.plot()


Bar Plot

A bar plot shows a bar for each data point (whereas a line plot uses lines to connect the data points).

We can show the DataFrame on countries sorted by area as a bar plot:

df_sorted_area.area.plot.bar().set_xticks(df_sorted_area.index, df_sorted_area.country, rotation=60)


Histogram

.hist() (similar to .plot.hist()) is often used to create histograms, which show how the values of a column are distributed.

bins=50 can be used to change the number of bins; the default is 10.

# numerical data
df_housing.median_income.hist(bins=50)

# categorical data
df_country.continent.hist()


Boxplot

A box plot (.plot.box()) shows the median, the quartiles, the minimum and maximum, and any outliers.

The box plot is constructed in the following steps:

  1. find the median (the middle value, also called Q2 or the 50th percentile) of the dataset and divide the data into a lower half (smaller numbers) and an upper half (larger numbers)
  2. find the median of the lower half, which is called Q1 (first quartile/25th percentile)
  3. find the median of the upper half, which is called Q3 (third quartile/75th percentile)
  4. calculate the IQR (interquartile range): IQR = Q3 - Q1
  5. identify outliers (if any): data points below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are considered outliers
  6. remove the outliers; the minimum/maximum is the smallest/largest number in the remaining dataset

The following example shows how a boxplot is constructed step by step:

  1. 13 numbers in total, median is 6
  2. Q1 is 3: the median of [-10, 1, 2, 3, 4, 5, 6] (the lower half, with the overall median included)
  3. Q3 is 9: the median of [6, 7, 8, 9, 10, 20, 22] (the upper half, with the overall median included)
  4. IQR is 9 - 3 = 6 and 1.5 IQR is 1.5 * 6 = 9
  5. the outlier limits are -6 (Q1 - 9 = 3 - 9) and 18 (Q3 + 9 = 9 + 9), so -10, 20, 22 are outliers
  6. minimum/maximum is then 1/10

s = pd.Series([-10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 22])
fig, ax = plt.subplots(figsize=(4, 6))
s.plot.box()
ax.set_yticks(s)
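The numbers behind this boxplot can be checked without plotting. The sketch below assumes the common rule, which Matplotlib uses by default, that outliers lie more than 1.5 IQR below Q1 or above Q3; the variable names are illustrative.

```python
import numpy as np

data = np.array([-10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 22])

q1, median, q3 = np.percentile(data, [25, 50, 75])  # quartiles by linear interpolation
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

is_outlier = (data < lower_limit) | (data > upper_limit)
outliers = data[is_outlier]
remaining = data[~is_outlier]

print(q1, median, q3, iqr)
print(lower_limit, upper_limit)
print(outliers.tolist(), remaining.min(), remaining.max())
```

The whiskers of the boxplot end at remaining.min() and remaining.max(), and the outliers are drawn as individual points.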

The following example also shows how to have multiple axes for different pandas charts:

fig, ax = plt.subplots(1, 3, figsize=(20,8))
df_housing.median_income.plot.box(ax=ax[0])  # income
df_housing.median_house_value.plot.box(ax=ax[1])  # house price
df_housing.housing_median_age.plot.box(ax=ax[2])  # house age

You can also show the boxplots for one feature/column for different groups (like using groupby):

df_housing.boxplot(column=['median_house_value'], by='ocean_proximity', figsize=(10, 8), rot=60)


Scatter Plot

A scatter plot (.plot.scatter()) is often used for correlation analysis between features. The correlation coefficient lies between -1 and 1: negative values indicate a negative correlation, positive values indicate a positive correlation, and 0 means there is no linear correlation. A correlation is linear if the rate of change between the two features is constant; otherwise it is non-linear.
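To make the coefficient concrete, the sketch below computes it with np.corrcoef() for three made-up relationships. The data is synthetic and chosen so that the results are easy to verify by hand.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 3          # rises at a constant rate
y_down = -x               # falls at a constant rate
y_curve = (x - 3) ** 2    # symmetric U-shape around x = 3

# np.corrcoef returns a 2 x 2 matrix; entry [0, 1] is the coefficient between the two inputs
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_curve = np.corrcoef(x, y_curve)[0, 1]

print(round(r_up, 3), round(r_down, 3), abs(round(r_curve, 3)))
```

The U-shaped data has a coefficient of 0 even though y depends entirely on x, which is why a scatter plot is worth looking at alongside the number. For a DataFrame, df.corr(numeric_only=True) computes the same coefficient for every pair of numerical columns.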

What insights can you get from the following scatter plot?

df_housing.plot.scatter(x='median_income', y='median_house_value')

Create a scatter matrix (histogram plots in the diagonal):

pd.plotting.scatter_matrix(df_housing, figsize=(25,25))

