Python #4: Pandas


This is the fourth in a series of Python notes I made during the Kubrick Data Engineering training course.
#1: Basics
#2: Advanced
#3: Scraping
#4: Pandas
#5: Matplotlib


Pandas is an open-source Python library for data handling and analysis built on top of NumPy. Among many other features, it allows for flexible table manipulation.


Series

Pandas series are 1-dimensional arrays of indexed data of a single data type.

import pandas as pd

s = pd.Series([1,2,3], index = ['a','b','c']) # Construct series explicitly
s = pd.Series({'a':1, 'b':2, 'c':3}) # Construct series from dictionary

type(s.values)
# numpy.ndarray

Indexes

A Pandas index is an object in its own right and can be thought of as an immutable array. Indexes can be created on their own or as part of a Series or DataFrame.

idx = pd.Index([1,2,3])
print(idx)
# Int64Index([1, 2, 3], dtype='int64')

Pandas also includes DatetimeIndex:

from datetime import datetime
pd.Index([datetime(2019,1,1), datetime(2019,1,2)])
# DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)


DataFrames

DataFrames are 2-dimensional arrays indexed by rows and columns, where each column is a Series object. A DataFrame can be built relatively easily from many other types of tabular data object, as shown in the table below.

Data source             Format                                          Code
Series                  s = pd.Series([[1,2],[3,4]])                    df = pd.DataFrame(s)
List of lists           s = [[1, 2],[3, 4]]                             df = pd.DataFrame(s, columns=['a', 'b'])
List of dictionaries    s = [{'a':1, 'b':2}, {'a':3, 'b':4}]            df = pd.DataFrame(s)
Dictionary of lists     s = {'a':[1,3], 'b':[2,4]}                      df = pd.DataFrame(s)
Dictionary of series    s1 = pd.Series([1,3]); s2 = pd.Series([2,4])    df = pd.DataFrame({'a':s1, 'b':s2})

Basic information about a DataFrame or Series can be returned using df.index, df.columns, df.values, & df.shape.
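For example, on a small hypothetical DataFrame:

df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})

df.index    # RangeIndex(start=0, stop=2, step=1)
df.columns  # Index(['a', 'b'], dtype='object')
df.values   # array([[1, 2], [3, 4]])
df.shape    # (2, 2)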


Indexing

.loc

loc is the explicit indexing method: it references the actual index value ('row1', '2019-01-01', etc.). Use s.loc[1] for a Series or df.loc[1, 'col1'] for a DataFrame.

.iloc

iloc is the implicit indexing method: it references the positional order (0, 1, 2, 3, ...). Use s.iloc[0] for a Series or df.iloc[0, 0] for a DataFrame.
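A quick sketch of the two methods side by side, using a hypothetical two-row DataFrame:

df = pd.DataFrame({'col1': [10, 20], 'col2': [30, 40]}, index=['row1', 'row2'])

df.loc['row1', 'col1']   # 10 - explicit: index label and column name
df.iloc[0, 0]            # 10 - implicit: positional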

Slicing

loc and iloc can also be used with slices; the output depends on which is used.

s = pd.Series(['a', 'b', 'c', 'd'], index=[1,2,3,4])

print(s.loc[1:3])   # When explicit indexing, value at 'stop' is included
# 1  a
# 2  b
# 3  c

print(s.iloc[1:3])   # When implicit indexing, value at 'stop' is NOT included (normal python)
# 2  b
# 3  c

DataFrame columns can also be selected by passing a list of column names, e.g. df[colslist].
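For example, assuming df has columns 'a', 'b', and 'c':

cols = ['a', 'b']
df[cols]   # DataFrame containing only columns 'a' and 'b'
df['a']    # A single column name returns a Series instead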

Boolean masks

Boolean masks are arrays of truth values used to filter data; they can be defined explicitly or derived from a comparison expression.

s = pd.Series([1, 2, 3, 4, 5]) # Example numeric series (the masks below assume this)

mask = [True, True, False, False, True] # Explicit definition
mask = (s >= 3) # Operator definition
mask = s.isin([1, 4]) # Mask using .isin with a list

s[mask] # Apply to a Series or DataFrame in square brackets
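Masks can also be combined with the element-wise boolean operators & (and) and | (or); a short sketch using the same series:

s[(s >= 2) & (s <= 4)]   # Wrap each condition in parentheses when combining
# 1    2
# 2    3
# 3    4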

Manipulating data

Missing Data

df.dropna(axis=0) # Drop all rows with nan
df['c1'].isnull() # Returns boolean series showing which locations are null
df.fillna(999) # Fill all null values with 999
df.fillna(method='ffill')  # Replaces nulls with previous value in column (also 'bfill')
df.loc[df['c1'].isnull(), 'c1'] = 999 # For all rows where c1 is null, change null to 999
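A minimal sketch of these methods on a hypothetical DataFrame with missing values:

import numpy as np

df = pd.DataFrame({'c1': [1, np.nan, 3], 'c2': [4, 5, np.nan]})

df['c1'].isnull()   # False, True, False
df.fillna(999)      # All NaNs replaced with 999
df.dropna(axis=0)   # Only the first row remains (the others contain NaN)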

Combining DataFrames

DataFrames can be combined using pd.concat(), .append(), or .merge().

pd.concat([df1, df2], axis=0, sort=True) # Concat rows (UNION)
pd.concat([df1, df2], axis=1, sort=True) # Concat columns (JOIN)

df1.append(df2, sort=True) # Append rows (same output as concat rows)

pd.merge(df1, df2, on='a', how='inner') # Inner join on column 'a'
pd.merge(df1, df2, left_on='a', right_on='c', how='inner', suffixes=['_left', '_right']) # Inner join on df1['a'] = df2['c']

Sorting

DataFrames can be sorted by index or by column values.

df.sort_index()  # Index sort
df.sort_values(['a', 'b'], ascending=[True, False])  # Sort on col a ascending then col b descending

Operations

Universal Functions

Pandas uses NumPy universal functions (ufuncs), which offer fast, index-aligned, element-wise operations on Pandas objects.

# Operate on each value in a column:
df['col1'] + 3  

# Operate on each value in a DataFrame:
df + 3

# .apply applies a function to each value referenced; it's much slower than ufuncs.
# (add3func is assumed to be any simple element-wise function, e.g. def add3func(x): return x + 3)
df['col1'].apply(add3func)

# If using .apply on a DataFrame, the axis should be specified:
df.apply(add3func, axis=1)
# axis=1 -> the function receives each row (operates across columns)
# axis=0 -> the function receives each column (operates across rows)

# Perform series wise operation (indexes will be aligned before operation):
df['col1'] * df['col2']

# Perform DataFrame wise operation:
df + 2*df

# Perform column wise operation on DataFrame:
df + [1,2]

# DataFrame aggregation
df.sum(axis=0)  # Sum each column 

Aggregation Functions

Pandas includes all the standard aggregation functions: .mean(), .median(), .quantile(q), and .sum(). When using these on a DataFrame, the axis can be specified (it defaults to axis=0, aggregating each column).
Additional Pandas aggregation functions include .describe(), which gives all the key aggregate stats for a DataFrame.
.T returns the transpose of a DataFrame.
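For example (assuming a numeric DataFrame df):

df.mean(axis=0)    # Mean of each column
df.mean(axis=1)    # Mean of each row
df.describe()      # count, mean, std, min, quartiles & max for each numeric column
df.describe().T    # Same stats transposed: one row per column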

Grouping

Grouping allows for more specific aggregation statements on particular keys and columns. The main syntax is .groupby(<groupcol>), which groups all rows with the same <groupcol> value; <groupcol> is then set as the new index unless .reset_index() is called on the result.

import numpy as np # Needed for the np.sum / np.min aggregations below

df.groupby('key').sum() # Group by 'key' column and sum other columns for those groups
df.groupby('key')['value1'].sum() # Sum groups only for 'value1' column
df.groupby('key').agg(['sum', 'mean']) # Creates columns for each aggregation under each existing column
df.groupby('key').agg({'value1':np.sum, 'value2':np.min}) # Apply sum to value1 col and min to value2 column
df.groupby('key').filter(lambda x: max(x['value1']) > 5)  # Gives the whole group if that group contains a max value > 5
pd.concat([df, df.groupby('key').transform('mean')], axis=1) # Displays the mean for the group in all group values (like an SQL window function)

Pandas also supports pivot tables which work like .groupby but on 2 or more dimensions.

# Group by gender and is child, aggregate on height
df.pivot_table(index='gender', columns='is_child', values='height', aggfunc=np.mean)

# Equivalent using groupby
df.groupby(['gender', 'is_child']).mean()

String Methods

As well as the standard .len(), .upper(), & .lower(), Pandas includes .startswith() & .endswith(), which return boolean masks, and .split(), which splits each string in a series into a list. On a Series these are all accessed through the .str accessor.
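A short sketch of the .str accessor on a hypothetical series of strings:

s = pd.Series(['Alpha Beta', 'Gamma Delta'])

s.str.len()                 # 10, 11
s.str.lower()               # 'alpha beta', 'gamma delta'
s.str.startswith('Alpha')   # True, False
s.str.split()               # ['Alpha', 'Beta'], ['Gamma', 'Delta']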

Time Series Methods

Date and time data can be altered using the datetime module or with Pandas DateTime methods.

### datetime date objects ###
import datetime as dt

dt.date(2019,3,7)
# datetime.date(2019, 3, 7)

dt.timedelta(days=5)
# datetime.timedelta(5)

### Pandas datetime objects ###
pd.to_datetime('2019-3-7')
# Timestamp('2019-03-07 00:00:00')

pd.Timestamp(2019,3,7)
# Timestamp('2019-03-07 00:00:00')

pd.Timestamp(2019,3,7) + pd.offsets.MonthEnd()
# Timestamp('2019-03-31 00:00:00')

df.resample('M', convention='end').mean() # Downsample data to monthly showing means
df.resample('D').mean() # Upsample to daily (then .ffill or .interpolate to fill values)
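Resampling requires a DatetimeIndex; a minimal sketch with hypothetical daily data:

idx = pd.date_range('2019-01-01', periods=60, freq='D')
ts = pd.DataFrame({'value': range(60)}, index=idx)

ts.resample('M').mean()   # Downsample to month-end means
ts.resample('H').ffill()  # Upsample to hourly, forward-filling values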
