12 Pandas

Pandas is a library that is useful for working with different types of datasets. A dataframe is a pandas object that has a variety of function to analyze and visualize data. It arranges the data in rows and columns. There are multiple ways in which a dataframe can be created. For example, a dictionary can be converted to a datafraem such that the keys become headers and values (list) are entries in the dataframe. The orientation of the dataframe by default is columns ie keys are considered as column header and values are rows. This behaviour can be changed using the orient argument. When orientation is index, an addtional argument columns can be used to specify column headers.

import pandas as pd

input_dict = {'Column1': ['A','B','C','D','E'], \
              'Column2':[1,2,3,4,5]}
df1 = pd.DataFrame.from_dict(input_dict)
df2 = pd.DataFrame.from_dict(input_dict, orient='index', columns=['Val1','Val2','Val3','Val4','Val5',])
display(df1)
display(df2)

	Column1	Column2
0	A	1
1	B	2
2	C	3
3	D	4
4	E	5

	Val1	Val2	Val3	Val4	Val5
Column1	A	B	C	D	E
Column2	1	2	3	4	5

print(df2.index)

Index(['Column1', 'Column2'], dtype='object')

12.1 Basic information about the dataframe

Pandas offers set of commands to get some basic information about the content of dataframes. Below are some of these command along with their corresponding output.

	Column1	Column2
0	A	1
1	B	2
2	C	3
3	D	4
4	E	5

Information about this dataframe 
df1.shape      (5, 2)
df1.size       10
df1.ndim       2
df1.values     [['A' 1]
 ['B' 2]
 ['C' 3]
 ['D' 4]
 ['E' 5]]

12.2 Reading data from a csv file

The read_csv() function can be used to create a dataframe from a csv file. To use one of the columns as indices for the dataframe add the index_col keyword attribute.

# %load test.csv
Name,Age,Country
Sohan,22,India
Sam,21,USA

df3 = pd.read_csv("test.csv")
df4 = pd.read_csv("test.csv", index_col="Country")
display(df3)
display(df4)

	Name	Age	Country
0	Sohan	22	India
1	Sam	21	USA

	Name	Age
Country
India	Sohan	22
USA	Sam	21

12.3 Combining dataframes - join, merge, and concat

Concat is used to combine dataframes across rows or columns. Merge is used to combine dataframes on common columns or indices. Join is used to combine based on a key column or index.

import numpy as np
df_1 = pd.DataFrame(np.random.uniform(1,2,size=(5, 4)), columns=list('ABCD'))
df_2 = pd.DataFrame(np.random.uniform(2,3,size=(5, 4)), columns=list('ABCD'))

display(df_1)
display(df_2)

	A	B	C	D
0	1.356623	1.477763	1.569974	1.778819
1	1.532714	1.464930	1.382634	1.496065
2	1.565586	1.537660	1.558265	1.089754
3	1.526300	1.392927	1.798358	1.181237
4	1.693383	1.225141	1.411091	1.804210

	A	B	C	D
0	2.246880	2.632981	2.831306	2.088775
1	2.990886	2.369731	2.697256	2.048426
2	2.030963	2.141273	2.001073	2.406086
3	2.295191	2.721775	2.267560	2.300182
4	2.182948	2.941006	2.654403	2.478437

df_new = pd.concat([df_1,df_2],ignore_index=True)
display(df_new)

	A	B	C	D
0	1.356623	1.477763	1.569974	1.778819
1	1.532714	1.464930	1.382634	1.496065
2	1.565586	1.537660	1.558265	1.089754
3	1.526300	1.392927	1.798358	1.181237
4	1.693383	1.225141	1.411091	1.804210
5	2.246880	2.632981	2.831306	2.088775
6	2.990886	2.369731	2.697256	2.048426
7	2.030963	2.141273	2.001073	2.406086
8	2.295191	2.721775	2.267560	2.300182
9	2.182948	2.941006	2.654403	2.478437

df_new = pd.concat([df_1,df_2],axis=1)
display(df_new)

	A	B	C	D	A	B	C	D
0	1.356623	1.477763	1.569974	1.778819	2.246880	2.632981	2.831306	2.088775
1	1.532714	1.464930	1.382634	1.496065	2.990886	2.369731	2.697256	2.048426
2	1.565586	1.537660	1.558265	1.089754	2.030963	2.141273	2.001073	2.406086
3	1.526300	1.392927	1.798358	1.181237	2.295191	2.721775	2.267560	2.300182
4	1.693383	1.225141	1.411091	1.804210	2.182948	2.941006	2.654403	2.478437

df_new = pd.concat([df_1,df_2],keys=["First","Second"])
display(df_new)

		A	B	C	D
First	0	1.356623	1.477763	1.569974	1.778819
	1	1.532714	1.464930	1.382634	1.496065
	2	1.565586	1.537660	1.558265	1.089754
	3	1.526300	1.392927	1.798358	1.181237
	4	1.693383	1.225141	1.411091	1.804210
Second	0	2.246880	2.632981	2.831306	2.088775
	1	2.990886	2.369731	2.697256	2.048426
	2	2.030963	2.141273	2.001073	2.406086
	3	2.295191	2.721775	2.267560	2.300182
	4	2.182948	2.941006	2.654403	2.478437

display(df_new.loc["First"])

	A	B	C	D
0	1.356623	1.477763	1.569974	1.778819
1	1.532714	1.464930	1.382634	1.496065
2	1.565586	1.537660	1.558265	1.089754
3	1.526300	1.392927	1.798358	1.181237
4	1.693383	1.225141	1.411091	1.804210

Merge is used combine dataframe on one or more columns

df3 = pd.read_csv("test.csv")

display(df3)
df4 = df3.copy(deep=True)
df4.loc[2]=["Peter", 20, "UK"] 
df4.loc[len(df4.index)] = ["Mohan", 25, "India"]
display(df4)

df_merged1 = pd.merge(df3,df4)
display(df_merged1)

df_merged2 = pd.merge(df3,df4,on=["Country","Name"],\
                      suffixes=('_df3', '_df4'))
display(df_merged2)

		df3
	Name	Age	Country
0	Sohan	22	India
1	Sam	21	USA

	df4
	Name	Age	Country
0	Sohan	22	India
1	Sam	21	USA
2	Peter	20	UK
3	Mohan	25	India

	df_merged1
	Name	Age	Country
0	Sohan	22	India
1	Sam	21	USA

	df_merged2
	Name	Age_df3	Country	Age_df4
0	Sohan	22	India	22
1	Sam	21	USA	21

Join is used to combine dataframes along a specific column.

display(df3.join(df4,lsuffix='_df3', rsuffix='_df4'))

	Name_df3	Age_df3	Country_df3	Name_df4	Age_df4	Country_df4
0	Sohan	22	India	Sohan	22	India
1	Sam	21	USA	Sam	21	USA

display(df3)

	Name	Age	Country
0	Sohan	22	India
1	Sam	21	USA

display(df3.join(df4.set_index("Country"),on="Country", lsuffix='_df3', rsuffix='_df4'))

	Name_df3	Age_df3	Country	Name_df4	Age_df4
0	Sohan	22	India	Sohan	22
0	Sohan	22	India	Mohan	25
1	Sam	21	USA	Sam	21

display(df3.join(df4.set_index("Country"),on="Country", lsuffix='_df3', rsuffix='_df4', how="outer"))

	Name_df3	Age_df3	Country	Name_df4	Age_df4
0.0	Sohan	22.0	India	Sohan	22
0.0	Sohan	22.0	India	Mohan	25
1.0	Sam	21.0	USA	Sam	21
NaN	NaN	NaN	UK	Peter	20

12.4 Groupby

We can create groups for same values in a column to apply a function to all rows having a particular value.

students = [["Sam","Peter","Mohan", "Mike"], ["UG","PG","UG","PG"], [70,80,90,70]]
df_students = pd.DataFrame(students).T

df_students.columns=["Name","Program","Marks"]
display(df_students)

	Name	Program	Marks
0	Sam	UG	70
1	Peter	PG	80
2	Mohan	UG	90
3	Mike	PG	70

df_students.set_index("Program", inplace=True)
display(df_students)

	Name	Marks
Program
UG	Sam	70
PG	Peter	80
UG	Mohan	90
PG	Mike	70

df_students.groupby(level="Program")["Marks"].mean()

Program
PG    75.0
UG    80.0
Name: Marks, dtype: float64

12.4.1 Ploting

Dataframe has a plot() function to do basic visualization. The kind attribute for this function can be used to change the plot type.

df_col1 = pd.DataFrame(np.array(range(1,6))**2)
df_col2 = pd.DataFrame(np.array(range(1,6))**3)

df_comb = pd.concat([df_col1,df_col2], axis=1, ignore_index=True)
df_comb.columns = ["Squares", "Cubes"]
df_comb.index = range(1,6)
display(df_comb)

	Squares	Cubes
1	1	1
2	4	8
3	9	27
4	16	64
5	25	125

plot1 = df_comb.plot(title="Line Plot")
plot2 = df_comb.plot(kind="bar", title="Bar Plot")

12.5 The iris dataset

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# using the attribute information as the column names
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']
iris =  pd.read_csv(csv_url, names = col_names)

iris

	Sepal_Length	Sepal_Width	Petal_Length	Petal_Width	Class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	Iris-virginica
146	6.3	2.5	5.0	1.9	Iris-virginica
147	6.5	3.0	5.2	2.0	Iris-virginica
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica

150 rows × 5 columns

print("Shape", iris.shape)
print(iris.dtypes)

Shape (150, 5)
Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Class            object
dtype: object

iris.describe()

	Sepal_Length	Sepal_Width	Petal_Length	Petal_Width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

iris.set_index("Class").groupby(level="Class").describe()

	Sepal_Length								Sepal_Width		...	Petal_Length		Petal_Width
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
Class
Iris-setosa	50.0	5.006	0.352490	4.3	4.800	5.0	5.2	5.8	50.0	3.418	...	1.575	1.9	50.0	0.244	0.107210	0.1	0.2	0.2	0.3	0.6
Iris-versicolor	50.0	5.936	0.516171	4.9	5.600	5.9	6.3	7.0	50.0	2.770	...	4.600	5.1	50.0	1.326	0.197753	1.0	1.2	1.3	1.5	1.8
Iris-virginica	50.0	6.588	0.635880	4.9	6.225	6.5	6.9	7.9	50.0	2.974	...	5.875	6.9	50.0	2.026	0.274650	1.4	1.8	2.0	2.3	2.5

3 rows × 32 columns

plot3 = iris.plot.scatter(x="Sepal_Length",y="Sepal_Width")

import matplotlib.pyplot as plt
import seaborn as sns

plot4 = iris.set_index("Class").groupby(level="Class").plot.scatter(x="Sepal_Length",y="Petal_Length")

plot5 = sns.jointplot(data=iris,x="Sepal_Length",y="Petal_Length",hue="Class")