Pandas is a library that is useful for working with different types of datasets. A dataframe is a pandas object that has a variety of function to analyze and visualize data. It arranges the data in rows and columns. There are multiple ways in which a dataframe can be created. For example, a dictionary can be converted to a datafraem such that the keys become headers and values (list) are entries in the dataframe. The orientation of the dataframe by default is columns
ie keys are considered as column header and values are rows. This behaviour can be changed using the orient
argument. When orientation is index, an addtional argument columns
can be used to specify column headers.
input_dict = {'Column1' : ['A' ,'B' ,'C' ,'D' ,'E' ], \
'Column2' :[1 ,2 ,3 ,4 ,5 ]}
df1 = pd.DataFrame.from_dict(input_dict)
df2 = pd.DataFrame.from_dict(input_dict, orient= 'index' , columns= ['Val1' ,'Val2' ,'Val3' ,'Val4' ,'Val5' ,])
display(df1)
display(df2)
Column1
Column2
0
A
1
1
B
2
2
C
3
3
D
4
4
E
5
Val1
Val2
Val3
Val4
Val5
Column1
A
B
C
D
E
Column2
1
2
3
4
5
Index(['Column1', 'Column2'], dtype='object')
Reading data from a csv file
The read_csv()
function can be used to create a dataframe from a csv file. To use one of the columns as indices for the dataframe add the index_col
keyword attribute.
# %load test.csv
Name,Age,Country
Sohan,22 ,India
Sam,21 ,USA
df3 = pd.read_csv("test.csv" )
df4 = pd.read_csv("test.csv" , index_col= "Country" )
display(df3)
display(df4)
Name
Age
Country
0
Sohan
22
India
1
Sam
21
USA
Name
Age
Country
India
Sohan
22
USA
Sam
21
Combining dataframes - join, merge, and concat
Concat is used to combine dataframes across rows or columns. Merge is used to combine dataframes on common columns or indices. Join is used to combine based on a key column or index.
import numpy as np
df_1 = pd.DataFrame(np.random.uniform(1 ,2 ,size= (5 , 4 )), columns= list ('ABCD' ))
df_2 = pd.DataFrame(np.random.uniform(2 ,3 ,size= (5 , 4 )), columns= list ('ABCD' ))
display(df_1)
display(df_2)
A
B
C
D
0
1.356623
1.477763
1.569974
1.778819
1
1.532714
1.464930
1.382634
1.496065
2
1.565586
1.537660
1.558265
1.089754
3
1.526300
1.392927
1.798358
1.181237
4
1.693383
1.225141
1.411091
1.804210
A
B
C
D
0
2.246880
2.632981
2.831306
2.088775
1
2.990886
2.369731
2.697256
2.048426
2
2.030963
2.141273
2.001073
2.406086
3
2.295191
2.721775
2.267560
2.300182
4
2.182948
2.941006
2.654403
2.478437
df_new = pd.concat([df_1,df_2],ignore_index= True )
display(df_new)
A
B
C
D
0
1.356623
1.477763
1.569974
1.778819
1
1.532714
1.464930
1.382634
1.496065
2
1.565586
1.537660
1.558265
1.089754
3
1.526300
1.392927
1.798358
1.181237
4
1.693383
1.225141
1.411091
1.804210
5
2.246880
2.632981
2.831306
2.088775
6
2.990886
2.369731
2.697256
2.048426
7
2.030963
2.141273
2.001073
2.406086
8
2.295191
2.721775
2.267560
2.300182
9
2.182948
2.941006
2.654403
2.478437
df_new = pd.concat([df_1,df_2],axis= 1 )
display(df_new)
A
B
C
D
A
B
C
D
0
1.356623
1.477763
1.569974
1.778819
2.246880
2.632981
2.831306
2.088775
1
1.532714
1.464930
1.382634
1.496065
2.990886
2.369731
2.697256
2.048426
2
1.565586
1.537660
1.558265
1.089754
2.030963
2.141273
2.001073
2.406086
3
1.526300
1.392927
1.798358
1.181237
2.295191
2.721775
2.267560
2.300182
4
1.693383
1.225141
1.411091
1.804210
2.182948
2.941006
2.654403
2.478437
df_new = pd.concat([df_1,df_2],keys= ["First" ,"Second" ])
display(df_new)
A
B
C
D
First
0
1.356623
1.477763
1.569974
1.778819
1
1.532714
1.464930
1.382634
1.496065
2
1.565586
1.537660
1.558265
1.089754
3
1.526300
1.392927
1.798358
1.181237
4
1.693383
1.225141
1.411091
1.804210
Second
0
2.246880
2.632981
2.831306
2.088775
1
2.990886
2.369731
2.697256
2.048426
2
2.030963
2.141273
2.001073
2.406086
3
2.295191
2.721775
2.267560
2.300182
4
2.182948
2.941006
2.654403
2.478437
display(df_new.loc["First" ])
A
B
C
D
0
1.356623
1.477763
1.569974
1.778819
1
1.532714
1.464930
1.382634
1.496065
2
1.565586
1.537660
1.558265
1.089754
3
1.526300
1.392927
1.798358
1.181237
4
1.693383
1.225141
1.411091
1.804210
Merge is used combine dataframe on one or more columns
df3 = pd.read_csv("test.csv" )
display(df3)
df4 = df3.copy(deep= True )
df4.loc[2 ]= ["Peter" , 20 , "UK" ]
df4.loc[len (df4.index)] = ["Mohan" , 25 , "India" ]
display(df4)
df_merged1 = pd.merge(df3,df4)
display(df_merged1)
df_merged2 = pd.merge(df3,df4,on= ["Country" ,"Name" ],\
suffixes= ('_df3' , '_df4' ))
display(df_merged2)
df3
Name
Age
Country
0
Sohan
22
India
1
Sam
21
USA
df4
Name
Age
Country
0
Sohan
22
India
1
Sam
21
USA
2
Peter
20
UK
3
Mohan
25
India
df_merged1
Name
Age
Country
0
Sohan
22
India
1
Sam
21
USA
df_merged2
Name
Age_df3
Country
Age_df4
0
Sohan
22
India
22
1
Sam
21
USA
21
Join is used to combine dataframes along a specific column.
display(df3.join(df4,lsuffix= '_df3' , rsuffix= '_df4' ))
Name_df3
Age_df3
Country_df3
Name_df4
Age_df4
Country_df4
0
Sohan
22
India
Sohan
22
India
1
Sam
21
USA
Sam
21
USA
Name
Age
Country
0
Sohan
22
India
1
Sam
21
USA
display(df3.join(df4.set_index("Country" ),on= "Country" , lsuffix= '_df3' , rsuffix= '_df4' ))
Name_df3
Age_df3
Country
Name_df4
Age_df4
0
Sohan
22
India
Sohan
22
0
Sohan
22
India
Mohan
25
1
Sam
21
USA
Sam
21
display(df3.join(df4.set_index("Country" ),on= "Country" , lsuffix= '_df3' , rsuffix= '_df4' , how= "outer" ))
Name_df3
Age_df3
Country
Name_df4
Age_df4
0.0
Sohan
22.0
India
Sohan
22
0.0
Sohan
22.0
India
Mohan
25
1.0
Sam
21.0
USA
Sam
21
NaN
NaN
NaN
UK
Peter
20
Groupby
We can create groups for same values in a column to apply a function to all rows having a particular value.
students = [["Sam" ,"Peter" ,"Mohan" , "Mike" ], ["UG" ,"PG" ,"UG" ,"PG" ], [70 ,80 ,90 ,70 ]]
df_students = pd.DataFrame(students).T
df_students.columns= ["Name" ,"Program" ,"Marks" ]
display(df_students)
Name
Program
Marks
0
Sam
UG
70
1
Peter
PG
80
2
Mohan
UG
90
3
Mike
PG
70
df_students.set_index("Program" , inplace= True )
display(df_students)
Name
Marks
Program
UG
Sam
70
PG
Peter
80
UG
Mohan
90
PG
Mike
70
df_students.groupby(level= "Program" )["Marks" ].mean()
Program
PG 75.0
UG 80.0
Name: Marks, dtype: float64
Ploting
Dataframe has a plot()
function to do basic visualization. The kind
attribute for this function can be used to change the plot type.
df_col1 = pd.DataFrame(np.array(range (1 ,6 ))** 2 )
df_col2 = pd.DataFrame(np.array(range (1 ,6 ))** 3 )
df_comb = pd.concat([df_col1,df_col2], axis= 1 , ignore_index= True )
df_comb.columns = ["Squares" , "Cubes" ]
df_comb.index = range (1 ,6 )
display(df_comb)
Squares
Cubes
1
1
1
2
4
8
3
9
27
4
16
64
5
25
125
plot1 = df_comb.plot(title= "Line Plot" )
plot2 = df_comb.plot(kind= "bar" , title= "Bar Plot" )
The iris dataset
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# using the attribute information as the column names
col_names = ['Sepal_Length' ,'Sepal_Width' ,'Petal_Length' ,'Petal_Width' ,'Class' ]
iris = pd.read_csv(csv_url, names = col_names)
Sepal_Length
Sepal_Width
Petal_Length
Petal_Width
Class
0
5.1
3.5
1.4
0.2
Iris-setosa
1
4.9
3.0
1.4
0.2
Iris-setosa
2
4.7
3.2
1.3
0.2
Iris-setosa
3
4.6
3.1
1.5
0.2
Iris-setosa
4
5.0
3.6
1.4
0.2
Iris-setosa
...
...
...
...
...
...
145
6.7
3.0
5.2
2.3
Iris-virginica
146
6.3
2.5
5.0
1.9
Iris-virginica
147
6.5
3.0
5.2
2.0
Iris-virginica
148
6.2
3.4
5.4
2.3
Iris-virginica
149
5.9
3.0
5.1
1.8
Iris-virginica
150 rows × 5 columns
print ("Shape" , iris.shape)
print (iris.dtypes)
Shape (150, 5)
Sepal_Length float64
Sepal_Width float64
Petal_Length float64
Petal_Width float64
Class object
dtype: object
Sepal_Length
Sepal_Width
Petal_Length
Petal_Width
count
150.000000
150.000000
150.000000
150.000000
mean
5.843333
3.054000
3.758667
1.198667
std
0.828066
0.433594
1.764420
0.763161
min
4.300000
2.000000
1.000000
0.100000
25%
5.100000
2.800000
1.600000
0.300000
50%
5.800000
3.000000
4.350000
1.300000
75%
6.400000
3.300000
5.100000
1.800000
max
7.900000
4.400000
6.900000
2.500000
iris.set_index("Class" ).groupby(level= "Class" ).describe()
Sepal_Length
Sepal_Width
...
Petal_Length
Petal_Width
count
mean
std
min
25%
50%
75%
max
count
mean
...
75%
max
count
mean
std
min
25%
50%
75%
max
Class
Iris-setosa
50.0
5.006
0.352490
4.3
4.800
5.0
5.2
5.8
50.0
3.418
...
1.575
1.9
50.0
0.244
0.107210
0.1
0.2
0.2
0.3
0.6
Iris-versicolor
50.0
5.936
0.516171
4.9
5.600
5.9
6.3
7.0
50.0
2.770
...
4.600
5.1
50.0
1.326
0.197753
1.0
1.2
1.3
1.5
1.8
Iris-virginica
50.0
6.588
0.635880
4.9
6.225
6.5
6.9
7.9
50.0
2.974
...
5.875
6.9
50.0
2.026
0.274650
1.4
1.8
2.0
2.3
2.5
3 rows × 32 columns
plot3 = iris.plot.scatter(x= "Sepal_Length" ,y= "Sepal_Width" )
import matplotlib.pyplot as plt
import seaborn as sns
plot4 = iris.set_index("Class" ).groupby(level= "Class" ).plot.scatter(x= "Sepal_Length" ,y= "Petal_Length" )
plot5 = sns.jointplot(data= iris,x= "Sepal_Length" ,y= "Petal_Length" ,hue= "Class" )