Data analysis is one of the most important parts of data science, and when it comes to data analysis in Python, the pandas library simply can't be skipped.
Pandas is an open-source tool for data analysis and manipulation that comes in handy for every data analyst. It provides some significant data structures and is built on top of the NumPy package, which gives it much of its reliability and strength.
Pandas | Data Structures
The pandas library provides us with two important data structures, namely Series and DataFrames.
A Series is a one-dimensional data structure that stores data much like a Python list or array. However, each value can also carry a label (index) of any hashable data type, which is what makes the Series distinctive and powerful. The data for a Series can come from a Python dictionary, tuple, list, NumPy array, scalar value, etc.
DataFrames are two-dimensional objects that have rows and columns (just like an Excel sheet). Think of a DataFrame as a set of pandas Series objects (the columns) stacked side by side and sharing the same index / label values. Since a DataFrame is built from Series objects, it is used far more often than a standalone Series.
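To make the two structures concrete, here is a minimal sketch of creating them. The student names, subjects and marks are made-up sample data, not something from a real dataset.

```python
import numpy as np
import pandas as pd

# A Series from a list, with custom string labels as the index
marks = pd.Series([82, 91, 77], index=["Asha", "Ben", "Chen"])

# A Series from a dictionary: the keys become the index labels
ages = pd.Series({"Asha": 21, "Ben": 22, "Chen": 20})

# A Series from a NumPy array gets the default integer index 0, 1, 2, ...
squares = pd.Series(np.array([1, 4, 9]))

# A DataFrame from a dictionary of columns; every column is a Series
# and all columns share the same row index
df = pd.DataFrame(
    {"Physics": [82, 91, 77], "Chemistry": [68, 88, 95]},
    index=["Asha", "Ben", "Chen"],
)

print(marks["Ben"])          # look up a value by its label
print(df.shape)              # (number of rows, number of columns)
print(type(df["Physics"]))   # a single column comes back as a Series
```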
You could try creating Series and DataFrames from each of the data types mentioned above.
We can also import or export data as CSV (comma-separated values), JSON, HTML, SQL, etc. with the help of pandas' built-in functions.
Pandas provides a family of reader functions named pd.read_<filetype>('filename'), such as pd.read_csv and pd.read_json, to import data from various file types. Similarly, you can write the final results to the supported file types using dataframe.to_<filetype>('path/filename'), such as dataframe.to_csv. Both families take further parameters, for example to choose which column becomes the index or to set the delimiter (the symbol that separates the values).
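As a rough illustration of the read/write pattern, the sketch below round-trips a small DataFrame through CSV and JSON. To keep it self-contained it uses io.StringIO as a stand-in for a real file path, and the sample data is invented.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ben"], "Chemistry": [68, 88]})

# With no path given, to_csv returns the CSV text;
# index=False leaves the row index out of the output
csv_text = df.to_csv(index=False)

# Read it back; io.StringIO stands in for a file path here
df_csv = pd.read_csv(io.StringIO(csv_text))

# sep= sets the delimiter, index_col= picks the column to use as the index
semi_text = df.to_csv(index=False, sep=";")
df_semi = pd.read_csv(io.StringIO(semi_text), sep=";", index_col="name")

# JSON works the same way through to_json / read_json
json_text = df.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text))
```

With real files you would pass a path instead, e.g. df.to_csv('path/filename.csv') and pd.read_csv('path/filename.csv').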
Selecting Rows and/or Columns
When dealing with a dataset, we can select a particular column by passing its label in square brackets, for example dataframe['column_label'].
However, selecting a particular row works differently: use the indexers dataframe.loc['row_label', 'column_label'] (label-based) or dataframe.iloc[row_position, column_position] (integer-position-based), where the column part is optional.
Note : To select a combination of rows or columns, pass a list of the desired row/column labels or positions (depending on the indexer you're using), or a slice.
Older code may also use dataframe.ix[row, column], which accepted either positions or labels. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc or .iloc instead.
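The selection rules above can be sketched as follows, again on invented sample data. Note how .loc takes labels while .iloc takes integer positions.

```python
import pandas as pd

df = pd.DataFrame(
    {"Physics": [82, 91, 77], "Chemistry": [68, 88, 95]},
    index=["Asha", "Ben", "Chen"],
)

col = df["Physics"]                 # one column, returned as a Series
row_by_label = df.loc["Ben"]        # one row, selected by label
row_by_position = df.iloc[1]        # the same row, selected by position
cell = df.loc["Ben", "Chemistry"]   # one value: row label, column label
same_cell = df.iloc[1, 1]           # one value: row position, column position

# Several rows/columns: pass lists (or slices) instead of single labels
subset = df.loc[["Asha", "Chen"], ["Chemistry"]]
first_two = df.iloc[0:2, :]         # positional slice, the end is excluded
```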
Just as we select a row or column, we can add a new row to the dataset by assigning to .loc with a new row label (.iloc cannot add rows, because it only addresses positions that already exist), and add a new column by assigning to a new column label:
>>> dataframe.loc['new_row_label'] = data
>>> dataframe['new_column_label'] = data
To drop a particular row or column, use the dataframe.drop() method, which takes parameters like labels, index, columns, axis, inplace, etc.
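Here is a small sketch of adding and dropping rows and columns. The "Maths" column and the values are hypothetical and only serve the illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Physics": [82, 91], "Chemistry": [68, 88]},
    index=["Asha", "Ben"],
)

# Add a column: assign a list (one value per row) to a new label
df["Maths"] = [75, 93]

# Add a row: assign to a label that .loc does not know yet
df.loc["Chen"] = [77, 95, 80]

# drop() returns a new DataFrame by default; df itself is unchanged
without_maths = df.drop(columns=["Maths"])   # same as df.drop("Maths", axis=1)
without_asha = df.drop(index=["Asha"])       # same as df.drop("Asha", axis=0)

# inplace=True modifies df itself and returns None
df.drop("Ben", axis=0, inplace=True)
```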
Aggregate Functions, GroupBy and Sorting
With numerical data, we almost always need summary statistics (max, min, standard deviation, etc.).
Pandas provides several methods for this.
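For instance, describe() gives a quick statistical overview of a DataFrame. The sketch below runs it on invented marks.

```python
import pandas as pd

df = pd.DataFrame(
    {"Physics": [82, 91, 77], "Chemistry": [68, 88, 95]},
    index=["Asha", "Ben", "Chen"],
)

# One call gives count, mean, std, min, quartiles and max per numeric column
summary = df.describe()
print(summary)

# The result is itself a DataFrame, so single statistics can be looked up
print(summary.loc["max", "Chemistry"])
```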
The describe() method reports each of these statistics for the numerical data in a single call; however, we can also call individual functions like
- abs() – returns absolute value
- mode(), etc.
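A short sketch of calling such functions individually on a Series follows. The numbers are arbitrary, and mean() and max() are included as two further examples of the same kind.

```python
import pandas as pd

changes = pd.Series([-3, 5, -3, 8])

absolute = changes.abs()       # element-wise absolute values
most_common = changes.mode()   # returned as a Series, because there can be ties
average = changes.mean()       # a single number
largest = changes.max()        # a single number
```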
There are also methods to sort the data of a DataFrame/Series; they take parameters like axis (0 : rows, 1 : columns), inplace, ascending, and more to control the sort.
- sort_index() – sort by labels along an axis
- sort_values(by='Chemistry') – sort by the values along an axis
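Both sorting methods can be tried out on a small, deliberately unordered DataFrame. The data below is made up for the purpose.

```python
import pandas as pd

df = pd.DataFrame(
    {"Physics": [82, 91, 77], "Chemistry": [88, 68, 95]},
    index=["Chen", "Asha", "Ben"],
)

by_label = df.sort_index()                # rows ordered by index label
by_column_name = df.sort_index(axis=1)    # columns ordered by name

# Highest Chemistry mark first
by_chem = df.sort_values(by="Chemistry", ascending=False)
```

All three calls return a new DataFrame and leave df untouched unless inplace=True is passed.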
Also, groupby() is used to group data on the basis of the values in one or more columns.
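As a minimal sketch of grouping, assume a hypothetical "section" column that splits the students into two groups:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "section": ["A", "B", "A", "B"],
        "Chemistry": [68, 88, 95, 72],
    }
)

# Split the rows by section, then aggregate each group
mean_by_section = df.groupby("section")["Chemistry"].mean()

# agg() computes several statistics per group at once
stats = df.groupby("section")["Chemistry"].agg(["min", "max", "count"])
```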
Dealing with Missing Data
Pandas comes loaded with functions to deal with missing values in a dataset. Missing entries carry no information by themselves, so to deal with them we can either drop the rows or columns that contain them or fill them in.
- dataframe.dropna()
- dataframe.fillna(value=' ')
Each method takes parameters like axis (0 : row, 1 : column), inplace (True/False), and more.
Note : Prefer filling null values in your dataset with a sensible value (the mean, the mode, or any other relevant value) rather than dropping them, as the rest of that row or column may be important for your observations.
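The two options, dropping and filling, might look like this on a tiny DataFrame with gaps. The data is invented, and filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Physics": [82.0, np.nan, 77.0], "Chemistry": [68.0, 88.0, np.nan]}
)

missing_per_column = df.isna().sum()   # count the missing values per column

dropped_rows = df.dropna()             # axis=0 (default): drop rows with any NaN
dropped_cols = df.dropna(axis=1)       # axis=1: drop columns with any NaN

# Fill each column's gaps with that column's own mean
filled = df.fillna(df.mean())
```

Here dropping rows leaves only one of three students, which shows why filling is often the better option.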
Application of user-defined functions
Pandas lets us apply user-defined functions over our dataset using the DataFrame.apply() method.
Built-in functions (e.g., np.mean) can be applied in the same way.
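One such instance is sketched below. The grade() helper and its pass mark of 75 are hypothetical, chosen only to have something to apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Physics": [82, 91, 77], "Chemistry": [68, 88, 95]})


def grade(mark):
    """Hypothetical helper: map a mark to a pass/fail label."""
    return "pass" if mark >= 75 else "fail"


# Series.apply calls the function once per value in the column
df["Chemistry_result"] = df["Chemistry"].apply(grade)

# DataFrame.apply with axis=1 passes each row (as a Series) to the function
df["total"] = df[["Physics", "Chemistry"]].apply(lambda row: row.sum(), axis=1)

# A built-in such as np.mean is applied column by column (axis=0, the default)
column_means = df[["Physics", "Chemistry"]].apply(np.mean)
```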
Pandas | Useful Methods
Pandas has many useful methods; some of the basic and most-used ones in data analysis are as follows:
Merging, Joining and Concatenation
To join several DataFrame objects, or to add more columns to the original DataFrame, we can use any of the following methods provided by the pandas library :
- dataframe1.append( dataframe2 ) – appends/concatenates the rows of the second DataFrame at the end of the first. It only concatenates along axis=0 (namely the index) and predates the concat method. It was deprecated in pandas 1.4 and removed in pandas 2.0, so use pandas.concat in current code.
- pandas.concat( [ dataframe1 , dataframe2 ] , axis=1 ) – concatenates two or more DataFrames row-wise (axis=0, the default) or column-wise (axis=1), as specified.
- dataframe1.join( dataframe2 , how='inner' , on=column1 ) – much like a SQL join operation, it joins two or more DataFrame objects, matching on the index of dataframe2 (and on column1 of dataframe1 when on is given).
- pd.merge(left, right, on=None) – very similar to the join method; merge is the entry point for all standard database join operations between DataFrame objects.
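To see how these differ in practice, here is a minimal sketch on two invented tables that share a "student" column. DataFrame.append is left out because it no longer exists in pandas 2.0 and later.

```python
import pandas as pd

left = pd.DataFrame({"student": ["Asha", "Ben", "Chen"], "Physics": [82, 91, 77]})
right = pd.DataFrame({"student": ["Ben", "Chen", "Dev"], "Chemistry": [88, 95, 70]})

# Stack rows (axis=0); ignore_index=True renumbers the result 0..n-1
stacked = pd.concat([left, left], axis=0, ignore_index=True)

# Place the frames side by side (axis=1), aligned on the row index
side_by_side = pd.concat([left, right], axis=1)

# SQL-style inner join on a shared column: only Ben and Chen are in both
inner = pd.merge(left, right, on="student", how="inner")

# An outer join keeps every student and fills the gaps with NaN
outer = pd.merge(left, right, on="student", how="outer")

# join() matches on the index of the other frame, so set it first
joined = left.set_index("student").join(right.set_index("student"), how="inner")
```

Note that concat with axis=1 only lines rows up by index, so the two "student" columns end up next to each other without being matched; merge or join is the right tool when rows should be matched on a key.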
That's about it for a basic intuition of one of the most versatile data manipulation libraries in Python : pandas.