Pandas Library for Data Manipulation


Data Analysis, Machine Learning / Friday, March 30th, 2018

Data analysis is one of the most important parts of Data Science, and when it comes to data analysis in Python, the Pandas library simply cannot be skipped.

The Pandas library is an open-source tool for data analysis and manipulation that comes in handy for every data analyst. It provides some powerful data structures and is in fact built on top of the NumPy package (learn about NumPy here), which gives it extra reliability and speed.


Pandas | Data Structures

The Pandas library provides us with two important and unique data structures, namely Series and DataFrame.

Series –

A Series is a one-dimensional data structure which stores data just like a Python list or array; however, we can attach a label/index of any data type to each element of the Series object, which makes it unique and powerful. The data for a Series object can come from a Python dictionary, tuple, list, NumPy array, scalar value, etc.
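
A minimal sketch of creating Series objects (the subjects and marks below are made up purely for illustration):

import pandas as pd

# A Series from a Python list, with custom labels as the index
marks = pd.Series([88, 92, 79], index=['Physics', 'Chemistry', 'Maths'])

# A Series from a dictionary: the keys become the index automatically
marks_from_dict = pd.Series({'Physics': 88, 'Chemistry': 92, 'Maths': 79})

print(marks['Chemistry'])   # access an element by its label -> 92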

Dataframe –

DataFrames are two-dimensional objects with rows and columns (just like an Excel sheet). Think of a DataFrame as several pandas Series objects (the columns) stacked together on the basis of the same index/label values. Since a DataFrame is built on top of Series objects, it is used far more often than the former data structure.
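
A short sketch of building a DataFrame from a dictionary of columns (the student names and marks are again invented for illustration):

import pandas as pd

# Each dictionary key becomes a column; all columns share the same row index
students = pd.DataFrame(
    {'Physics': [88, 67, 75], 'Chemistry': [92, 70, 81]},
    index=['Alice', 'Bob', 'Carol']
)

print(students['Physics'])   # a single column of a DataFrame is itself a Series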

Similarly, you could try creating Series and DataFrames from the different data types mentioned above.


Pandas Input/Output

We can also import or export data in the form of CSV (comma-separated values), JSON, HTML, SQL, etc. with the help of pandas' built-in functions.

Pandas provides various functions of the form pd.read_filetype('filename') to read/import data from supported file types (click here to know about all supported file types). Similarly, you can write the final results out to several supported file types using dataframe.to_filetype('path/filename'). Both families of functions accept additional parameters, for instance to choose which column becomes the index of our data, or which delimiter (the symbol by which the text is separated) to use.
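
For instance, a small sketch using the CSV reader and writer (it assumes a local file named students.csv exists; the file names here are hypothetical):

import pandas as pd

# Read a CSV file into a DataFrame, using the first column as the row index
df = pd.read_csv('students.csv', index_col=0, sep=',')

# ... analysis ...

# Write the (possibly modified) DataFrame back out to a new CSV file
df.to_csv('results.csv', index=True)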


Selecting Rows and/or Columns

When dealing with a dataset, we can select a particular column by simply passing the label of the column in square brackets, as shown in the sketch below.

However, to select a particular row, the process is slightly different: you use dataframe.loc['row_label', 'column_label'] (label-based) or dataframe.iloc[row_index, column_index] (integer-position-based), where the column part is optional, as per the user's choice.

Note: To select a combination of rows or columns, you can pass a list of the desired row/column labels or indices (depending on the function you're using), as sketched below.
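
A small sketch of both styles of selection, reusing the hypothetical students DataFrame from earlier:

import pandas as pd

students = pd.DataFrame(
    {'Physics': [88, 67, 75], 'Chemistry': [92, 70, 81]},
    index=['Alice', 'Bob', 'Carol']
)

print(students['Chemistry'])                            # a column, by its label
print(students.loc['Bob'])                              # a row, by its label
print(students.loc['Bob', 'Physics'])                   # a single cell, by labels
print(students.iloc[1, 0])                              # the same cell, by integer positions
print(students.loc[['Alice', 'Carol'], ['Chemistry']])  # lists select several rows/columns at once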


dataframe.ix[row, column] used to work as a replacement that accepted either labels or integer positions (user's choice), but it has been deprecated in favour of .loc and .iloc and should be avoided in new code.


Creation/deletion

Just like selecting a row or column (as illustrated above), we can add a new row to the dataset by assigning data to .loc with a new row label, and a new column by assigning data to a new column label:

>>> dataframe['new_column_label'] = data

And to drop a particular row or column, the dataframe.drop() method is used, which takes parameters such as labels, index, axis, inplace, etc.
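
A short sketch of adding and dropping rows/columns on the hypothetical students DataFrame:

import pandas as pd

students = pd.DataFrame(
    {'Physics': [88, 67, 75], 'Chemistry': [92, 70, 81]},
    index=['Alice', 'Bob', 'Carol']
)

students['Maths'] = [79, 85, 90]       # create a new column from a list of values
students.loc['Dave'] = [70, 72, 68]    # create a new row under a new label

# drop() returns a new object unless inplace=True is passed
without_chem = students.drop('Chemistry', axis=1)   # drop a column
without_bob = students.drop('Bob')                  # drop a row (axis=0 is the default)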


Aggregate Functions, GroupBy and Sorting

For numerical data, the need to find statistical details (max, min, standard deviation, etc.) is always there.

Pandas provides a number of built-in functions and methods for this, as described below:

The describe() method is the quickest way to get these statistics: it summarises every numerical column in one call (count, mean, standard deviation, quartiles, and so on), as sketched after the list below. We can also call individual aggregate functions such as

  • abs() –  returns absolute value
  • max()
  • min()
  • count()
  • mean()
  • median()
  • mode(), etc.
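
A sketch of describe() and a couple of individual aggregates, still on the made-up students DataFrame:

import pandas as pd

students = pd.DataFrame(
    {'Physics': [88, 67, 75], 'Chemistry': [92, 70, 81]},
    index=['Alice', 'Bob', 'Carol']
)

print(students.describe())         # count, mean, std, min, quartiles and max for every numeric column
print(students['Physics'].max())   # a single aggregate on one column
print(students.mean())             # column-wise means for the whole DataFrame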

There are also methods to sort the data of a DataFrame/Series; they take parameters such as axis (0: rows, 1: columns), inplace, ascending, and more to control how the sort is applied.

  • sort_index() – sort by labels along an axis
  • sort_values(by='Chemistry')  –  sort by the values along an axis
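
For example (the Chemistry column here is the same illustrative one used above):

import pandas as pd

students = pd.DataFrame(
    {'Physics': [88, 67, 75], 'Chemistry': [92, 70, 81]},
    index=['Carol', 'Alice', 'Bob']
)

print(students.sort_index())                                   # rows ordered by their index labels
print(students.sort_values(by='Chemistry', ascending=False))   # rows ordered by the Chemistry column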

Also, groupby() is used to group the data on the basis of some column's values and then aggregate each group, as sketched below.
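
A minimal sketch of a group-and-aggregate, with an illustrative scores table:

import pandas as pd

scores = pd.DataFrame({
    'Subject': ['Physics', 'Physics', 'Chemistry', 'Chemistry'],
    'Marks':   [88, 67, 92, 70]
})

# Group the rows by Subject, then aggregate each group
print(scores.groupby('Subject')['Marks'].mean())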


Dealing with Missing Data

Pandas comes loaded with functions to deal with missing values in a dataset. Missing entries carry no information on their own, so to handle them we can either drop the rows/columns containing missing data or fill the gaps in.

  • dataframe.dropna()
  • dataframe.fillna(value=...)

Each method takes in parameters like axis (0 : row, 1 : column), inplace (True/False), and many more.

Note: Prefer filling null values in your dataset with something sensible (be it the mean, the mode, or any other relevant value) rather than dropping them, as the affected rows or columns may still carry information that matters for your observations.
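
A sketch of both options on a small DataFrame with some deliberately missing marks:

import numpy as np
import pandas as pd

students = pd.DataFrame(
    {'Physics': [88, np.nan, 75], 'Chemistry': [92, 70, np.nan]},
    index=['Alice', 'Bob', 'Carol']
)

dropped = students.dropna()                  # drop every row that contains a NaN
filled = students.fillna(students.mean())    # fill each NaN with its column's mean instead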


Application of user-defined functions

Pandas lets us apply user-defined functions over our dataset using the DataFrame.apply() method.
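
A sketch, again on the made-up students DataFrame, applying a user-defined function column by column:

import numpy as np
import pandas as pd

students = pd.DataFrame(
    {'Physics': [88, 67, 75], 'Chemistry': [92, 70, 81]},
    index=['Alice', 'Bob', 'Carol']
)

# A user-defined function applied to every column (axis=0 is the default)
def mark_range(column):
    return column.max() - column.min()

print(students.apply(mark_range))

# Built-in functions can be applied the same way
print(students.apply(np.mean))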

The sketch above illustrates one such instance. Similarly, you can apply built-in functions (for example, np.mean) as well.


Pandas | Useful Methods

There are several useful methods in pandas; some of the most basic and frequently used ones for data analysis are as follows –


Merging, Joining and Concatenation

To join multiple DataFrame objects or to add more columns to the original DataFrame, we can use any of the following methods provided by the pandas library (a short sketch follows the list):

  • dataframe1.append(dataframe2)   –   appends/concatenates the rows of the second DataFrame to the end of the first, aligning columns by label. It only concatenates along axis=0 (namely the index), predates concat(), and has since been deprecated in newer pandas versions in favour of pandas.concat().
  • pandas.concat([dataframe1, dataframe2], axis=1)   –   concatenates two or more DataFrames row-wise (axis=0) or column-wise (axis=1), as specified.
  • dataframe1.join(dataframe2, how='inner', on='column1')   –   much like an SQL join operation, it joins two DataFrame objects on an index or key column.
  • pd.merge(left, right, on=None)   –   very similar to the join() method; merge() is the entry point for all standard database-style join operations between DataFrame objects.
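
A sketch of concat() and merge() on two small made-up frames that share a 'key' column:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'Physics': [88, 67, 75]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'Chemistry': [92, 70, 81]})

# Stack the frames vertically (axis=0) or place them side by side (axis=1)
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# SQL-style inner join on the shared 'key' column
joined = pd.merge(left, right, on='key', how='inner')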

That is pretty much everything you need for a basic intuition of the most versatile data manipulation library in Python: Pandas.

If you've found this useful, like and share the post. Also, drop your doubts (if any) in the comments section below, or reach us via the contact form on the site. Don't forget to subscribe to the news feed so that you stay updated on new blog posts.

Happy Learning!
