Top 5 Pandas Functions Essential for Data Scientists [2022]

Pandas is clearly one of the most used and loved libraries when it comes to Data Science and Data Analysis with Python. What makes it special? In this tutorial, we will go over 5 such functions that make Pandas an extremely useful tool in a Data Scientist’s tool kit.

By the end of this tutorial, you’ll know the following Pandas functions and how to use them in your own applications:

  • value_counts
  • groupby
  • loc and iloc
  • unique and nunique
  • cut and qcut

Top Pandas Functions For Data Scientists

1. value_counts()

Pandas’ value_counts() function returns the count of each unique value in a column of a dataframe, sorted from most to least frequent.

Pro Tip: While Pandas gives the output as plain text, you can easily plot the values using the inbuilt bar plot in Pandas for a graphical representation of the same information.

To demonstrate, I’ll be using the Titanic Dataset, loaded into a dataframe named train.

Now, to find the counts of classes in the Embarked feature, we can call the value_counts function:

train['Embarked'].value_counts()

#Output:
S      644
C      168
Q       77

Also, if the raw counts don’t tell you much, you can view the proportions instead:

train['Embarked'].value_counts(normalize=True)

#Output:
S    0.724409
C    0.188976
Q    0.086614

Moreover, value_counts ignores NaN (missing) values by default, and missing values are essential to check. To include them, set the dropna parameter to False:

train['Embarked'].value_counts(dropna=False)

#Output:
S      644
C      168
Q       77
NaN      2
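If you don’t have the Titanic data at hand, the three calls above can be tried on a tiny made-up column. This is only a sketch: the ports Series and its values are invented for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for something like 'Embarked'
ports = pd.Series(["S", "C", "S", "Q", "S", np.nan, "C"], name="port")

counts = ports.value_counts()                # NaN is dropped by default
shares = ports.value_counts(normalize=True)  # fractions of the non-missing rows
with_nan = ports.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares)
print(with_nan)
```

Calling .plot(kind="bar") on any of these results draws the same numbers as a bar chart, provided matplotlib is installed.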

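2. groupby()

With Pandas groupby, we can split and group our dataframe by certain columns to view patterns and details in the data. A groupby operation involves 3 main steps: splitting the data into groups, applying a function to each group, and combining the results. Passing numeric_only=True restricts the mean to numeric columns; Pandas 2.0 and later raise an error otherwise when the dataframe contains text columns.

train.groupby('Sex').mean(numeric_only=True)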

Here we grouped the dataframe by the feature 'Sex' and aggregated the numeric columns using their means.

You can also plot it using Pandas’ built-in visualization:

train.groupby('Sex').sum(numeric_only=True).plot(kind='bar')

We can also group by multiple features for hierarchical splitting.

train.groupby(['Sex', 'Survived'])['Survived'].count()
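The split, apply and combine steps can be sketched on a small made-up dataframe. The passengers frame below is an assumption for illustration: its column names mirror the Titanic columns used above, but the values are invented.

```python
import pandas as pd

# Hypothetical toy data; values are invented, column names mirror the Titanic ones
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Split by 'Sex', apply the mean to each numeric column, combine into one table
mean_by_sex = passengers.groupby("Sex").mean(numeric_only=True)

# Named aggregation: a different function per column
summary = passengers.groupby("Sex").agg(
    survival_rate=("Survived", "mean"),
    total_fare=("Fare", "sum"),
)

# Grouping by two columns gives a result with a two-level (hierarchical) index
counts = passengers.groupby(["Sex", "Survived"])["Survived"].count()

print(mean_by_sex)
print(summary)
print(counts)
```

Named aggregation with agg is handy when one summary function is not enough, for example a rate for one column and a total for another.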

3. loc and iloc

Indexing is one of the most basic operations in Pandas, and the best way to do it is with either loc or iloc. loc stands for location and is label-based, while the i in iloc stands for integer, so iloc is position-based. In other words, when you want to index a dataframe using the names or labels of columns/rows, you’d use loc. And when you want to index columns or rows using their integer positions, you’d use iloc. Let’s check out loc first.

train.loc[2, 'Sex']

The above operation gives us the element at row label 2 in the column 'Sex'. Similarly, if you need all the values of the 'Sex' column, you’d do:

train.loc[:, 'Sex']

You can also select multiple columns by passing a list of names:

train.loc[:, ['Sex', 'Embarked']]

You can also filter rows using boolean conditions within loc:

train.loc[train['Age'] >= 25]


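To view only certain rows, you can slice the dataframe using loc. Unlike ordinary Python slicing, loc includes both endpoints, so the following returns the rows labelled 100 through 200: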

train.loc[100:200]

Moreover, you can slice the dataframe on the column axis as:

train.loc[:, 'Sex':'Fare']

The above operation slices the dataframe from the column 'Sex' through 'Fare', both included, for all the rows.

Now, let’s move on to iloc. iloc indexes only by integer position, and its slices exclude the end position, just like ordinary Python slicing. You can slice dataframes like:

train.iloc[100:200, 2:9]


The above operation slices the rows at positions 100 to 199 and the columns at positions 2 through 8. Similarly, to take just the first 300 rows with all their columns, you can do:

train.iloc[:300, :]
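The difference between labels and positions is easiest to see when the row labels don’t start at 0. Here is a minimal sketch on a made-up dataframe; the frame, its index values and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame whose row labels deliberately differ from positions
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22, 38, 26, 35],
        "Fare": [7.25, 71.28, 7.92, 53.10],
    },
    index=[10, 11, 12, 13],
)

# loc: labels 10 through 12 and columns 'Sex' through 'Age', endpoints included
by_label = df.loc[10:12, "Sex":"Age"]

# iloc: positions 0 and 1 only, end position excluded
by_position = df.iloc[0:2, 0:2]

# loc with a boolean mask for rows and a list of names for columns
older = df.loc[df["Age"] >= 30, ["Sex", "Fare"]]

print(by_label)
print(by_position)
print(older)
```

Note that df.loc[0] would raise a KeyError here because no row is labelled 0, while df.iloc[0] returns the first row.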

4. unique() and nunique()

Pandas unique is used to get all the unique values of a feature. It is mostly used to list the categories of a categorical feature. unique returns the values in order of first appearance and includes NaN, treating it as a distinct value. Let’s take a look:

train['Sex'].unique()

#Output:
array(['male', 'female'], dtype=object)

As we see, it gives us the unique values in the 'Sex' feature.

Similarly, you can check the number of unique values, which is handy when a feature has too many distinct values to list.

train['Sex'].nunique()

#Output:
2

However, keep in mind that nunique() doesn’t count NaNs by default. To include them in the count, pass the dropna parameter as False. The 'Sex' column has no missing values, so its result would stay at 2; the 'Embarked' column, which has two missing entries, shows the difference:

train['Embarked'].nunique(dropna=False)

#Output:
4
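To see how unique and nunique treat missing values side by side, here is a small sketch on a made-up Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
ports = pd.Series(["S", "C", np.nan, "S", "Q"])

values = ports.unique()                   # order of first appearance, NaN included
n_without_nan = ports.nunique()           # NaN is not counted by default
n_with_nan = ports.nunique(dropna=False)  # NaN counted as one more value

print(values)
print(n_without_nan, n_with_nan)
```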

5. cut() and qcut()

Pandas cut is used to bin values into ranges in order to discretize a feature. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges the continuous values fall into. This comes in handy when you want to see trends based on the range a data point falls in.

Let’s understand this with a small example.

Suppose we have marks for 8 kids, ranging from 0 to 100. We can assign every kid’s marks to a particular “bin”.

import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Ck', 'Ron', 'Mat', 'Josh', 'Tim', 'SypherPK', 'Dew', 'Vin'],
    'Marks': [37, 91, 66, 42, 99, 81, 45, 71]
})

# Assign each mark to one of three labelled bins
df['marks_bin'] = pd.cut(df['Marks'], bins=[0, 50, 70, 100], labels=[1, 2, 3])

This appends the bin labels as a new feature, and the Marks feature can then be dropped if it is no longer needed. The new dataframe looks something like:

#Output:
       Name  Marks  marks_bin
0        Ck     37          1
1       Ron     91          3
2       Mat     66          2
3      Josh     42          1
4       Tim     99          3
5  SypherPK     81          3
6       Dew     45          1
7       Vin     71          3

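So, when I say bins=[0, 50, 70, 100], it means that there are 3 ranges, each excluding its left edge and including its right edge:

above 0 and up to 50 for bin 1,

above 50 and up to 70 for bin 2, and

above 70 and up to 100 for bin 3.

Because the left edge is excluded by default, a mark of exactly 0 would fall outside every bin and come back as NaN unless you pass include_lowest=True.

So, now our new feature contains not the marks themselves but the bin into which each student’s marks fall.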
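How the bin edges are treated is worth checking on values that sit exactly on them. The sketch below uses made-up marks chosen to land on or next to each edge; cut closes each bin on the right by default, and include_lowest=True additionally lets the first bin contain its left edge.

```python
import pandas as pd

# Hypothetical marks chosen to sit on or right next to the bin edges
marks = pd.Series([0, 50, 51, 70, 71, 100])
edges = [0, 50, 70, 100]

# Default: each bin excludes its left edge and includes its right edge
default_bins = pd.cut(marks, bins=edges, labels=[1, 2, 3])

# include_lowest=True makes the first bin include its left edge as well
with_lowest = pd.cut(marks, bins=edges, labels=[1, 2, 3], include_lowest=True)

# Without labels, cut returns the intervals themselves
intervals = pd.cut(marks, bins=edges)

print(default_bins.tolist())
print(with_lowest.tolist())
print(intervals)
```

With the default settings the mark 0 comes back as NaN, which is easy to miss when a feature can legitimately take the lowest value.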

Similar to cut(), Pandas also offers a companion function called qcut(). Pandas qcut takes the number of quantiles and assigns the data points to bins based on the data distribution, so that each bin holds roughly the same number of points. So, we can just change the cut function in the above to qcut:

df['marks_bin'] = pd.qcut(df['Marks'], q=3, labels=[1, 2, 3])

In the above operation, we tell Pandas to cut the feature into 3 bins holding roughly equal numbers of data points and assign them the labels. The output comes as:

       Name  Marks  marks_bin
0        Ck     37          1
1       Ron     91          3
2       Mat     66          2
3      Josh     42          1
4       Tim     99          3
5  SypherPK     81          3
6       Dew     45          1
7       Vin     71          2

Notice how the last value changed from 3 to 2: qcut derives its bin edges from the data, so 71 no longer falls in the top bin.
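To see where qcut actually placed the edges, you can ask for them with retbins=True. This is a small sketch using the same eight marks as above; the variable names are just for illustration.

```python
import pandas as pd

# The same eight marks used in the example above
marks = pd.Series([37, 91, 66, 42, 99, 81, 45, 71])

# retbins=True also returns the bin edges that qcut computed from the data
bins, edges = pd.qcut(marks, q=3, labels=[1, 2, 3], retbins=True)

# Number of data points that landed in each bin
per_bin = bins.value_counts().sort_index()

print(edges)
print(per_bin)
```

The two inner edges come out at roughly 52 and 77.67, so 71 falls in the middle bin. With 8 values and 3 bins the split cannot be perfectly even, and the bins end up holding 3, 2 and 3 marks.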

Before you go

We looked at some of the most used Pandas functions. These are not the only important ones, and we’d encourage you to learn more of the commonly used functions. This is an efficient approach, since you are unlikely to need every function that Pandas has, only a handful of them.

Why is the Pandas library so popular?

This library is indeed quite popular among data scientists and data analysts. The reason is its support for a large number of file formats and its rich collection of features for manipulating the loaded data. It also integrates easily with other libraries and packages such as NumPy.

It provides many useful functions for manipulating large data sets in a flexible manner. Once you have mastered it, you can accomplish substantial tasks with a few lines of code.

What is the merge function and why is it used?

The merge function is a Pandas function used to combine the rows of 2 data frames based on matching values in one or more key columns. It is an in-memory join operation that resembles the joins of relational databases. You can use on = Column Name to merge data frames on a column they have in common.

When the key columns have different names in the two data frames, you can use left_on = Column Name and right_on = Column Name to specify the key from the left and right data frame respectively.
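A short sketch of both cases follows. The two data frames, their column names and the how="left" choice are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical tables sharing a key column
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["Ck", "Ron", "Mat"]})
marks = pd.DataFrame({"student_id": [1, 2, 4], "marks": [37, 91, 45]})

# Same key name in both frames: use on=. The default is an inner join,
# so only ids present in both frames (1 and 2) survive
same_key = pd.merge(students, marks, on="student_id")

# Different key names: use left_on= and right_on=.
# how="left" keeps every student, filling missing marks with NaN
scores = marks.rename(columns={"student_id": "sid"})
diff_keys = pd.merge(students, scores, left_on="student_id", right_on="sid", how="left")

print(same_key)
print(diff_keys)
```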

Apart from Pandas library, what are the other Python libraries for data science?

Apart from the Pandas library, there are a number of Python libraries that are widely used for data science. These include PySpark, TensorFlow, Matplotlib, scikit-learn, SciPy and many more. Each of them is widely used for its own features and functions.

Every library has its own role: scikit-learn, for example, is mostly used for machine learning tasks such as classification and regression, while Matplotlib is used for plotting. Apart from analysing the data, you can also create visual reports using the functions these libraries provide.
