The pandas library is a long-standing favorite among data scientists and analysts because it is easy to use, offers a wide range of functionality, and makes results easier to interpret. Anyone starting out in data science is advised to get a good command of pandas and to build pipelines that reduce the manual effort of cleaning and preprocessing data.
Pandas is built on top of NumPy, which lets many operations run as fast, vectorized code and gets the work done in less time. In this article, we will share some underrated pandas functions that can improve the quality of your project's code.
Before moving ahead, here is a quick legend:
- All the commands assume that the data frame is named 'df' and is a pd.DataFrame object.
- The pandas library has been imported under the alias 'pd'.
String or text data makes up a major part of many datasets. Whether it is information about the author, title, or publication of a book, or tweets posted under a particular hashtag, there is a lot of text data around, and it comes in handy once it has been cleaned properly and fed to a classifier such as Naive Bayes. Here are some tricks you can apply:
- To access string data, use the 'str' accessor. For example, df['column_name'].str
- This makes it possible to apply string operations to every value in the selected column.
- Some common operations include:
- df['column_name'].str.len(): Returns the length of each string.
- .str.split(): Splits each string at a particular character.
- .str.contains(): Returns True/False depending on whether the given word or pattern is present in each string.
- .str.count(): Returns, for each string, the number of times the regular expression passed matches.
- .str.findall(): Returns, for each string, a list of all matches of the expression passed.
- .str.replace(): Finds matches in the same way as findall, but replaces the matched items with the given string.
- Most Python string methods, such as .title(), .isalpha(), .isalnum(), and .isdecimal(), are also available through the accessor.
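The string operations above can be sketched as follows. This is a minimal illustration, not taken from the original article: the 'title' column and its three values are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data; the column name "title" is an assumption.
df = pd.DataFrame(
    {"title": ["pandas in action", "Data Science 101", "naive bayes notes"]}
)
titles = df["title"]

lengths = titles.str.len()                          # length of each string
words = titles.str.split(" ")                       # list of words per row
has_data = titles.str.contains("data", case=False)  # True/False per row
a_count = titles.str.count("a")                     # matches of "a" per row
digits = titles.str.findall(r"\d+")                 # list of digit runs per row
masked = titles.str.replace(r"\d+", "#", regex=True)  # replace digit runs
titled = titles.str.title()                         # Title Case each string

print(pd.DataFrame({"length": lengths, "has_data": has_data, "a_count": a_count}))
```

Every call returns a new Series aligned with the original rows, so the results can be assigned back as new columns.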
Dates and times are commonly present in datasets in the form of timestamps, start times, end times, or any other timing associated with an event. It is useful to parse this data properly because it reveals trends along a timeline that can be used to predict future events, which is known as time-series analysis. Let's see some useful commands:
- To access DateTime data, first convert the column from its current data type (date values are often read in as strings, i.e. the object dtype) to DateTime using the pd.to_datetime() function.
- Now, using the '.dt' accessor, we can access any DateTime information required, such as:
- df['column_name'].dt.day: Returns the day of the date.
- .dt.time: Returns the time component.
- .dt.year: Returns the year of the date.
- .dt.month: Returns the month of the date.
- .dt.weekday: Returns the day of the week in numerical form, where 0 represents Monday and 6 represents Sunday. If you want day names, use .dt.day_name()
- .dt.is_month_start: Returns True/False depending on whether the date is the first day of the month.
- .dt.is_month_end: Same as is_month_start, but checks for the last day of the month.
- .dt.quarter: Returns the quarter in which the date lies.
- .dt.is_quarter_start: Returns True/False depending on whether the date is the first day of the quarter.
- .dt.is_quarter_end: Returns True/False depending on whether the date is the last day of the quarter.
- .dt.normalize(): When the time component does not contribute to the analysis, it can be ignored. This method resets the time of every value to midnight, i.e. 00:00:00.
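The DateTime workflow above can be sketched like this. The 'event_time' column and its three timestamps are hypothetical sample values chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data: timestamps stored as plain strings.
df = pd.DataFrame(
    {"event_time": ["2021-01-01 09:30:00", "2021-03-31 18:45:10", "2021-07-04 00:00:00"]}
)

# Step 1: convert the string column to a DateTime column.
df["event_time"] = pd.to_datetime(df["event_time"])
ts = df["event_time"]

# Step 2: read components through the .dt accessor.
days = ts.dt.day                  # day of the month
months = ts.dt.month              # month number
weekdays = ts.dt.weekday          # 0 = Monday ... 6 = Sunday
day_names = ts.dt.day_name()      # note: a method, so it needs parentheses
quarters = ts.dt.quarter          # 1 to 4
month_start = ts.dt.is_month_start
quarter_end = ts.dt.is_quarter_end
dates_only = ts.dt.normalize()    # same dates with the time set to 00:00:00

print(pd.DataFrame({"weekday": weekdays, "name": day_names, "quarter": quarters}))
```

Note that most entries are attributes (no parentheses), while day_name() and normalize() are methods.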
Plotting visualizations is one of the key components of data analysis and plays a major role in feature engineering. For example, outliers in a dataset can be detected using box plots, which show the median and interquartile range and leave outliers as separate points at the extreme ends.
Plotting is mostly done via other libraries such as seaborn, plotly, bokeh, and matplotlib, but what if you want to visualize data instantly without explicitly setting up those libraries? Pandas has a solution. Using the df.plot() method, you can plot graphs directly; pandas draws them internally using matplotlib. The options available include:
- df.plot() or df['column_name'].plot() (depending upon the type of graph)
- df.plot() has a 'kind' parameter that defines the graph. By default, it is a 'line' plot, but other options include 'bar', 'barh', 'box', 'hist', 'kde', etc.
- It uses the matplotlib backend: the call returns a matplotlib Axes object, and an existing Axes can be passed in through the 'ax' parameter, so matplotlib's own methods remain available for further customization.
- The .plot() method also accepts arguments such as 'title', 'xticks', 'xlim', 'xlabel', 'fontsize', and 'colormap', which reduces the need to call external libraries directly.
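A minimal sketch of the plotting options above, assuming matplotlib is installed. The 'sales' and 'returns' columns are made-up sample data, and the 'Agg' backend is selected only so that the script also runs on machines without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data with one obvious outlier in "sales".
df = pd.DataFrame({"sales": [10, 12, 9, 15, 40], "returns": [1, 2, 1, 3, 2]})

# Default kind is "line"; the call returns a matplotlib Axes object.
ax = df.plot(kind="line", title="Sales and returns")
ax.set_ylabel("units")  # plain matplotlib method on the returned Axes

# A box plot of a single column, drawn on an Axes we create ourselves
# and hand over through the "ax" parameter.
fig, box_axes = plt.subplots()
df["sales"].plot(kind="box", ax=box_axes, title="Sales distribution")
```

In a notebook the figures are displayed automatically; in a script, call fig.savefig() or plt.show() with an interactive backend to see them.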
- pd.get_dummies(): While preprocessing data, we sometimes encounter categorical data that needs to be converted into numerical form before it can be fed to a model. When the number of categories is fairly low, one-hot encoding is preferred, but doing this manually takes a long time. This function creates one indicator column per category and replaces the original categorical column with them. If drop_first is set to True, it also drops the indicator column of the first category, since that category is implied when all the other indicators are 0.
- df.query(): This function lets you filter the data frame with a condition written as a string. The basic difference between this and normal masking is that it directly returns the matching rows instead of a boolean mask, which saves the effort of creating the mask and then applying it to the data frame.
- df.select_dtypes(): Sometimes we need to perform a specific task on columns of one data type. For example, when data is read from external files, some columns end up with the object dtype. After cleaning, the dataset must have the correct data types throughout, and converting columns one by one with df.astype('data-type') becomes tedious when there are many of them. This function selects the columns of the specified data type, and its result can be passed straight to .apply(), for example to convert every object column with one function in a single expression.
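The three functions above can be sketched as follows. All column names and values are hypothetical, and pd.to_numeric is just one possible function to pass to .apply(); the original sample code may have used a different one.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Pune", "Mumbai"], "rating": [4.5, 3.9, 4.8, 4.1]}
)

# One-hot encode "city"; drop_first=True drops the first category's column
# (here "Delhi"), leaving one column fewer than the number of categories.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)

# query() returns the matching rows directly...
high_rated = df.query("rating > 4 and city == 'Pune'")
# ...which is equivalent to building a boolean mask and applying it.
masked = df[(df["rating"] > 4) & (df["city"] == "Pune")]

# Numbers that were read in as text, stored with the object dtype.
raw = pd.DataFrame({"price": ["100", "250"], "quantity": ["1", "3"]}, dtype="object")
raw["rating"] = [4.5, 3.9]

# Select every object column and convert it, in one chained expression.
converted = raw.select_dtypes(include="object").apply(pd.to_numeric)

print(encoded.columns.tolist())
print(converted.dtypes)
```

The converted columns can then be written back with raw[converted.columns] = converted if the original frame should be updated.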
Writing several method calls one after another in a single expression, as with select_dtypes() followed by .apply(), is referred to as method chaining. It is very common in data science work because it removes the need to define a variable for every intermediate step.