pandas is an open source data analysis package developed for Python. It is designed to be easy to use, efficient, and convenient for real-world, practical data analysis. For this reason, it is one of the more powerful and widely used tools amongst data scientists. The library can be found here, where you will also find additional documentation and installation instructions.
pandas is designed for tabular datasets (similar to those used in SQL or Excel) that contain observational data. It makes cleaning data and computing summary statistics relatively easy. Additionally, pandas allows you to merge, filter, group, sort, and join data with simple, intuitive syntax. This article is an introduction to the pandas library, in which I will demonstrate several of these capabilities and show how little code they require. Before starting, I recommend downloading the pandas cheat sheet that can be found here. It contains most of the basic functions and the corresponding syntax. As a data scientist, this makes your life a lot easier, as there is no need to memorize everything. As you become more familiar with pandas, its structure will start to feel intuitive.
Exploring Data
First off, we’ll import some data. The UCI Machine Learning Repository has a myriad of datasets ready to use. The wine dataset is what we will be using today. It contains 178 observations of wine grown in the same region in Italy. Each observation is from one of three cultivars (the ‘Class’ feature), and also has 13 constituent features that are the result of a chemical analysis.
import pandas as pd
wine_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
              'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
              'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315', 'Proline']
wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', names=wine_names)
wine_df = pd.DataFrame(wine_data)
In this first chunk, we import the pandas library and the wine dataset, then store the dataset in a pandas DataFrame. (Strictly speaking, pd.read_csv already returns a DataFrame, so the final pd.DataFrame(...) call is redundant but harmless.) There are two primary data structures in pandas: a Series (one-dimensional) and a DataFrame (two-dimensional). Nearly all datasets can be represented with these two structures. One of the more useful methods is .head(n). This truncates the output to the first n observations, so you can preview as many rows as you like. You can also use .sample(n), which will output n random observations.
wine_df.head(5)
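To make the difference between .head(n), .sample(n), and the two data structures concrete, here is a minimal sketch. It uses a small, made-up table (toy_df is a hypothetical stand-in, not the real wine data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine dataset.
toy_df = pd.DataFrame({
    'Class': [1, 1, 2, 2, 3, 3],
    'Alcohol': [14.2, 13.2, 12.4, 12.3, 12.9, 13.7],
})

# First three rows, in their original order.
first_three = toy_df.head(3)

# Two randomly chosen rows; random_state makes the draw repeatable.
random_two = toy_df.sample(2, random_state=0)

# Selecting a single column returns a one-dimensional Series.
alcohol = toy_df['Alcohol']

print(type(toy_df).__name__, type(alcohol).__name__)
print(first_three)
```

Passing random_state is optional; I include it here only so the sampled rows stay the same between runs.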
Some other useful functions:
len() and .nunique()
len(wine_df)
178
wine_df['Class'].nunique()
3
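Two close relatives of these functions are worth knowing: .shape gives the number of rows and columns at once, and .value_counts() shows how many observations fall into each distinct value. The sketch below uses a hypothetical toy table (toy_df) rather than the wine data.

```python
import pandas as pd

# Hypothetical toy data: six observations from three classes.
toy_df = pd.DataFrame({
    'Class': [1, 1, 2, 2, 2, 3],
    'Alcohol': [14.2, 13.2, 12.4, 12.3, 12.9, 13.7],
})

# .shape is a (rows, columns) tuple; len() gives only the row count.
n_rows, n_cols = toy_df.shape

# Number of distinct classes, and the number of rows in each class.
n_classes = toy_df['Class'].nunique()
class_counts = toy_df['Class'].value_counts()

print(n_rows, n_cols, n_classes)
print(class_counts)
```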
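The .describe() method outputs basic summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the features in the DataFrame. By default it covers only the numeric columns, which here is all of them, so it is most useful when dealing with numerical data.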
wine_df.describe()
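On a DataFrame that mixes numeric and text columns, .describe() summarizes only the numeric ones unless told otherwise. Here is a small sketch of that behavior, using a made-up table (toy_df and its 'Cultivar' column are hypothetical and not part of the wine dataset).

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column.
toy_df = pd.DataFrame({
    'Alcohol': [12.0, 13.0, 14.0],
    'Cultivar': ['a', 'b', 'a'],
})

# Default: only numeric columns are summarized.
numeric_summary = toy_df.describe()

# include='all' adds non-numeric columns (unique, top, freq).
full_summary = toy_df.describe(include='all')

print(numeric_summary)
print(full_summary)
```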
Manipulating Data
One of the more convenient features of pandas is method chaining. For users who are familiar with R, this is similar to the pipe operator %>%. Because most pandas methods return a new DataFrame or Series, you can call the next method directly on the result. For example, if you wanted to find the mean value of each feature for each of the 3 cultivars, you would first use .groupby('Class'), then .mean(). This can all be done in one line.
wine_df.groupby('Class').mean()
You can then chain additional methods, such as .plot.scatter():
wine_df.groupby('Class').mean().plot.scatter(x = 'Alcohol', y = 'Proline')
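Longer chains get hard to read on a single line. One common way to lay them out, sketched below on a hypothetical toy table (toy_df), is to wrap the whole chain in parentheses and put one method per line. The sketch leaves out plotting so that it runs without a plotting backend.

```python
import pandas as pd

# Hypothetical toy data: two observations per class.
toy_df = pd.DataFrame({
    'Class': [1, 1, 2, 2, 3, 3],
    'Alcohol': [14.0, 13.0, 12.0, 12.5, 13.0, 13.5],
    'Proline': [1000, 1100, 500, 600, 700, 800],
})

# Each step returns a new DataFrame that the next step operates on.
class_means = (
    toy_df
    .groupby('Class')                         # split rows by class
    .mean()                                   # average every feature per class
    .sort_values('Proline', ascending=False)  # highest mean Proline first
)

print(class_means)
```

After the groupby, 'Class' becomes the index of the result, which is why it can be used for lookups such as class_means.loc[2].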
pandas has built-in plotting features that can assist the user with the data exploration process. This is often useful with much larger datasets, where the physical significance of the features is unknown. With pandas, it is easy to sort through data (numerically or alphabetically) and rename the features.
wine_df.sort_values('Proline',ascending=False).head(5)
wine_df.rename(columns= {'Alcohol':'ABV'}).head(5)
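One detail that is easy to miss: by default, .sort_values() and .rename() return a new DataFrame and leave the original untouched, so you need to assign the result if you want to keep it. A minimal sketch on a hypothetical toy table (toy_df):

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    'Alcohol': [13.0, 14.5, 12.0],
    'Proline': [700, 1200, 500],
})

# Both calls return new DataFrames; toy_df itself is not modified.
sorted_df = toy_df.sort_values('Proline', ascending=False)
renamed_df = toy_df.rename(columns={'Alcohol': 'ABV'})

print(sorted_df)
print(renamed_df.columns.tolist())
print(toy_df.columns.tolist())  # still the original column names
```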
There are a number of ways to extract specific features: .loc selects columns by name, while .iloc selects them by integer position (counting from 0).
wine_df.loc[:,'Class':'Magnesium'].head(5)
wine_df.iloc[:,[1,2,5]].head(5)
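The two indexers treat slices differently: a label slice with .loc includes its end point, while a positional slice with .iloc excludes it, as in ordinary Python slicing. The sketch below shows this on a hypothetical four-column toy table (toy_df).

```python
import pandas as pd

# Hypothetical toy data with four columns.
toy_df = pd.DataFrame({
    'Class': [1, 2],
    'Alcohol': [14.2, 12.4],
    'Malic acid': [1.7, 0.9],
    'Ash': [2.4, 1.4],
})

# Label slice: the end label 'Malic acid' IS included.
by_label = toy_df.loc[:, 'Class':'Malic acid']

# Positional slice: position 2 is NOT included (columns 0 and 1 only).
by_position = toy_df.iloc[:, 0:2]

# A list of positions picks out non-adjacent columns.
picked = toy_df.iloc[:, [1, 3]]

print(by_label.columns.tolist())
print(by_position.columns.tolist())
print(picked.columns.tolist())
```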
pandas also allows condition-based extraction. Here, we extract the observations that have alcohol levels above 14% and view only their Ash and Alcalinity of ash features.
wine_df.loc[wine_df['Alcohol'] > 14, ['Ash','Alcalinity of ash']].head(5)
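Conditions can also be combined. The comparison produces a boolean Series, and several of these can be joined with & (and) or | (or); each comparison needs its own parentheses when written inline. A minimal sketch on a hypothetical toy table (toy_df):

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    'Class': [1, 1, 2, 3],
    'Alcohol': [14.2, 13.2, 14.1, 14.5],
    'Ash': [2.4, 2.1, 2.3, 2.6],
})

# Each comparison yields a boolean Series, one value per row.
strong = toy_df['Alcohol'] > 14
class_one = toy_df['Class'] == 1

# Rows where both conditions hold, keeping two columns.
subset = toy_df.loc[strong & class_one, ['Alcohol', 'Ash']]

# Rows where at least one condition holds; note the parentheses.
either = toy_df.loc[(toy_df['Alcohol'] > 14) | (toy_df['Class'] == 1)]

print(subset)
print(len(either))
```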
These are just a few of the basic capabilities of the pandas library. As a data scientist, they make the job of data wrangling a much simpler task. Capable of performing sophisticated grouping, filtering, and joining operations in single lines, pandas is a powerful tool for preparing data for a diverse set of machine learning algorithms.