Pandas vs. Polars - Which Library Holds the Edge?
A head-to-head comparison of speed and memory efficiency between two data manipulation libraries✨
I have been working with Python libraries for a long time, and one of my go-to libraries for data manipulation and analysis has always been Pandas. Since 2021, Pandas has been my trusted companion, helping me tackle countless data-related challenges. However, a new player has entered the field: Polars. Throughout this post, I will compare the two libraries and share my experience with both, discussing their features, performance, and ease of use. So, which one will come out on top? Let’s find out!
Introduction to Pandas and Polars
Pandas: My Old Friend
Pandas’ intuitive API and flexibility have made it an indispensable tool in my data analysis workflow. With Pandas, I can effortlessly handle missing data, perform complex joins, and reshape datasets.
import pandas as pd
# Create a sample dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
'Age': [25, None, 30, 35]}
df = pd.DataFrame(data)
# Fill missing values with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
Polars: The New Challenger
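Polars, on the other hand, is a newer library that’s been gaining traction. It is written in Rust on top of the Apache Arrow columnar memory format and runs many operations in parallel, which is where its reputation for faster performance and more efficient memory usage comes from. Much of the Polars API will look familiar to Pandas users, although it is built around expressions such as pl.col("Age") rather than index-based access, so the transition takes a little adjustment.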
import polars as pl
# Create a sample dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
'Age': [25, None, 30, 35]}
df = pl.DataFrame(data)
# Fill missing values with the mean age
df = df.with_columns(pl.col("Age").fill_null(pl.col("Age").mean()))
print(df)
Polars is available on PyPI, and you can install it with pip. Open a terminal or command prompt, create a new virtual environment, and then run the following command to install Polars:
$ python -m pip install polars
Loading Data with Pandas vs. Polars
In Pandas, we use pd.read_csv to load the dataset.
# load data
df_pandas = pd.read_csv("data.csv")
# print the head
print(df_pandas.head())
In Polars, we use pl.read_csv to load the dataset.
# load data
df_polars = pl.read_csv("data.csv")
# print head
print(df_polars.head())
Creating a DataFrame
Pandas: The data is stored in a dictionary, and pd.DataFrame(data) converts the dictionary into a DataFrame.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
Polars: Similar to Pandas, pl.DataFrame(data) converts the dictionary into a DataFrame.
import polars as pl
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pl.DataFrame(data)
Filtering Rows
Pandas: df['Age'] > 30 creates a boolean Series where each element is True if the corresponding Age is greater than 30 and False otherwise. The DataFrame is then filtered by selecting only those rows where the condition is True.
df = df[df['Age'] > 30]
Polars: pl.col("Age") > 30 is the Polars equivalent of the condition used in Pandas. It selects the "Age" column and applies the greater-than condition. The filter() method in Polars applies the condition and returns a new DataFrame containing only the rows where the condition is met.
df = df.filter(pl.col("Age") > 30)
Grouping and Aggregating
Pandas: df.groupby('Name') groups the DataFrame by the "Name" column, so all rows with the same "Name" are grouped. ['Age'].mean() calculates the mean of the "Age" column within each group. The result is a Series with the "Name" as the index and the mean age as the values.
df.groupby('Name')['Age'].mean()
Polars: df.group_by("Name") groups the DataFrame by the "Name" column, similar to Pandas. Recent Polars releases spell the method group_by(); the older groupby() spelling is deprecated. The agg() method performs the aggregation. Inside it, pl.col("Age").mean() specifies that the mean of the "Age" column should be calculated for each group. The result is a new DataFrame with the group names and the corresponding aggregated values.
df.group_by("Name").agg(pl.col("Age").mean())
Handling Missing Values
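Pandas: The fillna() method fills in missing values in a DataFrame. Inside fillna(), df.mean(numeric_only=True) calculates the mean of each numeric column, and the missing values in each of those columns are replaced with the column's mean. Passing numeric_only=True matters when the DataFrame also holds text columns such as "Name": in pandas 2.0 and later, df.mean() no longer silently skips them and raises a TypeError instead.
df.fillna(df.mean(numeric_only=True))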
Polars: In Polars, fill_null() is used to fill missing (null) values. The expression pl.col("*").mean() selects all columns (*) and computes their means. Missing values in each column are replaced by the mean of that column.
df.fill_null(pl.col("*").mean())
Merging DataFrames
Pandas: Two DataFrames, df1 and df2, are created. The merge() function combines df1 and df2 based on the "Name" column.
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Country': ['USA', 'UK']})
pd.merge(df1, df2, on='Name')
Polars: In Polars, the join() method merges df1 and df2 based on the "Name" column, similar to Pandas' merge().
df1 = pl.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pl.DataFrame({'Name': ['Alice', 'Bob'], 'Country': ['USA', 'UK']})
df1.join(df2, on="Name")
Sorting
Pandas: The sort_values() method sorts the DataFrame based on the "Age" column. By default, it sorts in ascending order.
df.sort_values(by='Age')
Polars: In Polars, the sort() method is used to sort the DataFrame by the "Age" column. Like Pandas, the sorting is done in ascending order by default.
df.sort(by="Age")
Selecting Columns
Pandas: In Pandas, you can select multiple columns by passing a list of column names within double brackets. This returns a new DataFrame containing only the “Name” and “Age” columns.
df[['Name', 'Age']]
Polars: Polars uses the select() method to choose specific columns. The method takes a list of column names and returns a DataFrame with only those columns.
df.select(["Name", "Age"])
Appending Rows
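Pandas: A new DataFrame with one row containing “Dave” and his age is created. Older tutorials use df.append(new_row) here, but DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. The replacement is pd.concat(), which takes a list of DataFrames and returns a new DataFrame with the rows stacked; df itself is not modified. Passing ignore_index=True renumbers the rows instead of keeping each frame's original index.
new_row = pd.DataFrame({'Name': ['Dave'], 'Age': [40]})
pd.concat([df, new_row], ignore_index=True)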
Polars: Similar to Pandas, a new DataFrame with one row is created. In Polars, the vstack() method is used to vertically stack new_row onto the original DataFrame df. This operation returns a new DataFrame with the additional row.
new_row = pl.DataFrame({'Name': ['Dave'], 'Age': [40]})
df.vstack(new_row)
Pivoting
Pandas: The pivot() function in Pandas reshapes the DataFrame. The index parameter sets the new index, columns sets the new column labels, and values sets the data used to populate the new DataFrame. The example assumes df has "Name", "Age", and "Country" columns.
df.pivot(index='Name', columns='Age', values='Country')
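Polars: Pivoting in Polars is very similar to Pandas. The pivot() method takes index and values parameters like its Pandas counterpart. The parameter naming the column whose values become the new column headers is called on in Polars 1.0 and later; earlier releases called it columns, as Pandas does.
df.pivot(on="Age", index="Name", values="Country")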
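Pivoting is easier to follow on data where each index value appears more than once. The sketch below uses Pandas only, with a hypothetical long-format table of scores per year (the column names and numbers are invented), and turns each year into its own column.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Name, Year) pair
long_df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob", "Bob"],
    "Year": [2023, 2024, 2023, 2024],
    "Score": [80, 85, 70, 75],
})

# One row per Name, one column per Year, cells filled from Score
wide_df = long_df.pivot(index="Name", columns="Year", values="Score")

print(wide_df)
print(wide_df.loc["Alice", 2024])
```

Note that pivot() expects each (index, columns) pair to be unique; duplicated pairs would need an aggregation step first.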
Reshaping Data
Pandas:
pd.melt(): The melt() function in Pandas is used to transform the DataFrame from wide to long format.
id_vars: The column(s) you want to keep as identifiers in the long format (in this case, ‘Name’).
value_vars: The columns you want to unpivot into rows (in this case, ‘Age’ and ‘Country’).
import pandas as pd
# Example DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Country': ['USA', 'UK', 'Canada']
})
# Melting the DataFrame
melted_df = pd.melt(df, id_vars='Name', value_vars=['Age', 'Country'])
print(melted_df)
Polars:
df.unpivot(): Polars provides a similar method to transform a DataFrame from wide to long format. It was called melt() in older releases; Polars 1.0 renamed it to unpivot() and deprecated the old name.
index: The column(s) to keep as identifiers (‘Name’ in this example). Under melt() this parameter was id_vars.
on: The columns that are unpivoted into a single column (‘Age’ and ‘Country’ here). Under melt() this parameter was value_vars.
import polars as pl
# Example DataFrame
df = pl.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Country': ['USA', 'UK', 'Canada']
})
# Cast Age to string so both value columns share one dtype, then unpivot
melted_df = df.with_columns(pl.col("Age").cast(pl.Utf8)).unpivot(on=["Age", "Country"], index="Name")
print(melted_df)
I have shown the similarities and differences between Pandas and Polars with practical examples. Pandas has been my trusted companion for years, and its extensive community support and intuitive syntax make it hard to let go. However, Polars offers impressive performance improvements and a modern approach to data handling.