Pandas vs. Polars - Which Library Holds the Edge?

A head-to-head comparison of speed and memory efficiency between two data manipulation libraries✨

Aug 30, 2024

I have been working with Python libraries for a long time, and my go-to library for data manipulation and analysis has always been Pandas. Since 2021, Pandas has been my trusted companion, helping me tackle countless data-related challenges. However, a new player has entered the field: Polars. Throughout this post, I will compare the two libraries and share my experience with both, discussing their features, performance, and ease of use. So, which one will come out on top? Let’s find out!

Introduction to Pandas and Polars

Pandas: My Old Friend

Pandas’ intuitive API and flexibility have made it an indispensable tool in my data analysis workflow. With Pandas, I can effortlessly handle missing data, perform complex joins, and reshape datasets.

import pandas as pd

# Create a sample dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
        'Age': [25, None, 30, 35]}
df = pd.DataFrame(data)

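# Fill missing values with the mean age (assign the result back instead of
# calling inplace=True on a column selection, which newer pandas versions warn about)
df['Age'] = df['Age'].fillna(df['Age'].mean())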
print(df)

Polars: The New Challenger

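Polars, on the other hand, is a newer library that’s been gaining traction. It is written in Rust on top of the Apache Arrow columnar memory model, runs many operations in parallel across CPU cores, and offers a lazy API that can optimize a query before executing it, which is where its claims of faster performance and more efficient memory usage come from. Many Polars operations have close Pandas counterparts, which eases the transition for Pandas users, although the syntax is built around expressions such as pl.col() rather than index-based selection.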

import polars as pl

# Create a sample dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
        'Age': [25, None, 30, 35]}
df = pl.DataFrame(data)

# Fill missing values in the Age column with the mean age
df = df.with_columns(pl.col("Age").fill_null(pl.col("Age").mean()))
print(df)

Polars is available on PyPI, and you can install it with pip. Open a terminal or command prompt, create a new virtual environment, and then run the following command to install Polars:

$ python -m pip install polars

Loading data with Pandas vs. Polars

In Pandas, we use pd.read_csv to load the dataset.

# load data
df_pandas = pd.read_csv("data.csv")

# print the head
print(df_pandas.head())

In Polars, we use pl.read_csv to load the dataset.

# load data
df_polars = pl.read_csv("data.csv")

# print head
print(df_polars.head())

Creating a DataFrame

Pandas: The data is stored in a dictionary, and pd.DataFrame(data) converts that dictionary into a DataFrame, with each key becoming a column name.

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

Polars: Similarly, pl.DataFrame(data) converts the dictionary into a DataFrame.

import polars as pl
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pl.DataFrame(data)

Filtering Rows

Pandas: df['Age'] > 30 creates a boolean Series where each element is True if the corresponding Age is greater than 30 and False otherwise. The DataFrame is then filtered by selecting only those rows where the condition is True.

df = df[df['Age'] > 30]

Polars: pl.col("Age") > 30 is the Polars equivalent of the condition used in Pandas. It selects the "Age" column and applies the greater-than condition. The filter() method in Polars applies the condition and returns a new DataFrame containing only the rows where the condition is met.

df = df.filter(pl.col("Age") > 30)

Grouping and Aggregating

Pandas: df.groupby('Name') groups the DataFrame by the "Name" column, so all rows with the same "Name" are grouped. ['Age'].mean() calculates the mean of the "Age" column within each group. The result is a Series with the "Name" as the index and the mean age as the values.

df.groupby('Name')['Age'].mean()

Polars: df.group_by("Name") groups the DataFrame by the "Name" column, similar to Pandas (older Polars releases spelled this method groupby). The agg() method performs the aggregation. Inside it, pl.col("Age").mean() specifies that the mean of the "Age" column should be calculated for each group. The result is a new DataFrame with the group names and the corresponding aggregated values; the order of the groups is not guaranteed unless you sort the result or pass maintain_order=True.

df.group_by("Name").agg(pl.col("Age").mean())

Handling Missing Values

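Pandas: The fillna() method fills in missing values in a DataFrame. Here, df.mean(numeric_only=True) calculates the mean of each numeric column, and fillna() replaces the missing values in each of those columns with the column's mean. The numeric_only=True argument matters when the DataFrame also holds text columns such as "Name": without it, pandas 2.0 and later raise a TypeError.

df.fillna(df.mean(numeric_only=True))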

Polars: In Polars, fill_null() fills missing (null) values. The expression pl.col("*") selects every column and .mean() computes each column's mean, so missing values in each column are replaced by the mean of that column. As with Pandas, this is only meaningful for numeric columns.

df.fill_null(pl.col("*").mean())

Merging DataFrames

Pandas: This example creates two DataFrames, df1 and df2. The merge() function combines them on the "Name" column, performing an inner join by default.

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Country': ['USA', 'UK']})
pd.merge(df1, df2, on='Name')

Polars: In Polars, the join() method merges df1 and df2 based on the "Name" column, similar to Pandas' merge().

df1 = pl.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pl.DataFrame({'Name': ['Alice', 'Bob'], 'Country': ['USA', 'UK']})
df1.join(df2, on="Name")

Sorting

Pandas: The Pandas sort_values() method sorts the DataFrame based on the "Age" column. By default, it sorts in ascending order.

df.sort_values(by='Age')

Polars: In Polars, the sort() method is used to sort the DataFrame by the "Age" column. Like Pandas, the sorting is done in ascending order by default.

df.sort(by="Age")

Selecting Columns

Pandas: In Pandas, you can select multiple columns by passing a list of column names within double brackets. This returns a new DataFrame containing only the “Name” and “Age” columns.

df[['Name', 'Age']]

Polars: Polars uses the select() method to choose specific columns. The method takes a list of column names and returns a DataFrame with only those columns.

df.select(["Name", "Age"])

Appending Rows

Pandas: A new DataFrame with one row containing “Dave” and his age is created. Older Pandas versions offered an append() method for this, but it was deprecated in 1.4 and removed in 2.0, so the row is added with pd.concat() instead. concat() does not modify df in place; it returns a new DataFrame with the appended row, and ignore_index=True renumbers the index.

new_row = pd.DataFrame({'Name': ['Dave'], 'Age': [40]})
pd.concat([df, new_row], ignore_index=True)

Polars: Similar to Pandas, a new DataFrame with one row is created. In Polars, the vstack() method is used to vertically stack new_row onto the original DataFrame df. This operation returns a new DataFrame with the additional row.

new_row = pl.DataFrame({'Name': ['Dave'], 'Age': [40]})
df.vstack(new_row)

Pivoting

Pandas: The pivot() function in Pandas reshapes the DataFrame. The index parameter sets the new index, columns set the columns, and values set the data to be populated in the new DataFrame.

df.pivot(index='Name', columns='Age', values='Country')

Polars: Pivoting in Polars is very similar to Pandas. The pivot() method takes on, index, and values parameters to reshape the DataFrame; on plays the role of Pandas' columns argument (Polars releases before 1.0 also named it columns).

df.pivot(on="Age", index="Name", values="Country")

Reshaping Data

Pandas:

pd.melt(): The melt() function in Pandas is used to transform the DataFrame from wide to long format.
id_vars: The column(s) you want to keep as identifiers in the long format (in this case, ‘Name’).
value_vars: The columns you want to unpivot into rows (in this case, ‘Age’ and ‘Country’).

import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Country': ['USA', 'UK', 'Canada']
})

# Melting the DataFrame
melted_df = pd.melt(df, id_vars='Name', value_vars=['Age', 'Country'])

print(melted_df)

Polars:

df.unpivot(): Polars provides a similar method to transform a DataFrame from wide to long format. Before Polars 1.0 it was called melt() and took id_vars and value_vars, as in Pandas.
index: The column(s) to keep as identifiers (‘Name’ in this example).
on: The columns that are unpivoted into a single column (‘Age’ and ‘Country’ here).

import polars as pl

# Example DataFrame
df = pl.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Country': ['USA', 'UK', 'Canada']
})

# Unpivoting (melting) the DataFrame
melted_df = df.unpivot(on=["Age", "Country"], index="Name")

print(melted_df)

I have shown the similarities and differences between Pandas and Polars with practical examples. Pandas has been my trusted companion for years, and its extensive community support and intuitive syntax make it hard to let go. However, Polars offers impressive performance improvements and a modern approach to data handling.