Pandas vs Polars
This notebook compares the performance of Pandas and Polars on group-by aggregations. Pandas is a popular data manipulation library for Python; Polars is a newer data processing library that is optimized for performance.
To compare the two libraries, we define two functions, one per library, that time the same aggregations on the acceleration column after grouping by name. The aggregations measured are mean, sum, median, standard deviation, max, min, variance, count, and product.
import seaborn as sns
import pandas as pd
import polars as pl
import time
import matplotlib.pyplot as plt
# Aggregations timed for both libraries (names as used by Pandas).
OPERATIONS = ['mean', 'sum', 'median', 'std', 'max', 'min', 'var', 'count', 'prod']


def measure_polars_performance(df_p):
    """Time each group-by aggregation on a Polars DataFrame."""
    times_polars = {'Operation': [], 'Polars Time (s)': []}
    for operation in OPERATIONS:
        # Assumes a recent Polars release: group_by() (formerly groupby()) with
        # expression-based aggregation. Polars names the product method product().
        method = 'product' if operation == 'prod' else operation
        expr = getattr(pl.col('acceleration'), method)()
        # perf_counter() is the high-resolution clock intended for timing short runs.
        start = time.perf_counter()
        df_p.group_by('name').agg(expr)
        elapsed = time.perf_counter() - start
        times_polars['Operation'].append(f'groupby {operation}')
        times_polars['Polars Time (s)'].append(round(elapsed, 6))
    return pd.DataFrame(times_polars)


def measure_pandas_performance(df_pd):
    """Time each group-by aggregation on a Pandas DataFrame."""
    times_pandas = {'Operation': [], 'Pandas Time (s)': []}
    for operation in OPERATIONS:
        start = time.perf_counter()
        df_pd.groupby('name')['acceleration'].agg(operation)
        elapsed = time.perf_counter() - start
        times_pandas['Operation'].append(f'groupby {operation}')
        times_pandas['Pandas Time (s)'].append(round(elapsed, 6))
    return pd.DataFrame(times_pandas)
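Both functions time each aggregation exactly once, so a single measurement can be skewed by caching or background load. One possible refinement, shown here only as a minimal sketch, is to repeat each measurement and keep the shortest run. The helper name best_of and the stand-in workload are hypothetical and not part of the original notebook.

```python
import time


def best_of(fn, repeats=5):
    """Call fn() `repeats` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload for illustration; in the notebook, fn could wrap a
# single group-by call such as the ones timed in the functions above.
elapsed = best_of(lambda: sum(range(100_000)), repeats=3)
print(f"best of 3: {elapsed:.6f} s")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs mostly reflect interference rather than the work itself.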
Dataset
We use seaborn's mpg example dataset, resampled with replacement to 5 million rows and stored in "df", to test the performance of Pandas and Polars. This dataset contains information about car models, including the car’s name, miles per gallon (mpg), horsepower, acceleration, and model year. We group by the name column and aggregate the acceleration column to measure the performance of the two libraries.
After measuring both libraries, we merge the results into a single dataframe and calculate the percentage difference, expressed relative to the Polars time: a negative value means Polars took less time than Pandas for that operation. This shows which library is faster for each aggregation.
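The benchmark data is produced by calling sample(n=5_000_000, replace=True) on a much smaller dataset. A minimal sketch of what sampling with replacement does, using a made-up three-row frame in place of the mpg data (the values and the variable names small and big are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for the mpg data; the values are made up for illustration.
small = pd.DataFrame({
    "name": ["car a", "car b", "car c"],
    "acceleration": [12.0, 11.5, 15.5],
})

# Drawing more rows than exist requires replace=True;
# random_state only makes the draw repeatable.
big = small.sample(n=10, replace=True, random_state=0)

print(len(big))               # 10 rows
print(big["name"].nunique())  # at most 3 distinct groups
```

Because resampling only duplicates existing rows, the number of distinct names cannot grow, so the benchmark measures many rows spread over comparatively few groups.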
df = sns.load_dataset("mpg").sample(n=5_000_000, replace=True)
df_p = pl.DataFrame(df)
df_pd = pd.DataFrame(df)
times_polars = measure_polars_performance(df_p)
times_pandas = measure_pandas_performance(df_pd)
times_df = pd.merge(times_polars, times_pandas, on='Operation')
times_df['% Diff'] = round(((times_df['Polars Time (s)']-times_df['Pandas Time (s)'])/times_df['Polars Time (s)']) * 100, 2)
times_df
| | Operation | Polars Time (s) | Pandas Time (s) | % Diff |
|---|---|---|---|---|
| 0 | groupby mean | 0.303844 | 0.396879 | -30.62 |
| 1 | groupby sum | 0.222383 | 0.305479 | -37.37 |
| 2 | groupby median | 0.201515 | 0.399898 | -98.45 |
| 3 | groupby std | 0.196416 | 0.330360 | -68.19 |
| 4 | groupby max | 0.202380 | 0.316174 | -56.23 |
| 5 | groupby min | 0.198285 | 0.302402 | -52.51 |
| 6 | groupby var | 0.193137 | 0.307263 | -59.09 |
| 7 | groupby count | 0.195764 | 0.305431 | -56.02 |
| 8 | groupby prod | 0.163312 | 0.305416 | -87.01 |
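The % Diff column can be checked by hand. The sketch below reproduces the value for the groupby mean row using the same formula as the notebook, and adds a speedup ratio as an extra, illustrative metric. The function names pct_diff and speedup are hypothetical helpers, not part of the original code.

```python
def pct_diff(polars_s, pandas_s):
    """Difference relative to the Polars time, as in the % Diff column."""
    return round((polars_s - pandas_s) / polars_s * 100, 2)


def speedup(polars_s, pandas_s):
    """How many times longer Pandas took than Polars (illustrative extra metric)."""
    return round(pandas_s / polars_s, 2)


# Timings taken from the 'groupby mean' row of the table above.
print(pct_diff(0.303844, 0.396879))  # -30.62
print(speedup(0.303844, 0.396879))   # 1.31
```

Since the difference is divided by the Polars time, a % Diff of -100 would mean Pandas took twice as long as Polars, which is roughly the case for groupby median.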
Plots
fig, ax = plt.subplots(figsize=(12, 6))
times_df.plot(x="Operation", y=["Polars Time (s)", "Pandas Time (s)"], kind="bar", ax=ax)
plt.title("Polars vs Pandas Performance")
plt.ylabel("Time (s)")
plt.show()