Pandas Data Export Comparison

In this post, we look at the common data export options in Pandas using Python and compare them on the basis of execution time and storage size. The file formats used for comparison are XLSX, CSV, Pickle and HDF5.

First, let us consider a dataset with more than 1 million records to perform this task. The dataset I am using has the following structure.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1009989 entries, 0 to 1034143
Data columns (total 12 columns):
FEATURE1         1009989 non-null int64
FEATURE2         1009989 non-null int64
FEATURE3         1009989 non-null object
FEATURE4         1009989 non-null object
FEATURE5         1009989 non-null object
FEATURE6         1009989 non-null object
FEATURE7         1009989 non-null float64
FEATURE8         1009989 non-null float64
FEATURE9         1009989 non-null int64
FEATURE10        1009989 non-null datetime64[ns]
FEATURE11        1009989 non-null datetime64[ns]
FEATURE12        1009989 non-null datetime64[ns]
dtypes: datetime64[ns](3), float64(2), int64(3), object(4)
memory usage: 100.2+ MB

The dataset above has columns with different data types (int, object, float, datetime) that a data analyst or data scientist typically encounters. The following lines can be used to export your dataset in different formats.

import pandas as pd
from pandas import ExcelWriter

# Excel
writer = ExcelWriter('test-x.xlsx')
x.to_excel(writer, index = False, sheet_name = 'Sheet1')
writer.save()

# CSV
x.to_csv('test-c.csv')

# Pickle
x.to_pickle('test-p.pickle')

# HDF5
x.to_hdf('test-h.h5', key = 's')

Now, to export them as efficiently as possible, let us have a look at some numbers.

XLSX CSV Pickle HDF5
Execution Time 4m48s 22.172s 786ms 1.221s
Storage Size 54.5MB 107.9MB 77.1MB 78.1MB

Pickle formats perform the best in terms of time but XLSX in terms of storage. However, loading the entire dataset on Microsoft Excel or other Excel applications can at times be infeasible due to large size. I usually save my datasets in Pickle format as it fits my needs better compared to HDF5, and use CSV only when I have to share the dataset with someone who is working in a different environment.