7 November 2018

Pandas Data Export Comparison

In this post, we look at the common data export options in Pandas using Python and compare them on the basis of execution time and storage size. The file formats used for comparison are XLSX, CSV, Pickle and HDF5.

First, let us consider a dataset with more than 1 million records to perform this task. The dataset I am using has the following structure.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1009989 entries, 0 to 1034143
Data columns (total 12 columns):
FEATURE1         1009989 non-null int64
FEATURE2         1009989 non-null int64
FEATURE3         1009989 non-null object
FEATURE4         1009989 non-null object
FEATURE5         1009989 non-null object
FEATURE6         1009989 non-null object
FEATURE7         1009989 non-null float64
FEATURE8         1009989 non-null float64
FEATURE9         1009989 non-null int64
FEATURE10        1009989 non-null datetime64[ns]
FEATURE11        1009989 non-null datetime64[ns]
FEATURE12        1009989 non-null datetime64[ns]
dtypes: datetime64[ns](3), float64(2), int64(3), object(4)
memory usage: 100.2+ MB

The dataset above has columns with different data types (int, object, float, datetime) that a data analyst or data scientist typically encounters. The following lines can be used to export your dataset in different formats.

import pandas as pd
from pandas import ExcelWriter

# Excel
writer = ExcelWriter('test-x.xlsx')
x.to_excel(writer, index = False, sheet_name = 'Sheet1')
writer.save()

# CSV
x.to_csv('test-c.csv')

# Pickle
x.to_pickle('test-p.pickle')

# HDF5
x.to_hdf('test-h.h5', key = 's')

Now, to export them as efficiently as possible, let us have a look at some numbers.

	XLSX	CSV	Pickle	HDF5
Execution Time	4m48s	22.172s	786ms	1.221s
Storage Size	54.5MB	107.9MB	77.1MB	78.1MB

Pickle formats perform the best in terms of time but XLSX in terms of storage. However, loading the entire dataset on Microsoft Excel or other Excel applications can at times be infeasible due to large size. I usually save my datasets in Pickle format as it fits my needs better compared to HDF5, and use CSV only when I have to share the dataset with someone who is working in a different environment.

Pandas Data Export Comparison

Get all records in SQL, Python and R

Reading Pickle File in R