Faster Data Loading for Pandas on S3

  • Using a FUSE mount via Goofys is faster than s3fs for basic Pandas reads.
  • Parallelization frameworks for Pandas speed up S3 reads by 2x.
  • Boto3 performance is a bottleneck with parallelized loads.
  • Replacing Pandas with scalable frameworks PySpark, Dask, and PyArrow results in up to 20x improvements on data reads of a 5GB CSV file.
  • PySpark has the best performance, scalability, and Pandas-compatibility tradeoff.

Dataset and Test Scenario Introduction

project  url                                         count  bytes
aa       %CE%92%CE%84_%CE%95%CF%80%CE%B9%CF%83%CF…   1      4854
aa       %CE%98%CE%B5%CF%8C%CE%B4%CF%89%CF%81%CE…    1      4917
aa       %CE%9C%CF%89%CE%AC%CE%BC%CE%B5%CE%B8_%CE…   1      4832
aa       %CE%A0%CE%B9%CE%B5%CF%81_%CE%9B%27_%CE%91…  1      4828
aa       %CE%A3%CE%A4%CE%84_%CE%A3%CF%84%CE%B1%CF…   1      4819
pip install boto3 pandas s3fs
import pandas as pd

ENDPOINT_URL = ""
storage_opts = {'client_kwargs': {'endpoint_url': ENDPOINT_URL}}
df = pd.read_csv("s3://" + BUCKETPATH, storage_options=storage_opts)
  1. Using parallelization frameworks for data loading
  2. Using frameworks that read from S3 using compiled languages (not Python!)
  3. Switching to non-Pandas APIs which do not require all data in memory

Step 1, Using Parallelization

Details on How to Run Each Test
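
One way to compare the loaders below on equal terms is to wrap each of them in the same timer and report throughput. The following is a minimal sketch under that assumption, not the harness used for the numbers in this post. `time_load` is a hypothetical helper, and the loader passed in would be something like `lambda: pd.read_csv(...)`.

```python
import time


def time_load(loader, nbytes):
    """Run loader() once and return (result, seconds, MB/s)."""
    start = time.perf_counter()
    result = loader()
    seconds = time.perf_counter() - start
    # Guard against a zero-length interval on very fast loaders.
    mb_per_s = (nbytes / 1e6) / seconds if seconds > 0 else float("inf")
    return result, seconds, mb_per_s


if __name__ == "__main__":
    # Stand-in loader that copies 1 MB of bytes; replace with a real read.
    payload = b"x" * 1_000_000
    _, secs, rate = time_load(lambda: bytes(payload), len(payload))
    print(f"{secs:.6f} s, {rate:.1f} MB/s")
```

Passing the object size explicitly keeps the helper independent of how the data was loaded, so the same call works for a DataFrame, an Arrow table, or a row count.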

pip install boto3 "dask[distributed]" pandas s3fs
import dask.dataframe as dd

ENDPOINT_URL = ""
storage_opts = {'client_kwargs': {'endpoint_url': ENDPOINT_URL}}
ddf = dd.read_csv("s3://"+BUCKETPATH, storage_options=storage_opts)
df = ddf.compute(scheduler='processes')
wget -N
chmod a+x goofys
sudo mkdir -p /mnt/fuse_goofys && sudo chown $USER /mnt/fuse_goofys
./goofys --endpoint=http://$FLASHBLADE_IP $BUCKETNAME /mnt/fuse_goofys
apt install -y openjdk-11-jdk
pip install pandas pyspark
spark-submit \
--packages org.apache.hadoop:hadoop-aws:3.2.2 \
--conf spark.hadoop.fs.s3a.endpoint= \
--conf \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
import pyspark
import pyspark.pandas as ps

pdf = ps.read_csv("s3a://" + BUCKETPATH)
with pdf.spark.persist(pyspark.StorageLevel.MEMORY_ONLY) as df:
pip install pandas pyarrow
import pyarrow.dataset as ds
import pyarrow.fs

ENDPOINT = ""
fbs3 = pyarrow.fs.S3FileSystem(access_key=ACCESS_KEY, secret_key=SECRET_KEY, endpoint_override=ENDPOINT, scheme='http')
dataset = ds.dataset(BUCKETPATH, filesystem=fbs3)
df = dataset.to_table().to_pandas()

Is Storage the Bottleneck?

AWS Error [code 99]: curlCode: 28, Timeout was reached

Step 2, Beyond Pandas

# pyspark
c = df.filter(df["count"] == 1).count()
# pyarrow
tab = dataset.scanner(filter=ds.field("count") == 1).to_table()
# dask
c = len(df[df['count'] == 1].index)
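
The three snippets above push the filter into the framework so the full table never has to sit in memory at once. The same idea can be sketched in plain Python by streaming the file one line at a time and keeping only a running count. This assumes a space-delimited file with a header row naming the columns, which may not match the real dataset. `count_matching` is a hypothetical helper and the sample rows are made up for illustration.

```python
import io


def count_matching(lines, column="count", value="1", sep=" "):
    """Count rows whose `column` equals `value`, holding one line at a time."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split(sep)
    idx = header.index(column)
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip short or malformed rows instead of failing on them.
        if len(fields) > idx and fields[idx] == value:
            total += 1
    return total


if __name__ == "__main__":
    # Made-up rows in the same four-column layout as the dataset.
    sample = io.StringIO(
        "project url count bytes\n"
        "aa Main_Page 1 4854\n"
        "aa Other_Page 3 9000\n"
        "ab Main_Page 1 4917\n"
    )
    print(count_matching(sample))  # 2
```

Any iterable of lines works here, including an open file on a FUSE mount, so memory use stays flat regardless of file size.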





Data science, software engineering, hacking

Joshua Robinson
