Faster Data Loading for Pandas on S3

  • Using a FUSE mount via Goofys is faster than s3fs for basic Pandas reads.
  • Parallelization frameworks for Pandas speed up S3 reads by 2x.
  • Boto3 performance is a bottleneck with parallelized loads.
  • Replacing Pandas with a scalable framework (PySpark, Dask, or PyArrow) makes reads of a 5GB CSV file up to 20x faster.
  • PySpark has the best performance, scalability, and Pandas-compatibility tradeoff.

Dataset and Test Scenario Introduction

project  url                                         count  bytes
aa       %CE%92%CE%84_%CE%95%CF%80%CE%B9%CF%83%CF…   1      4854
aa       %CE%98%CE%B5%CF%8C%CE%B4%CF%89%CF%81%CE…    1      4917
aa       %CE%9C%CF%89%CE%AC%CE%BC%CE%B5%CE%B8_%CE…   1      4832
aa       %CE%A0%CE%B9%CE%B5%CF%81_%CE%9B%27_%CE%91…  1      4828
aa       %CE%A3%CE%A4%CE%84_%CE%A3%CF%84%CE%B1%CF…   1      4819
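pip install boto3 pandas s3fs

import pandas as pd

# Point s3fs at the object store endpoint instead of the AWS default.
ENDPOINT_URL=""
storage_opts = {'client_kwargs': {'endpoint_url': ENDPOINT_URL}}
# BUCKETPATH is "<bucket>/<key>" for the CSV object.
df = pd.read_csv("s3://" + BUCKETPATH, storage_options=storage_opts)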
  1. Using parallelization frameworks for data loading
  2. Using frameworks that read from S3 using compiled languages (not Python!)
  3. Switching to non-Pandas APIs which do not require all data in memory

Step 1, Using Parallelization

Details on How to Run Each Test

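pip install boto3 dask "dask[distributed]" pandas s3fs

import dask.dataframe as dd

ENDPOINT_URL=""
storage_opts = {'client_kwargs': {'endpoint_url': ENDPOINT_URL}}
# read_csv is lazy; it only builds the task graph.
ddf = dd.read_csv("s3://"+BUCKETPATH, storage_options=storage_opts)
# compute() runs the reads in parallel worker processes and returns a Pandas DataFrame.
df = ddf.compute(scheduler='processes')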
wget -N
chmod a+x goofys
sudo mkdir -p /mnt/fuse_goofys && sudo chown $USER /mnt/fuse_goofys
./goofys --endpoint=http://$FLASHBLADE_IP $BUCKETNAME /mnt/fuse_goofys
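Once the bucket is mounted, Pandas can read objects as ordinary files, with no s3fs or storage_options involved. A minimal sketch, assuming the mount point created above; the helper name `read_csv_via_mount` and the object key in the usage comment are hypothetical, not taken from the original tests:

```python
import os

import pandas as pd

# Mount point created by the goofys commands above.
MOUNT_POINT = "/mnt/fuse_goofys"


def read_csv_via_mount(relative_key, mount_point=MOUNT_POINT, **read_csv_kwargs):
    """Read one object through the FUSE mount as if it were a local file.

    relative_key is the object key inside the mounted bucket (hypothetical
    parameter name); extra keyword arguments are passed to pandas.read_csv.
    """
    path = os.path.join(mount_point, relative_key)
    return pd.read_csv(path, **read_csv_kwargs)


# Example call (the key is a placeholder):
# df = read_csv_via_mount("path/to/file.csv")
```

Because the read goes through the kernel's file interface, any Pandas reader that accepts a local path works the same way.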
apt install -y openjdk-11-jdk
pip install pandas pyspark
spark-submit \
--packages org.apache.hadoop:hadoop-aws:3.2.2 \
--conf spark.hadoop.fs.s3a.endpoint= \
--conf \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
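import pyspark
import pyspark.pandas as ps

pdf = ps.read_csv("s3a://" + BUCKETPATH)
# persist() keeps the data cached in executor memory for the duration of the block.
with pdf.spark.persist(pyspark.StorageLevel.MEMORY_ONLY) as df:
    pass  # work with the cached DataFrame here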
pip install pandas pyarrow
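import pyarrow.dataset as ds
import pyarrow.fs

ENDPOINT = ""
# ACCESS_KEY and SECRET_KEY are the S3 credentials for the endpoint.
fbs3 = pyarrow.fs.S3FileSystem(access_key=ACCESS_KEY, secret_key=SECRET_KEY, endpoint_override=ENDPOINT, scheme='http')
dataset = ds.dataset(BUCKETPATH, filesystem=fbs3)
# Materialize the whole dataset as an Arrow table, then convert to Pandas.
df = dataset.to_table().to_pandas()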
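The loaders above can be compared by wrapping each one in a small timer. This is a sketch rather than the harness behind the published numbers; `time_load` and its `size_bytes` parameter are illustrative names, not part of any library:

```python
import time


def time_load(load_fn, size_bytes=None):
    """Run load_fn once and report how long it took.

    Returns (result, elapsed_seconds, throughput_mb_per_s). Throughput is
    None unless size_bytes is given and the elapsed time is non-zero.
    """
    start = time.perf_counter()
    result = load_fn()
    elapsed = time.perf_counter() - start

    throughput = None
    if size_bytes is not None and elapsed > 0:
        throughput = size_bytes / elapsed / 1e6  # MB/s, with 1 MB = 10**6 bytes
    return result, elapsed, throughput


# Example call, reusing names from the Pandas snippet (assumed to be defined):
# df, secs, mbps = time_load(
#     lambda: pd.read_csv("s3://" + BUCKETPATH, storage_options=storage_opts),
#     size_bytes=5 * 10**9,
# )
```

Because Dask builds its task graph lazily, the timed callable should include the `compute()` call, as in the Dask snippet above; timing `dd.read_csv` alone would measure almost nothing.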

Is Storage the Bottleneck?

AWS Error [code 99]: curlCode: 28, Timeout was reached

Step 2, Beyond Pandas

# pyspark
c = df.filter(df["count"] == 1).count()
# pyarrow
tab = dataset.scanner(filter=ds.field("count") == 1).to_table()
# dask
c = len(df[df['count'] == 1].index)
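For comparison, the same count in plain Pandas needs the whole table in memory before the filter can run. A small sketch on a stand-in frame that reuses the dataset's column names; the rows are invented for illustration and `count_single_views` is a hypothetical helper name:

```python
import pandas as pd


def count_single_views(df: pd.DataFrame) -> int:
    """Count rows whose `count` column equals 1, entirely in memory."""
    return int((df["count"] == 1).sum())


# Stand-in data with the same columns as the dataset sample (values made up).
sample = pd.DataFrame(
    {
        "project": ["aa", "aa", "ab"],
        "url": ["x", "y", "z"],
        "count": [1, 3, 1],
        "bytes": [4854, 4917, 4832],
    }
)
print(count_single_views(sample))  # prints 2
```

The PySpark, PyArrow, and Dask versions above express the same filter but let the framework apply it while scanning, rather than after loading everything.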





Joshua Robinson
