S5cmd for High Performance Object Storage

Why Use S5cmd

Created as a modern tool for interacting with S3, s5cmd offers numerous benefits, including dramatically better performance. There are two main reasons for this superior performance:

  1. Written in Go, a high-performance, concurrent language, instead of Python. This means the application can take better advantage of multiple threads and runs faster because it is compiled rather than interpreted.
  2. Better utilization of multiple TCP connections to transfer more data to and from the object store, resulting in higher-throughput transfers (a rough way to observe this is sketched below).
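
The number of concurrent workers, and therefore parallel TCP connections to the object store, is controlled by s5cmd's -uw (upload) and -dw (download) flags used in the examples below. A minimal sketch of observing the second point, with illustrative endpoint and file names, is to start a copy in the background and count established connections while it runs:

# Start an upload with 32 concurrent upload workers (values are illustrative).
s5cmd --endpoint-url http://10.62.64.200 -uw 32 cp /tmp/one.txt s3://joshuarobinson/one.txt &
# While the copy runs, count established TCP connections to the data VIP;
# more workers means more parallel connections and higher aggregate throughput.
ss -tn | grep -c 10.62.64.200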

Installation and Usage

Create a credentials file containing your keys at "~/.aws/credentials":

[default]
aws_access_key_id = XXXXXXX
aws_secret_access_key = YYYYYYYY

Then install s5cmd with go get:

go get -u github.com/peakgames/s5cmd

An example bucket listing:

> s5cmd --endpoint-url http://10.62.64.200 ls s3://joshuarobinson/
+ DIR backup/
+ 2017/10/13 10:31:29 73 people.json
+ 2019/07/10 12:39:43 53687091200 two.txt
2019/08/02 03:08:13 +OK "ls s3://joshuarobinson" (13)
An example upload, using 64 concurrent upload workers:

> s5cmd --endpoint-url http://10.62.64.200:80 -uw 64 cp /source s3://joshuarobinson/dest

For convenience, create an alias with the endpoint and worker-count settings built in:

alias s5cmd='s5cmd --endpoint-url http://10.62.64.200 -dw 32 -uw 32'

Test Environment

My testing focuses on a FlashBlade, using a single client server: a 32-core bare-metal machine connected with 40Gbps networking and running CentOS. The storage system is a 15-blade FlashBlade with 17TB blades, running Purity//FB v2.4.2.

I run s5cmd from a container built with a simple Dockerfile:

FROM golang:alpine
RUN apk add git
RUN go get -u github.com/peakgames/s5cmd

A timed upload then looks like this:

docker run -it --rm \
--entrypoint=time \
-v /home/ir/.aws/credentials:/root/.aws/credentials \
-v /tmp/one.txt:/tmp/one.txt \
$IMGNAME s5cmd --endpoint-url http://10.62.64.200:80 -uw 32 cp /tmp/one.txt s3://joshuarobinson/one.txt
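
The $IMGNAME referenced above is the image built from that Dockerfile; a minimal sketch, with an illustrative tag name:

# Build the s5cmd image from the Dockerfile above (tag name is illustrative).
docker build -t s5cmd-test .
export IMGNAME=s5cmd-test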

Alternative Tools

This section introduces alternatives and predecessors to s5cmd that I will compare against in performance testing.

S3cmd

S3cmd, originally created in 2008, uses custom Python code to directly create and send S3 requests, rather than one of the standard AWS-provided SDKs. This manual RPC handling results in lower performance and makes it harder to keep up with changes and additions to the S3 API.

FROM ubuntu:18.04
RUN apt-get update && apt-get install -y s3cmd --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*

The relevant lines of the s3cmd configuration file (~/.s3cfg), pointing at the FlashBlade data VIP through the proxy settings:

access_key = XXXXXXXX
proxy_host = $FB_DATAVIP
proxy_port = 80
secret_key = YYYYYYYYY
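
For reference, a single-object upload with s3cmd, assuming the configuration above is in place (a sketch, not necessarily the exact invocation used for testing):

# Time a single-object upload through s3cmd.
time s3cmd put /tmp/one.txt s3://joshuarobinson/one.txt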

S4cmd

S4cmd is an updated version of s3cmd which uses the AWS SDK boto3 to handle sending and receiving S3 requests. As my results show, using boto3 provides significantly better performance than the manual RPC processing in s3cmd.

FROM ubuntu:18.04
RUN apt-get update && apt-get install -y git python3 python3-pip python3-setuptools --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
ARG S4RELEASE=2.1.0
RUN git clone git://github.com/bloomreach/s4cmd.git \
&& cd s4cmd && git checkout tags/$S4RELEASE -b release \
&& pip3 install pytz boto3 && python3 setup.py install

An example upload invocation:

docker run -it --rm -v /tmp:/tmp \
$IMGNAME s4cmd --endpoint-url http://10.62.64.200:80 --num-threads=128 put /tmp/one.txt s3://joshuarobinson/one.txt

Goofys

Goofys is a FUSE filesystem backed by S3, written in Go with performance in mind. As with its predecessor, s3fs, Goofys allows a client to mount an object store as a filesystem.

> export GOPATH=$HOME/work
> go get github.com/kahing/goofys
> go install github.com/kahing/goofys
> goofys --endpoint http://10.62.64.200 joshuarobinson /mountpoint/
> time cp /tmp/one.txt /mountpoint/one.txt
> rm /mountpoint/one.txt

Aws-cli

The aws-cli command is a multi-purpose tool for interacting with AWS services, including S3, and is written in Python using boto3.

The ~/.aws/config file raises the concurrency and multipart settings:

[default]
s3 =
  max_concurrent_requests = 1000
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB

The Dockerfile installs awscli and copies in that config:

FROM ubuntu:18.04
RUN apt-get update && apt-get install -y awscli --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
COPY config /root/.aws/config
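
A sketch of a single-object upload with aws-cli against the FlashBlade data VIP, assuming the config above (the exact test invocation may have differed):

# Time a single-object upload with aws-cli, pointing at the FlashBlade endpoint.
time aws --endpoint-url http://10.62.64.200 s3 cp /tmp/one.txt s3://joshuarobinson/one.txt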

Large Object Performance

The first performance test is a simple upload (PUT) and download (GET) of a single large object. The source data for this test is a 50GB chunk of a Wikipedia text dump.
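
A sketch of how this test can be run with s5cmd, assuming the alias defined earlier and illustrative file names:

# Hypothetical: carve a 50GB chunk out of a Wikipedia text dump to use as the source object.
head -c 50G enwiki-latest-pages-articles.xml > /tmp/one.txt
# Time the upload (PUT) of the 50GB object...
time s5cmd cp /tmp/one.txt s3://joshuarobinson/one.txt
# ...and the download (GET) back to local storage.
time s5cmd cp s3://joshuarobinson/one.txt /tmp/one.copy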

One limitation encountered during this test: with max_concurrent_requests raised as above, aws-cli failed the upload with a file descriptor error from Python's select() call:

upload failed: tmp/one.txt to s3://joshuarobinson/one.txt filedescriptor out of range in select()

Small Object Upload

The next test focuses on upload speed for small (1MB) objects. For this, I use the same source data as above, but split into 50k files of 1MB each. The difference between s5cmd and the other tools is even larger in this scenario: up to 100x faster. In fact, s5cmd achieves the same throughput as the single large object upload, whereas all of the other tools were slower.

> split -C 1M /mnt/joshua/one.txt prefix-
> s5cmd cp /tmp/prefix-* s3://joshuarobinson/tempdata/

To drive parallelism with s3cmd, I wrap it with xargs:

ls /src/data-* | xargs -n1 -i -P 64 s3cmd -q put {} s3://bucketname/some/
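
For comparison, aws-cli can push the same set of small files with a recursive copy (a sketch; the directory path is illustrative, and concurrency comes from the max_concurrent_requests setting shown earlier):

# Recursive upload of the 50k 1MB files with aws-cli.
time aws --endpoint-url http://10.62.64.200 s3 cp /src/data/ s3://joshuarobinson/tempdata/ --recursive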

Performance Comparison in AWS

To find the maximum achievable performance, I also tested these tools from an EC2 instance with the highest measured performance to S3, a c5n.18xlarge. The on-prem machine used previously has 32 cores and 40Gbps networking, whereas the c5n.18xlarge has 72 cores and 100Gbps networking. The EC2 cost of this instance is high: $3.88 per hour on-demand.

Summary

Browsing and accessing an object store often requires a command line tool to make interaction and scripting easy. Traditionally, those tools have been s3cmd and aws-cli, which are both Python-based and largely serialized, favoring ease of use over performance. S5cmd is a command line tool written in Go with an emphasis on performance, delivering dramatic improvements of up to 32x versus s3cmd and 10x versus s4cmd. S5cmd pairs well with a high performance object store like FlashBlade, with both client and server built for concurrency and performance.
