S5cmd for High Performance Object Storage

Or Why Friends Don’t Let Friends Use s3cmd

FlashBlade is a high performance object store, but many of the basic tools used for browsing and accessing object stores lack parallelization and performance. Common tools include s3cmd and the aws-cli, but neither is built for performance. Pairing a high performance storage backend with an inefficient client application leaves the storage system significantly under-utilized.

S5cmd enables browsing and transferring data to/from an object store with very high performance when paired with FlashBlade. For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3GB/s), whereas s3cmd and aws-cli can only reach 85MB/s and 375MB/s respectively.
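As a quick sanity check on the link-saturation figure, the arithmetic can be sketched in shell. This assumes decimal units (1 GB = 10^9 bytes) and ignores protocol overhead; the function names are mine, for illustration only.

```shell
#!/bin/sh
# Convert a link speed in Gbps to GB/s (decimal units, no protocol overhead).
gbps_to_gbytes() {
    awk -v gbps="$1" 'BEGIN { printf "%.1f\n", gbps / 8 }'
}

# Percentage of the raw line rate reached by a measured throughput in GB/s.
link_utilization() {
    awk -v measured="$1" -v gbps="$2" \
        'BEGIN { printf "%.0f\n", 100 * measured / (gbps / 8) }'
}

gbps_to_gbytes 40          # raw ceiling of a 40Gbps link, in GB/s
link_utilization 4.3 40    # share of that ceiling reached at 4.3GB/s
```

Under those assumptions the raw ceiling is 5.0GB/s, so 4.3GB/s is roughly 86% of line rate before accounting for TCP and HTTP overhead.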

Why Use S5cmd

Created as a modern tool for interacting with S3, s5cmd provides numerous benefits including dramatically better performance. There are two reasons why s5cmd has superior performance:

  1. Written in a high-performance, concurrent language, Go, instead of Python. This means the application can take better advantage of multiple threads and is faster to run because it is compiled and not interpreted.

While s5cmd delivers the performance, it still lacks some features that are occasionally useful: creating and deleting buckets, aborting multipart uploads, and HEAD operations on objects.

Installation and Usage:

Create a credentials file containing your keys at “~/.aws/credentials”

[default]
aws_access_key_id = XXXXXXX
aws_secret_access_key = YYYYYYYY
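One way to create that file from a script is sketched below. The helper name write_credentials is hypothetical, the keys are the same placeholders as above, and the example writes to a scratch directory rather than ~/.aws so that nothing existing is overwritten.

```shell
#!/bin/sh
# Write an AWS-style credentials file into DIR with owner-only permissions.
# Arguments: DIR ACCESS_KEY SECRET_KEY. Refuses to overwrite an existing file.
write_credentials() {
    dir=$1
    file="$dir/credentials"
    if [ -e "$file" ]; then
        echo "refusing to overwrite $file" >&2
        return 1
    fi
    mkdir -p "$dir" || return 1
    # The subshell keeps the restrictive umask from leaking into the caller.
    (
        umask 077
        printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
            "$2" "$3" > "$file"
    )
}

# Example with placeholder keys, written to a scratch directory.
scratch=$(mktemp -d)
write_credentials "$scratch/.aws" XXXXXXX YYYYYYYY && cat "$scratch/.aws/credentials"
```

Passing "$HOME/.aws" as the first argument would place the file where the tools in this post look for it by default.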

Install Golang for your OS and then install s5cmd:

go get -u github.com/peakgames/s5cmd

Run example commands to list objects in a bucket and then upload an object:

> s5cmd --endpoint-url http://10.62.64.200 ls s3://joshuarobinson/
+ DIR backup/
+ 2017/10/13 10:31:29 73 people.json
+ 2019/07/10 12:39:43 53687091200 two.txt
2019/08/02 03:08:13 +OK "ls s3://joshuarobinson" (13)
> s5cmd --endpoint-url http://10.62.64.200:80 -uw 64 cp /source s3://joshuarobinson/dest

The endpoint-url argument should point to a data VIP on the FlashBlade. Without this argument, the tool will default to accessing AWS S3. The options “uw” and “dw” control the number of threads used for uploads and downloads. (Update: as of s5cmd v1.0, the ‘uw’ and ‘dw’ flags are merged to the ‘concurrency’ option.)

For day-to-day usage of s5cmd, I use the following alias in my bashrc file to avoid re-typing the same arguments:

alias s5cmd='s5cmd --endpoint-url http://10.62.64.200 -dw 32 -uw 32'

Test Environment

I focus my testing on a single client server accessing a FlashBlade. The client is a 32-core bare-metal machine connected with 40Gbps networking and running CentOS. The storage system is a 15-blade FlashBlade with 17TB blades running Purity//FB v2.4.2.

For testing with s5cmd, I used the following Dockerfile:

FROM golang:alpine
RUN apk add git
RUN go get -u github.com/peakgames/s5cmd

Then, to run a timed upload test that copies a local file from /tmp to an S3 bucket:

docker run -it --rm \
--entrypoint=time \
-v /home/ir/.aws/credentials:/root/.aws/credentials \
-v /tmp/one.txt:/tmp/one.txt \
$IMGNAME s5cmd --endpoint-url http://10.62.64.200:80 -uw 32 cp /tmp/one.txt s3://joshuarobinson/one.txt

Note that for security purposes, the credentials file containing access keys is mounted into the container at run-time.
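The elapsed time reported by such a run can be turned into a throughput figure with a small helper. This is a minimal sketch, assuming decimal megabytes (1 MB = 10^6 bytes); the function name and the example numbers are hypothetical, not measurements from this post.

```shell
#!/bin/sh
# Throughput in MB/s (decimal: 1 MB = 10^6 bytes) from a byte count
# and an elapsed time in seconds.
throughput_mb_per_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) {
            print "elapsed time must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.1f\n", bytes / secs / 1000000
    }'
}

# Hypothetical example: a 1GB object that took 10 seconds to upload.
throughput_mb_per_s 1000000000 10
```

The byte count can be taken from the object listing or from "wc -c" on the local file, and the seconds from the "real" line printed by time.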

Alternative Tools

This section introduces alternatives and predecessors to s5cmd that I will compare against in performance testing.

Note that these tools, except s3cmd, assume the existence of the credentials file described in the s5cmd configuration. I also assume a bucket ‘joshuarobinson’ has already been created. If not, the bucket can be created with any of these tools or via the FlashBlade GUI, CLI, or REST API.

S3cmd

S3cmd, originally created in 2008, uses custom Python code to directly create and send S3 requests and does not use one of the standard AWS-provided SDKs. This manual request handling yields lower performance and makes it hard to keep up with changes and additions to the S3 API.

Dockerfile:

FROM ubuntu:18.04
RUN apt-get update && apt-get install -y s3cmd --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*

Generate a config file (using “s3cmd --configure”) and set the following fields:

access_key = XXXXXXXX
proxy_host = $FB_DATAVIP
proxy_port = 80
secret_key = YYYYYYYYY

S4cmd

S4cmd is an updated version of s3cmd which uses the AWS SDK boto3 to handle sending and receiving S3 requests. As my results show, using boto3 provides significantly better performance than the manual request processing in s3cmd.

Note that the endpoint-url argument is not available in the version installed through apt on Ubuntu (2.0.1), so instead I install version 2.1.0 from source.

Dockerfile:

FROM ubuntu:18.04
RUN apt-get update && apt-get install -y git python3 python3-pip python3-setuptools --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
ARG S4RELEASE=2.1.0
RUN git clone git://github.com/bloomreach/s4cmd.git \
&& cd s4cmd && git checkout tags/$S4RELEASE -b release \
&& pip3 install pytz boto3 && python3 setup.py install

Test commands are then executed as follows:

docker run -it --rm -v /tmp:/tmp \
$IMGNAME s4cmd --endpoint-url http://10.62.64.200:80 --num-threads=128 put /tmp/one.txt s3://joshuarobinson/one.txt

Goofys

Goofys is a FUSE filesystem backed by S3, written in Golang with performance in mind. As with its predecessor, s3fs, Goofys allows the client to use an object store as a filesystem.

Assuming Go is already installed, goofys can be downloaded and installed as follows:

> export GOPATH=$HOME/work
> go get github.com/kahing/goofys
> go install github.com/kahing/goofys

Note you will likely also need to install fuse utils for your OS.

Usage includes a one-time mount operation that mounts a bucket (“joshuarobinson” in this example). Once mounted, standard filesystem operations are used to read and write, with goofys translating to the appropriate object equivalent. Note that the result is a simulated filesystem so some operations, like random overwrites, will not work as expected.

> goofys --endpoint http://10.62.64.200 joshuarobinson /mountpoint/
> time cp /tmp/one.txt /mountpoint/one.txt
> rm /mountpoint/one.txt

Aws-cli

The aws-cli command is a multi-purpose tool for interacting with AWS services, including S3, and is written in Python using botocore, the library that also underlies boto3.

Create a configuration file, ‘config’, to increase the amount of concurrency from the defaults:

[default]
s3 =
  max_concurrent_requests = 1000
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB

Dockerfile:

FROM ubuntu:18.04
RUN apt-get update && apt-get install -y awscli --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
COPY config /root/.aws/config

Large Object Performance

The first performance test is a simple upload (PUT) and download (GET) of a single large object. The source data for this test is a 50GB chunk of a Wikipedia text dump.

In this graph, the performance improvement of s5cmd over all alternatives is clear in both upload and download speeds; s5cmd is 10x faster than the fastest Python-based alternative, s4cmd.

Among the Python-based tools, s3cmd, the oldest, runs at half the speed of s4cmd and aws-cli. Goofys, while written in Golang and theoretically capable of higher performance, is limited by its reliance on filesystem tools like cp, which are not sufficiently multi-threaded.

Note that the default s5cmd thread count value of 5 for upload and download (“-uw” and “-dw”) results in lower throughput so it is necessary to increase those values. The default setting results in GET speeds of 1.3GB/s instead of the peak of 4.3GB/s. In my testing, I found “-dw 16” and “-uw 32” to be the thread count values needed to reach peak throughput. The maximum sustained download speed for s5cmd was 4.3GB/s, which is very close to the limits of a 40Gbps connection. It is not clear if even higher download speeds are possible with more network bandwidth.

Note that aws-cli was not able to handle the 50GB file, giving the following error:

upload failed: tmp/one.txt to s3://joshuarobinson/one.txt filedescriptor out of range in select()

For the upload test results reported for the aws-cli, the object was truncated to 4GB.

Pro-tip: adding “--entrypoint=time” to a docker run invocation makes it easy to measure runtimes without including the times for container startup and shutdown.

Small Object Upload

The next test focuses on upload speed of small (1MB) objects. For this, I use the same data as above but split it into 50k files of 1MB each. The difference between s5cmd and the other tools is even larger in this scenario, up to 100x faster. In fact, s5cmd achieves the same performance as the single large object upload, whereas all other tools were slower.

I generated the split dataset from the original with this command:

> split -C 1M /mnt/joshua/one.txt prefix-
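To see what the -C option does before running it on 50GB, the same split can be tried on a small generated file. This is a sketch that assumes GNU coreutils (split -C and seq); the sample file and the 10K piece size are stand-ins chosen to keep the run quick.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing outside it is touched.
workdir=$(mktemp -d)

# Build a small line-oriented sample file (about 109KB of numbered lines).
seq 1 20000 > "$workdir/sample.txt"

# -C 10K packs as many complete lines as fit into each piece of at most
# 10KiB, so no line is ever cut in half across two pieces.
split -C 10K "$workdir/sample.txt" "$workdir/prefix-"

# Count the pieces and confirm that concatenating them restores the input.
piece_count=$(ls "$workdir"/prefix-* | wc -l | tr -d ' ')
echo "pieces: $piece_count"
cat "$workdir"/prefix-* | cmp -s - "$workdir/sample.txt" && echo "round trip ok"
```

Because pieces end on line boundaries, most come out slightly smaller than the requested size, which is why the 1M split of the full dataset yields files of approximately, not exactly, 1MB.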

Each tool has a slightly different method for a recursive upload. The s5cmd invocation that I used was:

> s5cmd cp /tmp/prefix-* s3://joshuarobinson/tempdata/

In the above graph, s5cmd is 130x faster than s3cmd and 18x faster than s4cmd.

Note that s3cmd has particularly low performance at 10MB/s. A possible workaround is simply to spawn multiple separate uploads in parallel; xargs with the “-P N” option runs up to N processes at once.

ls /src/data-* | xargs -I {} -P 64 s3cmd -q put {} s3://bucketname/some/

When spawning multiple s3cmd processes in parallel, aggregate upload performance increases from 10MB/s to 82MB/s. This demonstrates that throughput does improve with more concurrency, but there are still fundamental limitations in the Python code.
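The xargs fan-out pattern can be tried safely with a stand-in command before pointing it at a real bucket. In this sketch an echo takes the place of the s3cmd upload, and the scratch files and the parallelism of 4 are made up for illustration. It relies on the -print0, -0 and -P extensions found in GNU and BSD find and xargs.

```shell
#!/bin/sh
# Create a handful of empty scratch files to act as the upload set.
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
    : > "$workdir/data-$i"
done

# find -print0 with xargs -0 copes with unusual filenames.
# -n 1 hands one file to each invocation; -P 4 runs up to four at a time.
# The echo is a stand-in for: s3cmd -q put "$1" s3://bucketname/some/
find "$workdir" -name 'data-*' -print0 |
    xargs -0 -n 1 -P 4 sh -c 'echo "uploaded $(basename "$1")"' _ \
    > "$workdir/log"

uploaded_count=$(wc -l < "$workdir/log" | tr -d ' ')
echo "uploads completed: $uploaded_count"
```

Each file gets its own process here, so interpreter startup cost is paid once per object; that overhead is one plausible reason the parallel s3cmd run still falls well short of the other tools.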

Performance Comparison in AWS

To find the maximum performance, I also tested these tools using the EC2 instance type with the highest measured performance to S3, c5n.18xlarge. The on-prem machine used previously has 32 cores and 40Gbps networking, whereas this c5n.18xlarge instance has 72 cores and 100Gbps networking. The EC2 cost of this instance is high at $3.88 per hour on-demand.

With the exception of s3cmd, all tools have slightly higher performance with the more powerful client hardware and more importantly, the relative difference between the tools is consistent across platforms. This confirms that the performance improvement is due to the s5cmd tool itself and not any specifics of the client hardware or backend object store.

Summary

Browsing and accessing an object store often requires a command line tool to make interaction and scripting easy. Traditionally, those tools have been s3cmd and aws-cli, which are both Python-based and largely serialized, offering ease of use without performance. S5cmd is a command line tool written in Golang with an emphasis on performance, resulting in dramatic improvements of up to 32x versus s3cmd and 10x versus s4cmd. S5cmd pairs well with a high performance object store like FlashBlade, with both client and server built for concurrency and performance.
