Object Storage via Fuse Filesystems

Cloud-native applications must often co-exist with legacy applications. Those legacy applications are hardened and just work, so rewriting can seem hardly worth the trouble.

For legacy applications to take advantage of new technology, they need bridges. Fuse clients for object storage are one such bridge: they allow most (but not all) applications that expect to read and write files to work in the new world of object storage.

I will focus on three different implementations of a fuse-based filesystem on top of object storage: s3fs, goofys, and rclone. Prior work on performance comparisons of s3fs and goofys includes theoretical upper bounds and the goofys github readme.

General guidelines for when to use a fuse filesystem adaptor for object storage:

  • The application expecting files requires only moderate performance and does not have complicated dependencies on POSIX semantics.
  • You are using the filesystem adaptor for either reads or writes of the data, but not both. If your application both reads and writes files, it is best to use a real filesystem for the working data and copy only the final results to an object store.
  • You are using the adaptor because one part of your data pipeline is an application that expects files, whereas other applications expect objects.

If you find yourself primarily copying data between local filesystems and remote object storage, then tools like s5cmd or rclone will provide better performance.

There is also a Python library named s3fs with similar functionality; despite sharing a name, the two are distinct pieces of software. The Python version does make access to objects much easier than direct boto3 code, but is not as performant due to the nature of Python itself.

Amongst the three choices, I personally suggest using goofys due to significantly better performance. It may have less POSIX compatibility, but if that difference matters to your use-case, then a fuse client might not be the right answer.

Fuse Best Practices and Limitations

Conceptually, these fuse clients are lightweight client-side gateways that translate between objects and files. You could also run a separate server that acts as gateway, but that incurs the additional cost and complexity of an extra server.

A fuse client is most useful when one part of a workflow requires simple reading or writing of files, whereas the rest of your workflow directly accesses objects via the native S3 API. In other words, a fuse client is a tactical choice for bringing a dataset and associated workflow from filesystem to object storage, where the fuse client specifically bridges the gap where an application expects to read or write files.

Things to avoid when using a fuse client:

  • Do not expect ownership or permissions to work correctly; control permissions with your S3 key policies instead.
  • Do not use renames (the ‘mv’ command).
  • Avoid large numbers of directory listing operations.
  • Write to files sequentially; avoid random writes and appending to existing files.
  • Do not use symlinks or hard links.
  • Do not expect consistency across clients; avoid sharing files through multiple clients with fuse mounts.
  • Avoid really large files (1TB or larger).
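The sequential-write guideline above can be illustrated with a minimal Python sketch. The helper name `write_sequentially` is hypothetical and not part of any fuse client; it simply shows the access pattern: open once in truncate mode, write front to back, never seek or append.

```python
def write_sequentially(path, chunks):
    """Write all chunks to path front to back in a single pass.

    The file is opened once in truncate mode ("wb"), never in append
    or read/write mode, and the file position is never moved. This is
    the simple writer pattern the list above recommends.
    Returns the total number of bytes written.
    """
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total
```

An application that instead writes to a temporary name and renames it into place, or reopens the file in append mode, falls outside this pattern and is a poor fit for a fuse mount.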

Both s3fs and goofys publish their respective limitations. One advantage of s3fs is that it preserves file owner/group bits as object custom metadata.

In short, the application using the fuse filesystem should be a simple reader or writer of files. If that does not match your use-case, I would suggest careful consideration before proceeding.

Installation and Mounting Instructions

Basics

sudo apt install s3fs

The mount operation uses two additional options to specify the endpoint as the FlashBlade data VIP and to use path-style requests.

sudo mkdir -p /mnt/fuse_s3fs && sudo chown $USER /mnt/fuse_s3fs
s3fs $BUCKETNAME /mnt/fuse_s3fs -o url=http://10.62.64.200 -o use_path_request_style

The FlashBlade’s data VIP is 10.62.64.200 in all the example commands.

Install goofys by downloading the standalone binary from the github release page:

wget -N https://github.com/kahing/goofys/releases/latest/download/goofys
chmod a+x goofys

Then mount a bucket as a filesystem as follows:

sudo mkdir -p /mnt/fuse_goofys && sudo chown $USER /mnt/fuse_goofys
./goofys --endpoint=http://10.62.64.200 $BUCKETNAME /mnt/fuse_goofys

With goofys you can also mount specific prefixes, i.e., mount only a “subdirectory” and limit the visibility of data via fuse to just a certain key prefix.

goofys <bucket:prefix> <mountpoint>
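If the mount is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch, assuming the same `./goofys` binary location and `--endpoint` flag used in the commands above; the function name `goofys_mount_command` and the bucket name `mybucket` are hypothetical.

```python
import shlex


def goofys_mount_command(bucket, mountpoint, endpoint, prefix=None,
                         binary="./goofys"):
    """Build the goofys argument list for a bucket or a bucket:prefix mount."""
    # goofys accepts "bucket:prefix" to expose only keys under that prefix.
    target = "{}:{}".format(bucket, prefix) if prefix else bucket
    return [binary, "--endpoint={}".format(endpoint), target, mountpoint]


cmd = goofys_mount_command("mybucket", "/mnt/fuse_goofys",
                           "http://10.62.64.200", prefix="temp")
# Print the command for review; pass cmd to subprocess.run() to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a string means the command can be handed to `subprocess.run(cmd, check=True)` without shell quoting concerns.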

Rclone-mount relies on the same installation and configuration as standard rclone. This means that if you are already using rclone, it is trivial to also mount a bucket. Here, “fb” refers to my FlashBlade’s s3 configuration in rclone.conf:

[fb]
type = s3
env_auth = true
region = us-east-1
endpoint = http://10.62.64.200

Replace the endpoint with the appropriate IP address and then mount with the following command:

rclone --vfs-cache-mode writes mount fb:$BUCKETNAME /mnt/fuse_rclone &

Note that I use the ampersand operator to background the mounting operation as the default is to keep rclone in the foreground.
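For setups that provision many clients, the remote definition shown above can be generated rather than typed by hand. This is a minimal sketch using only the Python standard library; the function name `rclone_s3_remote` is hypothetical, and it only renders the text of the section. Where to write it depends on your rclone setup (the `rclone config file` command reports the path in use).

```python
import configparser
import io


def rclone_s3_remote(name, endpoint, region="us-east-1"):
    """Render an rclone.conf section for an S3-compatible endpoint."""
    # Interpolation is disabled so a literal "%" in a value is kept as-is.
    config = configparser.ConfigParser(interpolation=None)
    config[name] = {
        "type": "s3",
        "env_auth": "true",   # take credentials from the environment
        "region": region,
        "endpoint": endpoint,
    }
    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()


print(rclone_s3_remote("fb", "http://10.62.64.200"))
```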

Simulating a Directory Structure with Object Keys

The other common approach leaves directories implicit in the key structure, meaning no extra empty placeholder objects. While this may complicate some tooling, it also means that the fuse client approach supports empty directories as you would expect in a filesystem.

But if you are reading a file structure that was laid out using implicit directories, it will still work the same!
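To make the idea of implicit directories concrete, here is a minimal sketch of how a directory listing can be derived purely from object keys, in the style of an S3 LIST with a "/" delimiter. It is an illustration only, not the code any of the fuse clients use, and the function name `list_directory` is hypothetical.

```python
def list_directory(keys, prefix=""):
    """Return (subdirs, files) found directly under prefix.

    A "directory" is any path component followed by "/" in some key.
    A key equal to the prefix itself is a placeholder object and is
    skipped rather than listed as a file.
    """
    subdirs, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # placeholder object for this directory itself
        head, sep, _ = rest.partition("/")
        if sep:
            subdirs.add(head)
        else:
            files.add(head)
    return sorted(subdirs), sorted(files)


keys = [
    "linux-5.12.13/Makefile",
    "linux-5.12.13/kernel/fork.c",
    "pod.yaml",
    "empty/",  # explicit placeholder: the only way an empty directory exists
]
print(list_directory(keys))
print(list_directory(keys, "linux-5.12.13/"))
```

Without the `empty/` placeholder key, nothing in the bucket would record that the empty directory exists, which is why empty directories depend on placeholder objects.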

Permissions

Angle 1: Reader

Note that if clients try to write files without permission, it is possible to see inconsistencies. For example, if I touch a file with read-only permission and goofys, an immediate listing (‘ls’) will see a phantom file which eventually goes away. The ‘touch’ command does fail, so many but not all programs or scripts that unexpectedly write should fail.

$ touch foo
touch: failed to close ‘foo’: Permission denied
$ ls
foo linux-5.12.13

$ ls
linux-5.12.13

Most operations fail without the “list” permission due to expectations of being able to browse directory structures, but for example, it is still possible to read individual files with ‘cat’ without the object-list policy enabled.

Alternatively, you can mount using goofys’s flag “-o r” for read-only access, but using keys and access policies provides stronger protections than mounting in read-only mode. Restricting permission with keys avoids users simply re-mounting without “-o r” to work around an issue.

And of course, without the object-read permission, the client can list directories and files but not access any of the file content.

$ cat pod.yaml
cat: pod.yaml: Permission denied
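A quick way to check which of these cases applies to a given key is to probe the mount from a script. The following is a minimal sketch, assuming the fuse client surfaces denied operations as "Permission denied" as in the shell output above; the function name `probe_access` is hypothetical.

```python
import os


def probe_access(directory, filename):
    """Report whether listing a directory and reading one file succeed."""
    result = {"list": True, "read": True}
    try:
        os.listdir(directory)
    except PermissionError:
        result["list"] = False
    try:
        # Read a single byte: enough to trigger an object read.
        with open(os.path.join(directory, filename), "rb") as f:
            f.read(1)
    except PermissionError:
        result["read"] = False
    return result
```

Errors other than a permission failure (for example a missing file) are deliberately left to propagate so that they are not mistaken for a policy restriction.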

Angle 2: Writer

With write and list permissions, I can write files and read them back locally for a short period of time due to local caching. Note that it appears to require ‘list’ permissions and also enables overwrites.

Enabling Deletions

See the following section on “undo” for more information about how to combine deletions with the ability to undo those deletions when necessary.

Full Control

This avoids giving users more permissions than necessary, for example the ability to create and delete buckets, etc, but they can still write, read, and delete files.

Bonus: Undo an Accidental Deletion

First, enable versioning on the bucket if it is not already enabled. In the FlashBlade GUI’s bucket view, the “Enable versioning…” option is in the upper right corner.

And then in order to undelete files that have been accidentally deleted, you can simply go find the delete marker and remove it. There is no “undelete” operation at the filesystem level, so this needs to be out-of-band through a different mechanism or script.

An example Python script to undelete an object by removing its DeleteMarker:

#!/usr/bin/python3
import boto3
import sys

FB_DATAVIP = '10.62.64.200'

if len(sys.argv) != 3:
    print("Usage: {} bucketname key".format(sys.argv[0]))
    sys.exit(1)

bucketname = sys.argv[1]
key = sys.argv[2]

s3 = boto3.resource('s3', endpoint_url='http://' + FB_DATAVIP)
kwargs = {'Bucket': bucketname, 'Prefix': key}
pageresponse = s3.meta.client.get_paginator('list_object_versions').paginate(**kwargs)

for pageobject in pageresponse:
    # A prefix listing can return other keys too, so check every marker.
    for marker in pageobject.get('DeleteMarkers', []):
        # Only the latest delete marker is what hides the object.
        if marker['Key'] == key and marker['IsLatest']:
            print("Undeleting s3://{}/{}".format(bucketname, key))
            s3.ObjectVersion(bucketname, key, marker['VersionId']).delete()

And then the object can be undeleted as simply as this:

./s3-undelete.py phrex temp/pod.yaml
Undeleting s3://phrex/temp/pod.yaml

A safe and secure undelete would restrict the usage of this script to an administrator, in order to limit the use of keys with broader delete permissions.

Finally, create a lifecycle rule to automatically clean up old object versions, i.e., once an object version is no longer the most recent, it is eventually deleted so that space is reclaimed. Similarly, if an object is deleted, the original is kept until the rule expires it, allowing a user to undo that deletion within the lifecycle’s time window.
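As a rough illustration of such a rule, the sketch below builds a lifecycle configuration in the format of the standard S3 API and applies it with boto3. It assumes the object store accepts `put_bucket_lifecycle_configuration` with a noncurrent-version expiration; on some systems the rule is created through the management GUI instead. The function names, the rule ID, and the retention period passed in are hypothetical.

```python
def version_cleanup_rule(days, rule_id="expire-noncurrent-versions"):
    """Build an S3 lifecycle configuration that expires old object versions.

    Versions that have been noncurrent for `days` days are removed, so
    `days` is also the window in which a deletion can still be undone.
    """
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def apply_cleanup_rule(bucketname, days, endpoint_url):
    """Apply the rule to a bucket (assumes lifecycle API support)."""
    import boto3  # imported here so the rule builder works without boto3

    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucketname,
        LifecycleConfiguration=version_cleanup_rule(days),
    )
```

For example, `apply_cleanup_rule("phrex", 7, "http://10.62.64.200")` would keep deleted or overwritten versions for a week, where the 7-day value is purely an example.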

Performance Testing

This section presents performance testing of basic scenarios to help understand when and where the S3 fuse clients are useful. In each test, I compare the fuse clients presenting an object bucket as a “filesystem” with a true NFS shared filesystem.

Test scenario:

  • All tests run against a small 9-blade FlashBlade
  • Client is 16 core, 96 GB DRAM, Ubuntu 20.04
  • Ramdisk used as the source or sink for write and read tests respectively
  • A direct S3 performance test gets 1.1GB/s writes and 1.5GB/s reads.
  • I also compare with a high-performance NFS filesystem, backed by the same FlashBlade, to illustrate the fuse-client overhead.
  • Tested goofys version 0.24.0, s3fs version 1.86, and rclone version 1.50.2

I use filesystem tools like “cp”, “rm”, and “cat” for these tests, but it is important to note that in most cases the filesystem operations will be built into existing legacy applications, e.g., fwrite() and fread(). I chose these tools because they achieve good throughput on native filesystems, are simple to understand, and are easily reproducible.

The summary of performance results is that across read/write and metadata-intensive tests, the performance ordering is goofys, s3fs, and then rclone as the slowest.

Throughput results

As an example, writing to the fuse filesystem serially, where $d is the name of the mount directory under test:

for i in {1..24}; do
    cp /mnt/ramdisk/file_1G /mnt/$d/temp/file_1G_$i
done

The parallel version uses ‘&’ to launch each copy in the background and then ‘wait’ blocks until all background processes complete:

for i in {1..24}; do
    cp /mnt/ramdisk/file_1G /mnt/$d/temp/file_1G_$i &
done
wait
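For applications that do their own file I/O rather than shelling out to ‘cp’, the same serial-versus-parallel comparison can be sketched in Python. This is not the harness used for the results in this post; it is a minimal illustration under the assumption that a thread pool is an acceptable stand-in for background processes, and the function name `copy_many` is hypothetical.

```python
import shutil
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_many(source, dest_dir, count, parallel):
    """Copy source into dest_dir count times; return bytes per second.

    With parallel=False the copies run one after another. With
    parallel=True all copies are in flight at once, like the '&' and
    'wait' version of the shell loop.
    """
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    targets = [dest_dir / "{}_{}".format(source.name, i)
               for i in range(1, count + 1)]

    workers = count if parallel else 1
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any copy error.
        list(pool.map(lambda target: shutil.copyfile(source, target), targets))
    elapsed = max(time.perf_counter() - start, 1e-9)

    return source.stat().st_size * count / elapsed
```

Calling it once with `parallel=False` and once with `parallel=True` against the same mount gives a rough throughput comparison for that client.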

Two observations from the write results. First, goofys is significantly faster than the other fuse clients on serial writes, though still slightly slower than direct NFS. Second, parallelizing the filesystem operations results in improved write speeds in all cases, but goofys is still the fastest.

The second test uses ‘cat’ to read files through the fuse clients, using the same set of 24 1GB files. As with the writes, the reads are tested both serially and in parallel.

Performance trends are similar with goofys fastest for serial reads, but s3fs handles parallel reads slightly better. The more surprising result is that both goofys and s3fs are faster than true NFS for serial reads. This is a consequence of how the linux kernel NFS client performs readahead less aggressively than the fuse clients.

Metadata results

Goofys is fastest for both the untar and the removal operations, but the gap to native NFS is larger than in the throughput tests. This indicates that metadata-heavy workloads suffer a larger performance penalty through a fuse client.

The test to populate the source repo untars files directly into object storage using the fuse layer as intermediary. But this pushes at the edge of where a fuse client makes sense from a performance perspective. Directly untarring to an NFS mount is 6x faster. In this case, an alternative approach of untarring to local storage and then using s5cmd to upload directly to the object store is 5x faster (257 seconds) than goofys! Using local storage as a staging area is faster because the local storage has lower latencies for the serial untar operation and then s5cmd can upload files concurrently. Of course, this technique only works if the local storage has capacity for the temporary storage.
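The staging approach can be sketched as follows: extract the archive to local storage, then hand the resulting tree to s5cmd for a concurrent upload. This is a minimal illustration, not the exact commands used for the measurement above; the function name `stage_and_build_upload` is hypothetical, and it assumes s5cmd is installed and configured with credentials.

```python
import tarfile
from pathlib import Path


def stage_and_build_upload(tar_path, staging_dir, bucket, prefix, endpoint):
    """Untar to local staging and return the s5cmd upload command.

    Only use this with trusted archives. The upload command is returned
    as an argument list instead of being executed here.
    """
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)

    # Use the safer "data" extraction filter where this Python supports it.
    extract_args = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tar_path) as tar:
        tar.extractall(staging_dir, **extract_args)

    # s5cmd expands the wildcard itself and uploads files concurrently.
    return [
        "s5cmd", "--endpoint-url", endpoint,
        "cp", "{}/*".format(staging_dir),
        "s3://{}/{}/".format(bucket, prefix),
    ]
```

The returned list can be run with `subprocess.run(cmd, check=True)`; because no shell is involved, the wildcard reaches s5cmd unexpanded.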

The last test uses the “find” command to find files with a certain extension (“.h” in this case) and exercises metadata responsiveness exclusively. As with the other tests, goofys performs best.
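An application that walks a directory tree itself generates the same kind of metadata traffic as “find”. The sketch below is a minimal Python equivalent with timing, useful for reproducing this kind of test from inside a program; the function name `find_by_extension` is hypothetical.

```python
import os
import time


def find_by_extension(root, extension):
    """Walk root and return (matching paths, elapsed seconds).

    Only directory listings and file names are touched, never file
    contents, so the elapsed time reflects metadata responsiveness.
    """
    start = time.perf_counter()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extension):
                matches.append(os.path.join(dirpath, name))
    return matches, time.perf_counter() - start
```

For example, `find_by_extension("/mnt/fuse_goofys/linux-5.12.13", ".h")` would correspond to the test described above, assuming the source tree was untarred at that path.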

Comparing to AWS

The test scenarios here consist of serial access patterns because this is the default in most workflows. Parallelization often involves modifications of source programs, in which case it is better to simply switch to native S3 accesses.

Note that due to the fuse client, none of these tests actually stress the FlashBlade or AWS throughput bounds. The achieved lower latency of S3 operations on the FlashBlade results in better performance. For simple large, i.e., 1GB, file operations, the FlashBlade’s lower latency results in 28% faster runtimes relative to AWS S3.

In contrast, when writing or removing nested directories with small-to-medium file sizes, the performance advantage increases to 3x-6x faster in favor of FlashBlade. This indicates that the metadata overheads of LIST operations and small objects are much higher with AWS S3.

Summary

Summarizing best practices for when and how to use s3 fuse clients:

  • Best to use for only one part of your data workflow, either simple writing or reading of files.
  • Do not rely on POSIX filesystem features like permissions, file renames, random overwrites, etc.
  • Prefer goofys as a fuse client choice because of superior performance
