Object Storage via Fuse Filesystems
Cloud-native applications must often co-exist with legacy applications. Those legacy applications are hardened and just work, so rewriting can seem hardly worth the trouble.
Legacy applications need bridges to take advantage of new technology, and fuse clients for object storage are one such bridge: they allow most (but not all) applications that expect to read and write files to work in the new world of object storage.
I will focus on three different implementations of a fuse-based filesystem on top of object storage: s3fs, goofys, and rclone. Prior work on performance comparisons of s3fs and goofys includes theoretical upper bounds and the goofys github readme.
General guidelines for when to use a fuse filesystem adaptor for object storage:
- The application expecting files requires only moderate performance and does not have complicated dependencies on POSIX semantics.
- You are using the filesystem adaptor for either reads or writes of the data, but not both. If your application is both reading and writing files, then best to use a real filesystem for the working data and copy only the final results to an object store.
- You are using the adaptor because one part of your data pipeline is an application that expects files, whereas other applications expect objects.
If you find yourself primarily copying data between local filesystems and remote object storage, then tools like s5cmd or rclone will provide better performance.
There is also a python library s3fs with similar functionality, but despite the names being the same, they are distinct pieces of software. The python version indeed makes access to objects much easier than direct boto3 code, but is not as performant due to the nature of python itself.
Amongst the three choices, I personally suggest using goofys due to significantly better performance. It may have less POSIX compatibility, but if that difference matters to your use-case, then a fuse client might not be the right answer.
Fuse Best Practices and Limitations
First, a FUSE client is a filesystem client written in userspace. This is in contrast to most standard filesystem clients, like EXT4 or NFS, which are implemented in the linux kernel. This leads to more flexibility to implement filesystems, including ones that only roughly resemble a traditional filesystem. It also means you can more easily mount fuse filesystems without root privileges.
Conceptually, these fuse clients are lightweight client-side gateways that translate between objects and files. You could also run a separate server that acts as gateway, but that incurs the additional cost and complexity of an extra server.
A fuse client is most useful when one part of a workflow requires simple reading or writing files, whereas the rest of your workflow directly accesses objects via native S3 API. In other words, a fuse client is a tactical choice for bringing a dataset and associated workflow from filesystem to object storage, where the fuse client specifically bridges the gap where an application expects to read or write files.
Things to avoid when using a fuse client:
- Do not expect ownership or permissions to work right; control permissions with your S3 key policies instead.
- Do not use renames (‘mv’ command).
- Avoid lots of directory listing operations.
- Avoid random writes and appends to existing files; write files sequentially instead.
- Do not use symlinks or hard links.
- Do not expect consistency across clients; avoid sharing files through multiple clients with fuse mounts.
- Avoid really large files (1TB or larger).
Both s3fs and goofys publish their respective limitations. One advantage of s3fs is that it preserves file owner/group bits as object custom metadata.
In short, the application using the fuse filesystem should be a simple reader or writer of files. If that does not match your use-case, I would suggest careful consideration before proceeding.
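To make the “simple writer” pattern concrete, here is a minimal Python sketch of writing a file front to back in a single pass, with no seeks, appends, or renames. The function name write_sequentially is my own illustration, and a temporary directory stands in for a real fuse mount, so treat this as a sketch under those assumptions rather than a tested recipe for any particular client.

```python
import os
import tempfile


def write_sequentially(dest_path, chunks):
    """Write chunks to dest_path front to back, in one pass.

    Opens with "wb" (create or truncate), never seeks, never appends to an
    existing file, and never renames: the access pattern recommended for a
    fuse mount in the list above. Returns the number of bytes written.
    """
    total = 0
    with open(dest_path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            total += len(chunk)
    return total


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for a fuse mount such as
    # /mnt/fuse_goofys so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as mount:
        path = os.path.join(mount, "result.bin")
        written = write_sequentially(path, (bytes([i]) * 1024 for i in range(4)))
        print(written, os.path.getsize(path))
```

An application that instead writes to a temporary name and renames it into place, or reopens files to append, is using the operations the list above warns against.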
Installation and Mounting Instructions
Basics
Installing s3fs is straightforward on a variety of platforms such as ‘apt’ on Ubuntu.
sudo apt install s3fs
The mount operation uses two additional options to specify the endpoint as the FlashBlade data VIP and to use path-style requests.
sudo mkdir -p /mnt/fuse_s3fs && sudo chown $USER /mnt/fuse_s3fs
s3fs $BUCKETNAME /mnt/fuse_s3fs -o url=http://10.62.64.200 -o use_path_request_style
The FlashBlade’s data VIP is 10.62.64.200 in all the example commands.
Install goofys by downloading the standalone binary from the github release page:
wget -N https://github.com/kahing/goofys/releases/latest/download/goofys
chmod a+x goofys
Then mount a bucket as a filesystem as follows:
sudo mkdir -p /mnt/fuse_goofys && sudo chown $USER /mnt/fuse_goofys
./goofys --endpoint=http://10.62.64.200 $BUCKETNAME /mnt/fuse_goofys
With goofys you can also mount specific prefixes, i.e., mount only a “subdirectory” and limit the visibility of data via fuse to just a certain key prefix.
goofys <bucket:prefix> <mountpoint>
Rclone-mount relies on the same installation and configuration as standard rclone. This means that if you are already using rclone, it is trivial to also mount a bucket. In the following rclone.conf entry, “fb” refers to my FlashBlade’s s3 configuration:
[fb]
type = s3
env_auth = true
region = us-east-1
endpoint = http://10.62.64.200
Replace the endpoint with the appropriate IP address and then mount with the following command:
rclone --vfs-cache-mode writes mount fb:$BUCKETNAME /mnt/fuse_rclone &
Note that I use the ampersand operator to background the mounting operation as the default is to keep rclone in the foreground.
Simulating a Directory Structure with Object Keys
When using a fuse client with S3, a “mkdir” operation corresponds to creating an empty object with a key that ends in a “/” character. In other words, the directory marker is explicitly created even though the “/” is not a special character in an object store. The “/” indicates a directory by convention.
The other common approach leaves directories implicit in the key structure, meaning no extra empty placeholder objects. While the explicit markers created by fuse clients may complicate some tooling, they also mean that the fuse client approach supports empty directories as you would expect in a filesystem.
But if you are reading a file structure that was laid out using implicit directories, it will still work the same!
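The difference between explicit directory markers and implicit directories can be sketched with a few lines of Python over a plain list of keys. The function name directory_view and the sample keys are made up for illustration; this only models the naming convention and is not how any of the fuse clients is actually implemented.

```python
def directory_view(keys):
    """Split object keys into files, explicit directory markers, and
    directories that are only implied by longer keys."""
    files, explicit, implied = set(), set(), set()
    for key in keys:
        if key.endswith("/"):
            explicit.add(key)   # empty marker object, e.g. created by mkdir
        else:
            files.add(key)
        # Every "/"-terminated prefix of a key implies a directory.
        parents = key.rstrip("/").split("/")[:-1]
        for depth in range(1, len(parents) + 1):
            implied.add("/".join(parents[:depth]) + "/")
    return {
        "files": sorted(files),
        "explicit_dirs": sorted(explicit),
        "implicit_only_dirs": sorted(implied - explicit),
    }


# Hypothetical bucket contents: two files laid out with implicit
# directories, plus one empty directory created through a fuse mount.
view = directory_view(["data/a.txt", "data/raw/b.txt", "empty/"])
print(view["explicit_dirs"])       # ['empty/']
print(view["implicit_only_dirs"])  # ['data/', 'data/raw/']
```

Only the explicit marker lets “empty/” exist without any files in it, while “data/” and “data/raw/” appear as directories purely because longer keys share those prefixes.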
Permissions
One of the main challenges of using fuse clients is the fact that standard POSIX permissions no longer work as expected. Due to the mismatch between file and object permission models, I recommend restricting permissions by using access policies on the keys used by the fuse client. This means that regardless of how fuse clients apply or even ignore permissions bits (via “chmod”), the read/write/delete permissions are strictly enforced at the storage layer.
Angle 1: Reader
The following two FlashBlade Access Policies are required to configure the fuse client for read-only application usage: object-list and object-read.
Note that if clients try to write files without permission, it is possible to see inconsistencies. For example, if I touch a file with read-only permission and goofys, an immediate listing (‘ls’) will see a phantom file which eventually goes away. The ‘touch’ command does fail, so many but not all programs or scripts that unexpectedly write should fail.
$ touch foo
touch: failed to close ‘foo’: Permission denied
$ ls
foo linux-5.12.13
…
$ ls
linux-5.12.13
Most operations fail without the “list” permission due to expectations of being able to browse directory structures, but for example, it is still possible to read individual files with ‘cat’ without the object-list policy enabled.
Alternatively, you can mount using goofys’s flag “-o ro” for read-only access, but using keys and access policies provides stronger protections than mounting in read-only mode. Restricting permission with keys prevents users from simply re-mounting without “-o ro” to work around an issue.
And of course, without the object-read permission, the client can list directories and files but not access any of the file content.
$ cat pod.yaml
cat: pod.yaml: Permission denied
Angle 2: Writer
The second major way to use fuse clients for S3 access is for file-based applications to write data to an object store. For these applications, the required policies are object-list and object-write.
With write and list permissions, I can write files and read them back locally for a short period of time due to local caching. Note that writing appears to require the ‘list’ permission, and that the write permission also allows overwriting existing files.
Enabling Deletions
Sometimes, in addition to write permissions, the client also needs the ability to delete files. Enable the “pure:policy/object-delete” policy to allow “rm” commands.
See the following section on “undo” for more information about how to combine deletions with the ability to undo those deletions when necessary.
Full Control
For the most flexible control of files within the mount, use the following policies: object-list, object-read, object-write, and object-delete.
This avoids giving users more permissions than necessary, for example the ability to create and delete buckets, etc, but they can still write, read, and delete files.
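As a planning aid, the policy requirements described in the sections above can be restated as a small lookup table. The mapping below is my own summary of the observations in this article (for example, that ‘cat’ works without object-list but writes appear to need it); the names REQUIRED_POLICIES and policies_for are hypothetical, and you should verify the minimal set for your own workload.

```python
# Minimum policies per filesystem command, as observed in this article.
# Assumption: short policy names stand for the corresponding
# "pure:policy/..." access policies.
REQUIRED_POLICIES = {
    "ls": {"object-list"},
    "cat": {"object-read"},
    "cp": {"object-list", "object-write"},  # writes appear to need list
    "rm": {"object-delete"},
}


def policies_for(operations):
    """Return the sorted union of policies needed for the given commands."""
    needed = set()
    for op in operations:
        needed |= REQUIRED_POLICIES[op]
    return sorted(needed)


print(policies_for(["ls", "cat"]))  # reader
print(policies_for(["ls", "cp"]))   # writer
print(policies_for(["ls", "cat", "cp", "rm"]))  # full control of files
```

Starting from the commands an application actually issues keeps the key’s permissions as narrow as possible.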
Bonus: Undo an Accidental Deletion
Object stores support object versioning, which provides functionality beyond traditional filesystems. Versioning keeps multiple copies of an object if a key is overwritten and inserts a DeleteMarker instead of erasing data when deletes are issued. An associated lifecycle policy ensures that deleted or overwritten data is eventually deleted.
First, enable versioning on the bucket if not already so. In the FlashBlade GUI’s bucket view, the “Enable versioning…” can be accessed on the upper right corner.
And then in order to undelete files that have been accidentally deleted, you can simply go find the delete marker and remove it. There is no “undelete” operation at the filesystem level, so this needs to be out-of-band through a different mechanism or script.
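To illustrate why removing a delete marker brings a file back, here is a toy in-memory model of a versioned bucket. The class VersionedBucket and its methods are invented for this sketch and are not part of boto3 or any S3 client; the model only captures the idea that overwrites and deletes stack new entries on top of old ones instead of erasing data.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioned bucket (not a real S3 client)."""

    def __init__(self):
        self._versions = {}             # key -> list of entries, oldest first
        self._ids = itertools.count(1)  # stand-in for S3 version ids

    def put(self, key, data):
        # An overwrite adds a new version; older versions are kept.
        self._versions.setdefault(key, []).append(
            {"id": next(self._ids), "data": data, "delete_marker": False})

    def delete(self, key):
        # A delete stacks a marker on top; no data is erased.
        self._versions.setdefault(key, []).append(
            {"id": next(self._ids), "data": None, "delete_marker": True})

    def get(self, key):
        entries = self._versions.get(key, [])
        if not entries or entries[-1]["delete_marker"]:
            raise KeyError(key)         # the object looks deleted
        return entries[-1]["data"]

    def undelete(self, key):
        # Removing the latest delete marker exposes the previous version.
        entries = self._versions.get(key, [])
        if entries and entries[-1]["delete_marker"]:
            entries.pop()
            return True
        return False


bucket = VersionedBucket()
bucket.put("temp/pod.yaml", b"v1")
bucket.put("temp/pod.yaml", b"v2")
bucket.delete("temp/pod.yaml")
bucket.undelete("temp/pod.yaml")
print(bucket.get("temp/pod.yaml"))  # b'v2'
```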
An example python script (gist here) to undelete an object by removing its DeleteMarker:
#!/usr/bin/python3
import boto3
import sys

FB_DATAVIP = '10.62.64.200'

if len(sys.argv) != 3:
    print("Usage: {} bucketname key".format(sys.argv[0]))
    sys.exit(1)

bucketname = sys.argv[1]
key = sys.argv[2]

s3 = boto3.resource('s3', endpoint_url='http://' + FB_DATAVIP)
# Prefix narrows the version listing to keys that start with the requested key.
kwargs = {'Bucket' : bucketname, 'Prefix' : key}
pageresponse = s3.meta.client.get_paginator('list_object_versions').paginate(**kwargs)

for pageobject in pageresponse:
    # Only the first DeleteMarker on each page is checked against the key.
    if 'DeleteMarkers' in pageobject.keys() and pageobject['DeleteMarkers'][0]['Key'] == key:
        print("Undeleting s3://{}/{}".format(bucketname, key))
        # Deleting the marker's own version makes the previous version visible again.
        s3.ObjectVersion(bucketname, key, pageobject['DeleteMarkers'][0]['VersionId']).delete()
And then the object can be undeleted as simply as this:
./s3-undelete.py phrex temp/pod.yaml
Undeleting s3://phrex/temp/pod.yaml
A safe and secure undelete would restrict the usage of this script to an administrator, in order to limit the use of keys with broader delete permissions.
Finally, create a lifecycle rule to automatically clean up old object versions, i.e., once a version of an object is no longer the most recent, it is eventually deleted so that space is reclaimed. Similarly, if an object is deleted, the original is kept for the length of the lifecycle’s time window, allowing a user to undo that deletion within that window.
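For readers who prefer scripting over the GUI, a lifecycle rule of this kind can be sketched with the standard S3 lifecycle API in boto3. This is an assumption-laden sketch: I have not confirmed which lifecycle rule elements a given FlashBlade release accepts through the S3 API, and the function names, rule ID, and 7-day window are illustrative choices only.

```python
def build_lifecycle_rule(noncurrent_days, rule_id="expire-old-versions"):
    """Build an S3 lifecycle configuration that expires noncurrent versions.

    noncurrent_days is also the undo window: overwritten or deleted data
    stays recoverable until its noncurrent version expires.
    """
    return {
        "Rules": [{
            "ID": rule_id,                 # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},      # empty prefix: the whole bucket
            "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        }]
    }


def apply_lifecycle_rule(bucketname, endpoint_url, noncurrent_days):
    """Apply the rule; needs boto3, credentials, and a reachable endpoint."""
    import boto3  # imported here so building the rule has no dependencies
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucketname,
        LifecycleConfiguration=build_lifecycle_rule(noncurrent_days))


if __name__ == "__main__":
    # Print the rule only; call apply_lifecycle_rule() against a real bucket.
    print(build_lifecycle_rule(7))
```

A longer window gives users more time to notice a mistake, at the cost of holding overwritten and deleted data for longer.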
Performance Testing
While a fuse client for S3 is never the highest-performing data access path, it is important to understand the performance differences between the three clients (s3fs, goofys, and rclone) and how they compare with traditional shared filesystems like NFS. The goal of this section is to understand when fuse clients are useful and which client performs best.
This section presents performance testing of basic scenarios to help understand when and where the S3 fuse clients are useful. In each test, I compare the fuse clients presenting an object bucket as a “filesystem” with a true NFS shared filesystem.
Test scenario:
- All tests run against a small 9-blade FlashBlade
- Client is 16 core, 96 GB DRAM, Ubuntu 20.04
- Ramdisk used as the source or sink for write and read tests respectively
- A direct S3 performance test gets 1.1GB/s writes and 1.5GB/s reads.
- I also compare with a high-performance NFS filesystem, backed by the same FlashBlade, to illustrate the fuse-client overhead.
- Tested goofys version 0.24.0, s3fs version v1.86, and rclone version 1.50.2
I use filesystem tools like “cp”, “rm”, and “cat” for these tests, but it is important to note that in most cases the filesystem operations will be built into existing legacy applications, e.g., fwrite() and fread(). I chose these tools because they achieve good throughput on native filesystems, are simple to understand, and are easily reproducible.
The summary of performance results is that across read/write and metadata-intensive tests, the performance ordering is goofys, s3fs, and then rclone as the slowest.
Throughput results
The first test reads and writes large files to determine basic throughput of each fuse client. I either write via “cp” or read via “cat” 24 files, each 1GB in size. Each test is repeated with files accessed serially or in parallel.
As an example, writing to the fuse filesystem serially:
for i in {1..24}; do
cp /mnt/ramdisk/file_1G /mnt/$d/temp/file_1G_$i
done
The parallel version uses ‘&’ to launch each copy in the background and then ‘wait’ blocks until all background processes complete:
for i in {1..24}; do
cp /mnt/ramdisk/file_1G /mnt/$d/temp/file_1G_$i &
done
wait
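If you want to reproduce this kind of comparison from a script, the same serial and parallel copy patterns can be timed in Python. This is a rough sketch rather than the harness used for the results in this article: it uses threads in place of background ‘cp’ processes, the function name timed_copies is my own, and the demo copies a small 1MB file between temporary directories instead of 1GB files onto a fuse mount.

```python
import os
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def timed_copies(src, dest_dir, count, parallel):
    """Copy src into dest_dir `count` times.

    Returns (elapsed_seconds, bytes_per_second).
    """
    targets = [os.path.join(dest_dir, "copy_{}".format(i))
               for i in range(1, count + 1)]
    start = time.perf_counter()
    if parallel:
        # One worker per file, similar to launching every cp with '&'.
        with ThreadPoolExecutor(max_workers=count) as pool:
            list(pool.map(lambda dst: shutil.copyfile(src, dst), targets))
    else:
        for dst in targets:
            shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(src) * count / elapsed


if __name__ == "__main__":
    # Assumption: temporary directories stand in for the ramdisk source and
    # the fuse mount; point dest_dir at a real mount to measure a client.
    with tempfile.TemporaryDirectory() as src_dir, \
            tempfile.TemporaryDirectory() as dst_dir:
        src = os.path.join(src_dir, "file_1M")
        with open(src, "wb") as f:
            f.write(os.urandom(1 << 20))
        for mode in (False, True):
            secs, rate = timed_copies(src, dst_dir, 8, parallel=mode)
            label = "parallel" if mode else "serial"
            print("{}: {:.4f}s, {:.1f} MB/s".format(label, secs, rate / 1e6))
```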
Two observations from the write results. First, goofys is significantly faster than the other fuse clients on serial writes, though still slightly slower than direct NFS. Second, parallelizing the filesystem operations results in improved write speeds in all cases, but goofys is still the fastest.
The second test uses ‘cat’ to read files through the fuse clients, using the same set of 24 1GB files. As with the writes, the reads are tested both serially and in parallel.
Performance trends are similar with goofys fastest for serial reads, but s3fs handles parallel reads slightly better. The more surprising result is that both goofys and s3fs are faster than true NFS for serial reads. This is a consequence of how the linux kernel NFS client performs readahead less aggressively than the fuse clients.
Metadata results
The next set of tests focuses on metadata-intensive workloads: small files, nested directories, listings, and recursive deletes. The test dataset is the linux-5.12.13 source code, which contains roughly 1GB of data in 4700 directories and 71k files. The average file size is 14KB.
Goofys is the fastest fuse client for both the untar and the removal operations, but the gap to native NFS is larger than in the throughput tests. This indicates that metadata-intensive workloads suffer a larger fuse performance penalty relative to native NFS.
The test to populate the source repo untars files directly into object storage using the fuse layer as intermediary. But this pushes at the edge of where a fuse client makes sense from a performance perspective. Directly untarring to an NFS mount is 6x faster. In this case, an alternative approach of untarring to local storage and then using s5cmd to upload directly to the object store is 5x faster (257 seconds) than goofys! Using local storage as a staging area is faster because the local storage has lower latencies for the serial untar operation and then s5cmd can upload files concurrently. Of course, this technique only works if the local storage has capacity for the temporary storage.
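The staging approach has a simple shape: a serial untar against low-latency local storage, followed by concurrent uploads. The Python sketch below shows that shape with a thread pool standing in for s5cmd and a pluggable upload callback, so it is not the method used for the timings above. The name stage_and_upload and the worker count are my own assumptions.

```python
import os
import tarfile
import tempfile
from concurrent.futures import ThreadPoolExecutor


def stage_and_upload(tar_path, upload, workers=32):
    """Untar to local scratch space, then upload every file concurrently.

    `upload` is any callable taking (local_path, object_key). Returns the
    sorted list of object keys that were uploaded.
    """
    with tempfile.TemporaryDirectory() as staging:
        # Serial extraction, but against fast local storage.
        # Only use this on archives you trust.
        with tarfile.open(tar_path) as tar:
            tar.extractall(staging)
        jobs = []
        for root, _dirs, names in os.walk(staging):
            for name in names:
                local = os.path.join(root, name)
                key = os.path.relpath(local, staging).replace(os.sep, "/")
                jobs.append((local, key))
        # Many small uploads in flight at once hide per-object latency.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda job: upload(*job), jobs))
        return sorted(key for _local, key in jobs)
```

With boto3, the callback could be something like lambda local, key: client.upload_file(local, bucketname, key), where client and bucketname are whatever your environment provides. As with the s5cmd approach, this only works if local storage has room for the extracted files.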
The last test uses the “find” command to find files with a certain extension (“.h” in this case) and exercises metadata responsiveness exclusively. As with the other tests, goofys performs best.
Comparing to AWS
Next, I focus on the fastest client, goofys, and compare performance when using either the FlashBlade or AWS S3 as the backing object store. I compare relative performance on the four major test scenarios previously presented: writing and reading large files, and then copying and removing a source code repository with directories and mixed file sizes. To match the VM used to test against the FlashBlade, I used a single m5.4xlarge instance with Ubuntu 20.04.
The test scenarios here consist of serial access patterns because this is the default in most workflows. Parallelization often involves modifications of source programs, in which case it is better to simply switch to native S3 accesses.
Note that due to the fuse client, none of these tests actually stress the FlashBlade or AWS throughput bounds. The achieved lower latency of S3 operations on the FlashBlade results in better performance. For simple large, i.e., 1GB, file operations, the FlashBlade’s lower latency results in 28% faster runtimes relative to AWS S3.
In contrast, when writing or removing nested directories with small-to-medium file sizes, the performance advantage increases to 3x-6x faster in favor of FlashBlade. This indicates that the metadata overheads of LIST operations and small objects are much higher with AWS S3.
Summary
Goofys, s3fs, and rclone-mount are fuse clients that enable the use of an object store with applications that expect files. These fuse clients enable the migration of workflows to object storage even when you have legacy file-based applications. Those applications expecting files can still work with objects through the fuse client layer.
Summarizing best practices for when and how to use s3 fuse clients:
- Best to use for only one part of your data workflow, either simple writing or reading of files.
- Do not rely on POSIX filesystem features like permissions, file renames, random overwrites, etc.
- Prefer goofys as a fuse client choice because of superior performance