Listing 67 Billion Objects in 1 Bucket

Building this Dataset

Linear Scaling of Lists at Small Scales

s5cmd --endpoint-url http://10.62.116.100 ls 's3://mybucket/*' | wc -l

Parallelized Client Code

Python

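import boto3

# FB_DATAVIP, bucketname, and prefix are assumed to be defined earlier.
s3 = boto3.client('s3', use_ssl=False,
                  endpoint_url='http://' + FB_DATAVIP)
kwargs = {'Bucket': bucketname, 'Prefix': prefix}
paginator = s3.get_paginator("list_objects_v2")
count = 0
# Each page is one list_objects_v2 response.
for page in paginator.paginate(**kwargs):
    pass  # … work…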
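The paginator loop above walks one prefix serially. A minimal sketch of how several prefixes could be listed concurrently follows. It is not necessarily how s3-rapid-list.py does it. The helper names `count_prefix` and `count_bucket`, the `workers` default, and the use of a thread pool are all assumptions made for illustration. It also assumes the caller supplies prefixes that partition the keyspace without overlap, since overlapping prefixes would double-count.

```python
from concurrent.futures import ThreadPoolExecutor


def count_prefix(s3, bucket, prefix):
    """Count the keys under one prefix by paging through list_objects_v2."""
    paginator = s3.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching keys has no "Contents" entry at all.
        count += len(page.get("Contents", []))
    return count


def count_bucket(s3, bucket, prefixes, workers=32):
    """Count each prefix in its own thread and sum the per-prefix totals.

    `workers` is a hypothetical default, not a tuned value.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: count_prefix(s3, bucket, p), prefixes))
```

With the client created as above, a call might look like `count_bucket(s3, bucketname, ["a/", "b/"])`, where the prefix list is hypothetical and depends on how the keys are named. Threads suit this sketch because each worker spends most of its time waiting on the network, but a process pool may scale better if parsing the responses becomes CPU-bound.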

Golang

// Point the SDK at the S3 endpoint, using path-style addressing without TLS.
s3Config := &aws.Config{
	Endpoint:         aws.String("http://10.62.116.100"),
	Region:           aws.String("us-east-1"),
	DisableSSL:       aws.Bool(true),
	S3ForcePathStyle: aws.Bool(true),
}
sess := session.Must(session.NewSession(s3Config))
svc := s3.New(sess)
count := 0
// The callback runs once per page; returning true requests the next page.
err := svc.ListObjectsPages(&s3.ListObjectsInput{
	Bucket: bucketname,
	Prefix: &pfix,
}, func(p *s3.ListObjectsOutput, _ bool) (shouldContinue bool) {
	count += len(p.Contents)
	return true
})
if err != nil {
	log.Fatal(err)
}

Comparing Implementations, Small Scale

Listing 67 Billion Objects

$ time ./s3-rapid-list.py mybucket
66880635288
real 10148m6.915s
user 299745m42.624s
sys 6062m19.940s
$ time ./run-go.sh mybucket
Count 66880635288
real 2593m2.901s
user 0m13.712s
sys 0m0.940s
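To put the two wall-clock figures on a common footing, the `real` times above can be converted to seconds and divided into the object count. The sketch below does only that arithmetic; the helper name `parse_real` is an assumption, and the inputs are copied from the output above.

```python
import re


def parse_real(duration):
    """Convert a `time` duration such as '2593m2.901s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


OBJECTS = 66_880_635_288  # count reported by both runs
python_s = parse_real("10148m6.915s")
go_s = parse_real("2593m2.901s")

for name, secs in (("python", python_s), ("go", go_s)):
    # Wall-clock duration in days and average listing rate.
    print(f"{name}: {secs / 86400:.2f} days, {OBJECTS / secs:,.0f} objects/s")
print(f"go vs python: {python_s / go_s:.2f}x faster")
```

By this arithmetic the Python run took about 7 days at roughly 110,000 objects per second, and the Go run about 1.8 days at roughly 430,000 objects per second, a wall-clock ratio of about 3.9.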

Conclusion
