In this post, I look at what it takes to list all keys in a single bucket with 67 billion objects and build a simple list benchmark program in Golang. This parallelized program is able to list keys at a rate of 430k per second using a small FlashBlade as the S3 backend.
67 billion objects may sound like an unreasonably high number. It is always possible to build applications that create fewer objects by coalescing and aggregating data. But re-architecting an application that has been in production through years of organic growth is often challenging. …
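As a rough sketch of how such a parallel lister can be structured (not the exact program from the post), the following Go snippet shards ListObjectsV2 paging across key prefixes and counts keys; the bucket name, endpoint, region, and prefix set are placeholder assumptions.

```go
// Minimal sketch: count keys in a bucket by listing a set of prefixes in parallel.
// Bucket name, endpoint, region, and prefixes are placeholders, not the benchmark's values.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := "example-bucket" // hypothetical bucket name
	// Shard the listing by the first character of the key; assumes keys span these prefixes.
	prefixes := []string{"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f"}

	// Credentials come from the default chain (e.g., AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
	sess := session.Must(session.NewSession(&aws.Config{
		Endpoint:         aws.String("https://s3.example.com"), // e.g., a FlashBlade data VIP
		Region:           aws.String("us-east-1"),              // placeholder; the SDK requires a region
		S3ForcePathStyle: aws.Bool(true),
	}))
	client := s3.New(sess)

	var total int64
	var wg sync.WaitGroup
	for _, p := range prefixes {
		wg.Add(1)
		go func(prefix string) {
			defer wg.Done()
			input := &s3.ListObjectsV2Input{Bucket: aws.String(bucket), Prefix: aws.String(prefix)}
			// Each goroutine pages through its own prefix independently of the others.
			err := client.ListObjectsV2Pages(input, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
				atomic.AddInt64(&total, int64(len(page.Contents)))
				return true // keep paging
			})
			if err != nil {
				fmt.Println("list error:", err)
			}
		}(p)
	}
	wg.Wait()
	fmt.Println("total keys:", total)
}
```

Sharding by prefix is what makes the listing parallelizable: each goroutine walks its own slice of the keyspace, so total throughput scales with the number of prefixes and the backend's ability to serve concurrent list requests.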
Comparing Text, JSON, Parquet, and Elasticsearch
Observability and diagnosability require collecting logs from a huge variety of sources, e.g., firewalls, routers, servers, and applications, into a central location for analysis. These logs help with troubleshooting, optimization, and predictive maintenance, all of which benefit from longer timeframes of log data. But the prospect of storing all log data for months or years inevitably results in enormous capacity requirements. And to complicate matters, there are tradeoffs in how to best store and use the log data cost-effectively.
I will contrast six different open-source storage formats, quantifying their tradeoffs in 1) storage used, 2) access speed, and 3) transformation cost, along with qualitative comparisons of simplicity and ease of use. Specifically, I will test raw text and JSON-encoded logs, each with and without gzip compression, Parquet tables, and Elasticsearch indices. This is not an exhaustive study of all options, but rather a high-level survey of the tradeoffs. …
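To make the first few of these formats concrete, here is a small illustrative Go sketch, with hypothetical log fields, that writes the same record as newline-delimited JSON both uncompressed and gzip-compressed; the Parquet and Elasticsearch paths are not shown.

```go
// Illustrative sketch only: encode a log record as JSON and write it plain and gzip-compressed.
// The log fields and file names are hypothetical, not taken from the benchmark in the post.
package main

import (
	"compress/gzip"
	"encoding/json"
	"log"
	"os"
)

type logRecord struct {
	Timestamp string `json:"timestamp"`
	Host      string `json:"host"`
	Severity  string `json:"severity"`
	Message   string `json:"message"`
}

func main() {
	rec := logRecord{
		Timestamp: "2020-10-01T12:00:00Z",
		Host:      "fw-01",
		Severity:  "INFO",
		Message:   "connection accepted",
	}

	// Plain newline-delimited JSON: simple and universally readable, but verbose on disk.
	plain, err := os.Create("logs.json")
	if err != nil {
		log.Fatal(err)
	}
	defer plain.Close()
	if err := json.NewEncoder(plain).Encode(&rec); err != nil {
		log.Fatal(err)
	}

	// The same JSON stream through a gzip writer: trades CPU at write/read time for a smaller footprint.
	compressed, err := os.Create("logs.json.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer compressed.Close()
	gz := gzip.NewWriter(compressed)
	defer gz.Close()
	if err := json.NewEncoder(gz).Encode(&rec); err != nil {
		log.Fatal(err)
	}
}
```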
Confluent recently announced the general availability of Tiered Storage in the Confluent Platform 6.0 release, a new feature for managing and simplifying storage in large Kafka deployments. Because this feature was previously available in tech preview, I have already been able to test the solution with an on-prem object store, FlashBlade.
Kafka provides cornerstone functionality for any data pipeline: the ability to reliably pass data from one service or place to another. Kafka brings the advantages of microservice architectures to data engineering and data science projects. There are always more data sources to ingest, and longer retention periods increase pipeline reliability in the event of unplanned outages. …
One of the biggest challenges in making use of data is avoiding the complexity of making N copies of a dataset to be used with N different analytics systems. Dataset sprawl and application silos are quicksand for productivity. Meanwhile, the backups of your important tables sit idle, so perhaps they could do more. This post describes how PrestoSQL can help you leverage your backup dataset to enable dev/test and analytics too.
PrestoSQL is a distributed query engine capable of bringing SQL analytics to a wide variety of data stores, including an S3 object store like Pure FlashBlade. …
Monitoring infrastructure is essential for keeping production workloads healthy and for debugging issues when things go wrong; observability is what makes that troubleshooting possible.
The goal of this post is to show how to quickly and easily deploy a minimal open-source monitoring stack with Prometheus and Grafana in Kubernetes.
The full YAML for the examples discussed can be found in the accompanying GitHub repo.
The Prometheus ecosystem continues to improve; the Prometheus Operator and its associated bundled projects, while promising, are still in beta and improving in usability. Docker containers make these applications particularly easy to run and configure, and Kubernetes adds resilience.
The target audience for this post has a basic understanding of Kubernetes and is new to Prometheus/Grafana. …
Many scale-out data tools, like NoSQL databases, expand cluster capacity by either adding new nodes or adding new drives within each node. These applications were built for direct-attached storage, where adding storage to a node was labor-intensive and space-limited. Modern architectures disaggregate compute and storage, making it easy to scale out and increase available storage when needed.
Modernizing these scale-out analytics applications for a Kubernetes environment (example: the ECK operator) requires pre-allocated storage for each pod. But in Kubernetes, pods and containers are usually ephemeral and not tied to a specific physical node. Requiring local storage for a pod restricts Kubernetes’ ability to schedule pods efficiently. As a result, a disaggregated storage system enables applications to better take advantage of the scalability and self-healing functionality of Kubernetes. …
Benchmarking helps build an understanding of your underlying infrastructure and validate that environments are configured correctly. Though benchmarks can never fully reflect real workloads and user experience, a well-done set of storage benchmarks is still useful for “burn-in” testing and for stress-testing the system with additional load.
But if your team is all-in on Kubernetes, do you need to start from scratch and create new benchmarking tools? Fortunately, it is straightforward to use existing benchmark tools, like fio (flexible i/o tester), along with native Kubernetes PersistentVolumes and Container Storage Interface (CSI) provisioners to test your storage.
I will describe two different approaches for running parallel fio tests in Kubernetes and how they provision storage differently: 1) a simple Deployment and PersistentVolumeClaim for RWX volumes and 2) a StatefulSet for RWO/RWX volumes. The fio config itself is kept simple in order to focus on Kubernetes storage concepts, and I use dynamic volume provisioning to avoid manually creating test volumes. …
Running stateful applications on Kubernetes is becoming easier (still not easy!) and more mainstream. The use of Container Storage Interface (CSI) plugins like the Pure Service Orchestrator (PSO) has made automating storage tasks more predictable and well-supported. But there are crucial day-2 administrative tasks for storage that are still very difficult, including gaining visibility into how Kubernetes applications use storage.
PSO-analytics is a simple Kubernetes-native utility that provides insight into how different applications are consuming storage through PSO. The accompanying GitHub repository contains all source code and configs.
Stateful applications in Kubernetes manage storage through PersistentVolumeClaims and use a CSI provisioner, but there is no visibility into storage layer metrics. The only information visible through the Kubernetes API is the capacity allocation per PersistentVolume, which leaves almost all useful storage administration questions…
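To illustrate how limited that native view is, here is a minimal Go sketch (not the actual PSO-analytics implementation) that uses client-go to total PVC capacity requests per namespace; requested capacity is essentially all the API reports, with no insight into what is actually consumed on the backing storage.

```go
// Minimal sketch: sum PersistentVolumeClaim storage requests per namespace via client-go.
// This is an illustration of what the Kubernetes API exposes, not the PSO-analytics code.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster with a service account allowed to list PVCs.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Empty namespace string lists PVCs across all namespaces.
	pvcs, err := clientset.CoreV1().PersistentVolumeClaims("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	perNamespace := map[string]int64{}
	for _, pvc := range pvcs.Items {
		if request, ok := pvc.Spec.Resources.Requests[corev1.ResourceStorage]; ok {
			perNamespace[pvc.Namespace] += request.Value() // bytes requested, not bytes actually used
		}
	}
	for ns, bytes := range perNamespace {
		fmt.Printf("%s: %d bytes requested\n", ns, bytes)
	}
}
```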
Object storage usage continues to broaden and move into applications and analytics that demand high performance. These use cases also require high-performance utilities, both for working with object storage and for benchmarking its performance limits.
S5cmd, a versatile and high-performance utility for working with object storage, has reached its v1.0 milestone. This blog post is a follow-up to my previous post on s5cmd and dives into more advanced usage and benchmarking. For basic usage, please refer to the GitHub README or my previous post.
S5cmd continues to be a high-performance command-line tool for interacting with an object store and automating tasks, so it pairs well with an all-flash object store like FlashBlade. …
This post presents a modern data warehouse implemented with Presto and FlashBlade S3, using Presto to ingest data and then transform it into a queryable data warehouse.
A common first step in a data-driven project is to make large data streams available for reporting and alerting through a SQL data warehouse. I will illustrate this step with my data pipeline and modern data warehouse using Presto and S3 in Kubernetes, building on my Presto infrastructure (part 1: basics, part 2: on Kubernetes) with an end-to-end use case.
A basic data pipeline will 1) ingest new data, 2) perform simple transformations, and 3) load the results into a data warehouse for querying and reporting. The only required ingredients for my modern data pipeline are a high-performance object store, like FlashBlade, and a versatile SQL engine, like Presto. There are many variations not considered here that could also leverage the versatility of Presto and FlashBlade S3. …