A Quick and Easy Tool with No Dependencies

Spent a few hours trying to debug why Apache Spark on FlashBlade is slower than expected only to realize you have an underlying networking issue?

Flashblade-plumbing is a tool to validate NFS and S3 read/write performance from a single client to a FlashBlade with minimal dependencies and input required. The only inputs required are the FlashBlade’s management IP and login token; after a few minutes, the tool outputs the read and write throughput for both NFS and S3. …

RESTful interfaces make it easier for other applications to interact with modern applications. “Easy” means you can use a variety of languages and tools to build scripts and automation around the application and its REST API, instead of relying only on manual clicks and a GUI.

You may already be familiar with curl for downloading files from websites. But it is also a flexible and convenient REST client! This post will walk through the curl options and invocations I find most useful.

The big advantage of curl is that it’s often already available on your machine or easily installed on a wide…

Dremio is a rapidly growing scale-out analytics engine that is part of a new generation of data lake services, emphasizing the power and flexibility of disaggregated storage like FlashBlade. Two performance-oriented technologies power Dremio queries: Apache Arrow, a quick in-memory data format, and Gandiva, a high-performance toolset for querying Arrow data.

A key advantage of using Dremio is the ability to query data in-place on FlashBlade, using either NFS or S3, instead of needing to import or copy it. …

In this post, I look at what it takes to list all keys in a single bucket with 67 billion objects and build a simple list benchmark program in Golang. This parallelized program is able to list keys at a rate of 430k per second using a small FlashBlade as the S3 backend.

67 billion objects may sound like an unreasonably high number. It is always possible to build applications to create fewer objects by coalescing and aggregating data. But sometimes re-architecting an application after being deployed in production for years of organic growth is challenging. …
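
As a rough illustration of why parallelism helps here: a paginated list API returns keys one page at a time, so a single lister is bound by round trips, while independent prefixes can be paged concurrently. The sketch below is in Python rather than the Golang of the original program, and it assumes the keyspace can be partitioned by prefix, which the excerpt above does not state. The in-memory `BUCKET` and the helpers `list_page`, `list_prefix`, and `parallel_list` are hypothetical stand-ins, not the author's code or a real S3 client.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a bucket: 200 sorted keys held in memory.
BUCKET = sorted(f"{c}{i:03d}" for c in "abcd" for i in range(50))


def list_page(prefix, start_after, max_keys):
    """Return up to max_keys keys with this prefix, strictly after start_after."""
    matches = [k for k in BUCKET if k.startswith(prefix) and k > start_after]
    return matches[:max_keys]


def list_prefix(prefix, max_keys=20):
    """Page through one prefix sequentially, as a paginated list API requires."""
    keys, start_after = [], ""
    while True:
        page = list_page(prefix, start_after, max_keys)
        if not page:
            return keys
        keys.extend(page)
        # Resume the next page after the last key seen so far.
        start_after = page[-1]


def parallel_list(prefixes, workers=4):
    """List each prefix in its own worker thread and concatenate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(list_prefix, prefixes)
    return [k for keys in results for k in keys]


all_keys = parallel_list(["a", "b", "c", "d"])
print(len(all_keys))
```

Against a real object store the per-page call would be a network request, so the threads would overlap on I/O wait; how evenly the prefixes split the keys then determines how much of the parallelism is actually usable.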

Comparing Text, JSON, Parquet, and Elasticsearch

Observability and diagnosability require collecting logs from a huge variety of sources, e.g. firewalls, routers, servers, and applications, into a central location for analysis. These logs help in troubleshooting, optimization, and predictive maintenance, all of which benefit from longer timeframes of log data. But the prospect of storing all log data for months or years inevitably results in enormous capacities. And to complicate matters, there are tradeoffs in how best to store and use the log data in a cost-effective manner.

I will contrast six different open-source storage formats and quantify their tradeoffs between…

Confluent recently announced the general availability of Tiered Storage in the Confluent Platform 6.0 release, a new feature for managing and simplifying storage in large Kafka deployments. As this feature has been in tech preview, I have been able to test the solution with an on-prem object store, FlashBlade.

Kafka provides a cornerstone functionality for any data pipeline: the ability to reliably pass data from one service or place to another. Kafka brings the advantages of microservice architectures to data engineering and data science projects. There are always more data sources to ingest and longer retention periods to increase pipeline…

One of the biggest challenges in making use of data is avoiding the complexity of making N copies of a dataset to be used with N different analytics systems. Dataset sprawl and application silos are quicksand for productivity. Meanwhile, the backups of your important tables are sitting idle, so perhaps they can do more? This post describes how PrestoSQL can help you leverage your backup dataset to enable dev/test and analytics too.

PrestoSQL is a distributed query engine capable of bringing SQL analytics to a wide variety of data stores, including an S3 object store like Pure FlashBlade. …

Monitoring infrastructure is essential for keeping production workloads healthy and debugging issues when things go wrong. Observability is essential for troubleshooting.

The goal of this post is to learn how to quickly and easily deploy a minimal configuration open-source Prometheus and Grafana monitoring infrastructure in Kubernetes.

The full YAML for the examples discussed can be found in the GitHub repo.

The Prometheus ecosystem continues to improve; the Prometheus operator and associated bundled project, while promising, are still in beta and improving their usability. Docker containers make these applications particularly easy to run and configure, and Kubernetes adds resilience.

Many scale-out data tools, like NoSQL databases, expand cluster capacity by either adding new nodes or new drives within each node. These applications were built for direct-attached storage, where adding storage to a node was labor intensive and space-limited. Modern architectures disaggregate compute and storage, making it easy to scale out and increase available storage when needed.

Modernizing these scale-out analytics applications to a Kubernetes environment (example: ECK operator) requires pre-allocated storage for each pod. But in Kubernetes, pods and containers are usually ephemeral and not tied to a specific physical node. Requiring local storage for a pod restricts Kubernetes’ ability…

Benchmarking helps to build an understanding of your underlying infrastructure and validate correctly configured environments. Though benchmarks never fully reflect real workloads and user experiences, a well-designed set of storage benchmarks is still useful for “burn-in” testing and to stress-test the system with additional load.

But if your team is all-in on Kubernetes, do you need to start from scratch and create new benchmarking tools? Fortunately, it is straightforward to use existing benchmark tools, like fio (Flexible I/O Tester), along with native Kubernetes PersistentVolumes and Container Storage Interface (CSI) provisioners to test your storage.

I will describe two different…

Joshua Robinson
