Cloud-native applications must often co-exist with legacy applications. Those legacy applications are hardened and just work, so rewriting can seem hardly worth the trouble.

For legacy applications to take advantage of new technology requires bridges, and fuse clients for object storage are a bridge that allow most (but not all) applications that expect to read and write files to work in the new world of object storage.

I will focus on three different implementations of a fuse-based filesystem on top of object storage, s3fs, goofys, and rclone. …

Kafka and Elasticsearch Pipelines Made Easier with Kubernetes and Object Storage

Collecting and indexing logs from servers, applications, and devices enables crucial visibility into running systems. A log analytics pipeline allows teams to debug and troubleshoot issues, track historical trends, or investigate security incidents. The most commonly deployed pipeline combines Kafka and Elasticsearch to create a reliable, scalable, and performant system to ingest and query data. The time it takes to deploy a new log pipeline is a key factor in if a new data project will be successful. But both applications traditionally use converged infrastructure (similar to HDFS) which…

A Quick and Easy Tool with No Dependencies

Spent a few hours trying to debug why Apache Spark on FlashBlade is slower than expected only to realize you have an underlying networking issue?

Flashblade-plumbing is a tool to validate NFS and S3 read/write performance from a single client to a FlashBlade with minimal dependencies and input required. The only inputs required are the FlashBlade’s management IP and login token and after a few minutes will output the read and write throughputs for both NFS and S3. …

RESTful interfaces make interacting with modern applications easier for other applications. “Easy” means you can use a variety of languages and tools to build scripts and automation around the application and that REST API, instead of relying only on manual clicks and a GUI.

You may already be familiar with curl for downloading files from websites. But it is also a flexible and convenient REST client! This post will walk through the curl options and invocations I find most useful.

The big advantage of curl is that it’s often already available on your machine or easily installed on a wide…

Dremio is a rapidly growing scale-out analytics engine that is part of a new generation of data lake services, emphasizing the power and flexibility of disaggregated storage like FlashBlade. Two performance-oriented technologies power Dremio queries: Apache Arrow, a quick in-memory data format, and Gandiva, a high-performance toolset for querying Arrow data.

A key advantage of using Dremio is the ability to query data in-place on FlashBlade, using either NFS or S3, instead of needing to import or copy it. …

In this post, I look at what it takes to list all keys in a single bucket with 67 billion objects and build a simple list benchmark program in Golang. This parallelized program is able to list keys at a rate of 430k per second using a small FlashBlade as the S3 backend.

67 billion objects may sound like an unreasonably high number. It is always possible to build applications to create fewer objects by coalescing and aggregating data. But sometimes re-architecting an application after being deployed in production for years of organic growth is challenging. …

Comparing Text, JSON, Parquet, and Elasticsearch

Observability and diagnosability require collecting logs from a huge variety of sources, e.g. firewalls, routers, servers, and applications, into a central location for analysis. These logs help in troubleshooting, optimization, and predictive maintenance; all of which benefit from longer timeframes of log data. But the prospect of storing all log data for months or years inevitably results in enormous capacities. And to complicate matters, there are tradeoffs about how to best store and use the log data in a cost-effective manner.

I will contrast six different open-source storage formats and quantify their tradeoffs between…

Confluent recently announced the general availability of Tiered Storage in the Confluent Platform 6.0 release, a new feature for managing and simplifying storage in large Kafka deployments. As this feature has been in tech preview, I have been able to test the solution with an on-prem object store, FlashBlade.

Kafka provides a cornerstone functionality for any data pipeline: the ability to reliably pass data from one service or place to another. Kafka brings the advantages of microservice architectures to data engineering and data science projects. There are always more data sources to ingest and longer retention periods to increase pipeline…

One of the biggest challenges in making use of data is avoiding the complexity of making N copies of a dataset to be used with N different analytics systems. Dataset sprawl and application silos are quicksand for productivity. And the backups of your important tables are sitting idle so perhaps you can do more? This post describes how PrestoSql can help you leverage your backup dataset to enable dev/test and analytics too.

Prestosql is a distributed query engine capable of bringing SQL analytics to a wide variety of data stores, including an S3 object store like Pure FlashBlade. …

Monitoring infrastructure is essential for keeping production workloads healthy and debugging issues when things go wrong. Observability is essential for troubleshooting.

The goal of this post is to learn how to quickly and easily deploy a minimal configuration open-source Prometheus and Grafana monitoring infrastructure in Kubernetes.

The full yaml for the examples discussed can be found on the github repo here.

The Prometheus ecosystem continues to improve; the Prometheus operator and associated bundled project, while promising, are still in beta and improving their usability. Docker containers makes these applications particularly easy to run and configure, and Kubernetes adds additional resilience.

Joshua Robinson

Data science, software engineering, hacking

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store