Joshua Robinson

351 Followers

Oct 3, 2022

Table Formats Make the Data Lake Better

The “Data Lakehouse” architecture combines the benefits of data warehouses with the flexibility and scalability of cloud data lakes. But what is a “Lakehouse” exactly, other than a chameleon-like marketing term for vendors’ offerings (1, 2, 3, 4, 5, 6, 7)? To better understand why there is excitement behind Lakehouses…

Data Lakehouse

6 min read


Mar 31, 2022

Improving Python S3 Client Performance with Rust

Replacing Boto3 for Fun and Profit

Python is the de facto language for data science because of its ease of use and performance. But performance comes only because libraries like NumPy offload computation-heavy functions, like matrix multiplication, to optimized C code. Data science tooling and workflows continue to improve, datasets get…

S3

7 min read


Feb 9, 2022

Concurrent Programming Case Study: S3 Metadata Requests

Comparing Python, Go, and Rust

Recently a FlashBlade customer had a challenge with listing all custom metadata on objects in a large bucket. A standard S3 LIST request does not return custom metadata, therefore the task requires also issuing a HEAD request for each object. With millions of objects in…

Flashblade

12 min read


Jan 26, 2022

Faster Data Loading for Pandas on S3

20x Improvement Loading CSV from FlashBlade S3

Pandas is a powerful tool for data exploration and analysis, leveraging Python’s ease of use with the optimized execution of NumPy’s arithmetic libraries. But there are three common shortcomings of Pandas: 1) slow IO from object storage, 2) single-threaded execution, and 3) requirement…

Pandas

11 min read


Nov 15, 2021

Thawing the Elasticsearch Frozen Tier

Simplify Elastic Operations without Compromising Data Searchability using FlashBlade

Scaling log analytics to petabyte scale is hard. Regardless of tool, deploying applications larger than their original design inevitably brings scaling complexity too. At the infrastructure level, the traditional direct-attached storage model (e.g., Hadoop) means that growing capacity requires higher node counts and therefore growing complexity. Each new node brings…

Elastic

11 min read


Sep 14, 2021

Object Storage via Fuse Filesystems

Cloud-native applications must often co-exist with legacy applications. Those legacy applications are hardened and just work, so rewriting can seem hardly worth the trouble. Letting legacy applications take advantage of new technology requires bridges, and FUSE clients for object storage are a bridge that allows most (but not all)…

S3

12 min read


May 25, 2021

Log Analytics Pipelines as-a-Service

Kafka and Elasticsearch Pipelines Made Easier with Kubernetes and Object Storage

Collecting and indexing logs from servers, applications, and devices enables crucial visibility into running systems. A log analytics pipeline allows teams to debug and troubleshoot issues, track historical trends, or investigate security incidents. The most commonly deployed pipeline combines…

Elasticsearch

11 min read


Mar 9, 2021

FlashBlade Network Plumbing Validation

A Quick and Easy Tool with No Dependencies

Spent a few hours trying to debug why Apache Spark on FlashBlade is slower than expected only to realize you have an underlying networking issue? Flashblade-plumbing is a tool to validate NFS and S3 read/write performance from a single client to a…

Flashblade

6 min read


Feb 24, 2021

Tips for Using Curl as a REST Client

RESTful interfaces make interacting with modern applications easier for other applications. “Easy” means you can use a variety of languages and tools to build scripts and automation around the application and that REST API, instead of relying only on manual clicks and a GUI. You may already be familiar with…

Curl

5 min read


Jan 26, 2021

Modernizing SQL Analytics: Dremio and FlashBlade

Dremio is a rapidly growing scale-out analytics engine that is part of a new generation of data lake services, emphasizing the power and flexibility of disaggregated storage like FlashBlade. …

Flashblade

10 min read

Joshua Robinson

Data science, software engineering, hacking
