Simplify Kafka at Scale with Confluent Tiered Storage

Kafka Tiered Storage and FlashBlade

Kafka Architecture with Tiered Storage and FlashBlade

How Tiered Storage Works

Overview of How Kafka Stores Data

Tiered Storage In-Depth

Illustration of how Kafka log segments are offloaded to Tiered Storage

Understanding Tiered Storage IO Workloads

Storage

Network

networking_multiplier = replication_factor + 1 + num_consumer_groups
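As a rough illustration of the formula above, the sketch below turns a producer ingest rate into an estimate of total network traffic. Applying the multiplier to the ingest rate, the function names, the MB/s unit, and the sample numbers are all illustrative assumptions, not measurements from this post.

```python
def networking_multiplier(replication_factor: int, num_consumer_groups: int) -> int:
    """Multiplier from the formula above: replication_factor + 1 + num_consumer_groups."""
    return replication_factor + 1 + num_consumer_groups


def total_network_mb_per_s(ingest_mb_per_s: float,
                           replication_factor: int,
                           num_consumer_groups: int) -> float:
    """Estimated total network traffic for a given ingest rate.

    Assumption: the multiplier is applied directly to the producer ingest rate.
    """
    return ingest_mb_per_s * networking_multiplier(replication_factor, num_consumer_groups)


if __name__ == "__main__":
    # Hypothetical workload: 100 MB/s ingest, replication factor 3, 2 consumer groups.
    print(total_network_mb_per_s(100.0, 3, 2))  # 600.0
```

With these hypothetical inputs the multiplier is 6, so 100 MB/s of ingest corresponds to roughly 600 MB/s of total network traffic.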

Objects Created

$ s5cmd --endpoint-url http://$FB_IP ls s3://kafka/*
2020/10/02 05:58:54 8 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.epoch-state
2020/10/02 05:58:54 179288 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.offset-index
2020/10/02 05:58:54 10 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.producer-state
2020/10/02 05:58:54 104857556 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.segment
2020/10/02 05:58:54 219744 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.timestamp-index
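To see how the bytes are distributed across these objects, the listing above can be summed per file suffix. This is a minimal sketch that assumes the four-column layout shown (date, time, size in bytes, key); the helper name `sizes_by_suffix` is hypothetical.

```python
# The listing shown above, copied verbatim.
LISTING = """\
2020/10/02 05:58:54 8 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.epoch-state
2020/10/02 05:58:54 179288 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.offset-index
2020/10/02 05:58:54 10 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.producer-state
2020/10/02 05:58:54 104857556 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.segment
2020/10/02 05:58:54 219744 0/-0GFCOcFSDGD-p37slsw/84/00000000000090550151_0_v0.timestamp-index
"""


def sizes_by_suffix(listing: str) -> dict:
    """Sum object sizes in bytes, grouped by the suffix after the last dot in the key."""
    sizes = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        # Columns: date, time, size, key.
        _date, _time, size, key = line.split(maxsplit=3)
        suffix = key.rsplit(".", 1)[1]
        sizes[suffix] = sizes.get(suffix, 0) + int(size)
    return sizes


if __name__ == "__main__":
    sizes = sizes_by_suffix(LISTING)
    total = sum(sizes.values())
    for suffix, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{suffix:16s} {size:>12d} bytes  {size / total:.4%}")
```

For this listing, the `.segment` object accounts for about 99.6% of the bytes written; the index and state objects are small by comparison.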

Deploying Tiered Storage with FlashBlade

Configuring Tiered Storage With FlashBlade

CONFLUENT_TIER_FEATURE=true
CONFLUENT_TIER_ENABLE=true
CONFLUENT_TIER_BACKEND=S3
CONFLUENT_TIER_S3_BUCKET=<BUCKET_NAME>
CONFLUENT_TIER_S3_REGION=<REGION>
CONFLUENT_TIER_S3_AWS_ENDPOINT_OVERRIDE=${ENDPOINT}

# Example endpoint override:
CONFLUENT_TIER_S3_AWS_ENDPOINT_OVERRIDE=http://10.62.64.200

# S3 credentials:
AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
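
# Create a Kubernetes secret holding the keys:
kubectl create secret generic my-s3-keys --from-literal=AWS_ACCESS_KEY_ID="$ACCESS" --from-literal=AWS_SECRET_ACCESS_KEY="$SECRET"

# Reference the secret from the container spec.
# Each "key" must match a key name used when the secret was created.
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: my-s3-keys
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: my-s3-keys
        key: AWS_SECRET_ACCESS_KEY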

# Install the FlashBlade Ansible collection:
ansible-galaxy collection install purestorage.flashblade

# Playbook that creates the bucket.
# "hosts: localhost" is an assumption: the play runs on the control node.
- hosts: localhost
  collections:
    - purestorage.flashblade
  tasks:
    - name: Create Bucket
      purefb_bucket:
        name: "{{ BUCKET }}"
        account: "{{ ACCOUNT }}"
        fb_url: "{{ FB_MGMTVIP }}"
        api_token: "{{ FB_TOKEN }}"
        state: present

KAFKA_LOG_SEGMENT_BYTES=104857600
CONFLUENT_TIER_ARCHIVER_NUM_THREADS=8
CONFLUENT_TIER_FETCHER_NUM_THREADS=16
CONFLUENT_TIER_LOCAL_HOTSET_MS=0

Experimental Results

Experimental Testbed Configuration

Historical Query Test

Rebalance Test

max_rebalance_time = (segment_size * num_partitions) / rebalance_throttle
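The bound above is easy to evaluate for a given cluster. The sketch below assumes sizes in bytes and a throttle in bytes per second; the function name, the partition count, and the throttle value are hypothetical, while the 100 MiB segment size matches the `KAFKA_LOG_SEGMENT_BYTES=104857600` setting shown earlier.

```python
MIB = 1024 * 1024


def max_rebalance_time_s(segment_size_bytes: int,
                         num_partitions: int,
                         rebalance_throttle_bytes_per_s: float) -> float:
    """Upper bound from the formula above, in seconds.

    Assumption: sizes are in bytes and the throttle is in bytes per second.
    """
    if rebalance_throttle_bytes_per_s <= 0:
        raise ValueError("rebalance throttle must be positive")
    return (segment_size_bytes * num_partitions) / rebalance_throttle_bytes_per_s


if __name__ == "__main__":
    # Hypothetical example: 100 MiB segments, 1000 partitions to move,
    # and a 100 MiB/s rebalance throttle.
    print(max_rebalance_time_s(100 * MIB, 1000, 100 * MIB))  # 1000.0
```

With these inputs the bound is 1000 seconds. Note that it depends on the segment size and partition count, not on the total amount of data stored in the topic.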

Data Pipeline: Kafka and Elasticsearch

filebeat.inputs:
- type: kafka
  hosts:
    - confluentkafka-0:9092
    - confluentkafka-1:9092
    - confluentkafka-2:9092
  topics: ["flog"]
  group_id: "flogbeats"

setup.template.settings:
  index.number_of_shards: 36
  index.number_of_replicas: 0
  index.refresh_interval: 30s

output.elasticsearch:
  hosts: '${ELASTICSEARCH_HOST}:9200'
  worker: 2
  bulk_max_size: 4096

Summary

Joshua Robinson
