Presto-Powered S3 Data Warehouse on Kubernetes


S3 Data Warehouse Logical Application Architecture

Component 1: Hive Metastore

The Metastore RDBMS: MariaDB

Init-schemas Task

Hive Metastore

> kubectl create configmap metastore-cfg --dry-run --from-file=metastore-site.xml --from-file=core-site.xml -o yaml | kubectl apply -f -
- name: metastore-cfg-vol
mountPath: /opt/hive-metastore/conf/metastore-site.xml
subPath: metastore-site.xml
- name: metastore-cfg-vol
mountPath: /opt/hadoop/etc/hadoop/core-site.xml
subPath: core-site.xml

- name: metastore-cfg-vol
name: metastore-cfg
> kubectl create secret generic my-s3-keys --from-literal=access-key=’XXXXXXX’ --from-literal=secret-key=’YYYYYYY’
name: my-s3-keys
key: access-key
name: my-s3-keys
key: secret-key

Component 2: Presto

Presto Cluster

  • The Dockerfile itself contains a few configuration files which are common to all workers and unlikely to be changed across different environments. For example, dev, QA, and production clusters all share the same image and configurations.
  • Similar to the Hive Metastore service, I created a script to push configuration files into a configmap, allowing me to store all artifacts under source control and potentially have different configurations in different places. I could even manually patch the configmap if necessary.
  • Kubernetes secrets for S3 access and secret keys, either injected as environment variables or combined with the script mentioned next. This approach is ideal for sensitive data, like S3 keys or passwords, because it reduces the possibility of accidental exposure.
  • I created a custom entrypoint script This script combines template files with the hostname and access keys from environment variables to create the final config files at pod startup time. I created this script because no single configuration method described previously fit the requirement and so scripting provided more flexibility.
cp /opt/presto-server/etc/ /opt/presto-server/etc/
echo “$HOSTNAME” >> /opt/presto-server/etc/
> kubectl rollout restart deployment.apps/presto-coordinator && kubectl rollout restart deployment.apps/presto-worker
> kubectl scale --replicas=5 deployment.apps/presto-worker

Managing Presto

> kubectl port-forward --address service/presto 8080
> kubectl exec -it pod/presto-cli /opt/presto-cli -- --server presto:8080 --catalog hive --schema default
> kubectl exec -it pod/presto-cli /opt/presto-cli -- --server presto:8080 --catalog hive --schema default --execute “SHOW TABLES;”

Example: Create a Table on S3

presto:default> SHOW SCHEMAS FROM tpcds;

presto:default> SHOW TABLES FROM tpcds.sf1000;
presto:default> CREATE SCHEMA hive.tpcds WITH (location = ‘s3a://joshuarobinson/warehouse/tpcds/’);presto:default> CREATE TABLE tpcds.store_sales AS SELECT * FROM tpcds.sf100.store_sales;
#!/bin/bashSCALE=1000 # Scale factorsql_exec() {
kubectl exec -it pod/presto-cli /opt/presto-cli -- --server presto:8080 --catalog hive --execute “$1”
declare TABLES=”$(sql_exec “SHOW TABLES FROM tpcds.sf1;” | sed s/\”//g)”sql_exec “CREATE SCHEMA hive.tpcds WITH (location = ‘s3a://joshuarobinson/warehouse/tpcds/’);”for tab in $TABLES; do
sql_exec “CREATE TABLE tpcds.$tab AS SELECT * FROM tpcds.sf$SCALE.$tab;”
presto:tpcds> SELECT COUNT(*) FROM store_sales;

Component 3: Redash

> kubectl exec -it pod/redash-5ddb586ff5-vsz4d -c server /app/bin/docker-entrypoint create_db
> kubectl port-forward --address service/redash 5000





Data science, software engineering, hacking

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Product Configuration in Odoo 13

CS371p Fall 2020 Week 12: Darshan Bhatta

Prepare and link bind Oracle Database Cloud, Application Container Cloud, Application Container…

Codility algorithm practice Lesson 3: Time Complexity, Task 1: Frog Jump — a Python approach

Glass Orb with Patterns

Codesmith Is Like Hogwarts

Learning from Open Source (part 1)

A look at the AWS::CloudFormation::StackSet Resource

Practical Finance for Software Engineering: Part 3 - Understanding Efficiency

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Joshua Robinson

Joshua Robinson

Data science, software engineering, hacking

More from Medium

How to Analyze Prometheus Alertmanager Alerts Using S3, Athena and CloudFormation

How to Connect Elastic Sink Connector with Kafka

What to remember if you decide to ingest logs using logging agent in Google Cloud

First Anniversary Celebration of Apache DolphinScheduler’s Graduation From ASF Incubator!