Presto-Powered S3 Data Warehouse on Kubernetes

Objectives

S3 Data Warehouse Logical Application Architecture

Component 1: Hive Metastore

The Metastore RDBMS: MariaDB
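The article does not show the MariaDB preparation step here, so the following is a hedged sketch of how the Metastore's backing database and user might be created. The pod name (`mariadb`), database name (`metastore_db`), and the `hive` user credentials are assumptions for illustration only; adjust them to your deployment.

```shell
# Assumed pod name and credentials; adjust for your deployment.
# Creates the backing database and a dedicated user for the Hive Metastore.
kubectl exec -it pod/mariadb -- \
  mysql -u root -p"$MYSQL_ROOT_PASSWORD" -e "
    CREATE DATABASE IF NOT EXISTS metastore_db;
    CREATE USER IF NOT EXISTS 'hive'@'%' IDENTIFIED BY 'hivepassword';
    GRANT ALL PRIVILEGES ON metastore_db.* TO 'hive'@'%';
    FLUSH PRIVILEGES;"
```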

Init-schemas Task
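Before the Metastore can serve requests, its table schema must exist in the MariaDB backend. A minimal sketch of the init-schemas step, assuming the standalone Hive Metastore is installed under /opt/hive-metastore (the path used elsewhere in this article):

```shell
# Run once (e.g., from a one-off Kubernetes Job) against the empty database.
# schematool creates all Metastore tables for the chosen backend; -dbType mysql
# also covers MariaDB. Connection details are read from metastore-site.xml.
/opt/hive-metastore/bin/schematool -initSchema -dbType mysql
```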

Hive Metastore

> kubectl create configmap metastore-cfg --dry-run=client --from-file=metastore-site.xml --from-file=core-site.xml -o yaml | kubectl apply -f -
  volumeMounts:
    - name: metastore-cfg-vol
      mountPath: /opt/hive-metastore/conf/metastore-site.xml
      subPath: metastore-site.xml
    - name: metastore-cfg-vol
      mountPath: /opt/hadoop/etc/hadoop/core-site.xml
      subPath: core-site.xml

  volumes:
    - name: metastore-cfg-vol
      configMap:
        name: metastore-cfg
> kubectl create secret generic my-s3-keys --from-literal=access-key='XXXXXXX' --from-literal=secret-key='YYYYYYY'
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: my-s3-keys
          key: access-key
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: my-s3-keys
          key: secret-key
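To confirm the secret was stored correctly before wiring it into pods, the stored value can be decoded back out. This is a read-only check using only the `my-s3-keys` secret created above:

```shell
# Decode the stored access key from the my-s3-keys secret to verify it.
kubectl get secret my-s3-keys -o jsonpath='{.data.access-key}' | base64 --decode
```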

Component 2: Presto

Presto Cluster

  • The Dockerfile itself contains a few configuration files that are common to all workers and unlikely to change across environments. For example, dev, QA, and production clusters all share the same image and configurations.
  • Similar to the Hive Metastore service, I created a script to push configuration files into a configmap, allowing me to keep all artifacts under source control and to use different configurations in different environments. I can even manually patch the configmap if necessary.
  • Kubernetes secrets hold the S3 access and secret keys, either injected as environment variables or combined with the script described next. This approach is ideal for sensitive data, like S3 keys or passwords, because it reduces the possibility of accidental exposure.
  • I created a custom entrypoint script, autoconfig_and_launch.sh, which combines template files with the hostname and access keys from environment variables to create the final config files at pod startup. I wrote this script because no single configuration method described above fit the requirement, so scripting provided more flexibility.
cp /opt/presto-server/etc/node.properties.template /opt/presto-server/etc/node.properties
echo "node.id=$HOSTNAME" >> /opt/presto-server/etc/node.properties
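Putting those pieces together, a fuller sketch of what autoconfig_and_launch.sh could look like follows. Only the node.properties lines above come from the article; the template filenames, the HIVE_S3_ACCESS/HIVE_S3_SECRET placeholder strings, and the catalog path are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of a Presto pod entrypoint: render per-pod config from templates,
# then launch the server. Placeholder names below are illustrative.

# Unique node id derived from the pod hostname.
cp /opt/presto-server/etc/node.properties.template /opt/presto-server/etc/node.properties
echo "node.id=$HOSTNAME" >> /opt/presto-server/etc/node.properties

# Inject S3 credentials from the environment (populated by the Kubernetes
# secret) into the Hive catalog configuration.
sed -e "s|HIVE_S3_ACCESS|${AWS_ACCESS_KEY_ID}|" \
    -e "s|HIVE_S3_SECRET|${AWS_SECRET_ACCESS_KEY}|" \
    /opt/presto-server/etc/catalog/hive.properties.template \
    > /opt/presto-server/etc/catalog/hive.properties

# exec so Presto runs as PID 1 and receives SIGTERM directly on pod shutdown.
exec /opt/presto-server/bin/launcher run
```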
> kubectl rollout restart deployment.apps/presto-coordinator && kubectl rollout restart deployment.apps/presto-worker
> kubectl scale --replicas=5 deployment.apps/presto-worker

Managing Presto

> kubectl port-forward --address 0.0.0.0 service/presto 8080
> kubectl exec -it pod/presto-cli -- /opt/presto-cli --server presto:8080 --catalog hive --schema default
> kubectl exec -it pod/presto-cli -- /opt/presto-cli --server presto:8080 --catalog hive --schema default --execute "SHOW TABLES;"

Example: Create a Table on S3

presto:default> SHOW SCHEMAS FROM tpcds;

presto:default> SHOW TABLES FROM tpcds.sf1000;
presto:default> CREATE SCHEMA hive.tpcds WITH (location = 's3a://joshuarobinson/warehouse/tpcds/');
presto:default> CREATE TABLE tpcds.store_sales AS SELECT * FROM tpcds.sf100.store_sales;
#!/bin/bash
SCALE=1000 # Scale factor

sql_exec() {
  kubectl exec -it pod/presto-cli -- /opt/presto-cli --server presto:8080 --catalog hive --execute "$1"
}

declare TABLES="$(sql_exec "SHOW TABLES FROM tpcds.sf1;" | sed s/\"//g)"

sql_exec "CREATE SCHEMA hive.tpcds WITH (location = 's3a://joshuarobinson/warehouse/tpcds/');"

for tab in $TABLES; do
  sql_exec "CREATE TABLE tpcds.$tab AS SELECT * FROM tpcds.sf$SCALE.$tab;"
done
presto:tpcds> SELECT COUNT(*) FROM store_sales;

Component 3: Redash

> kubectl exec -it pod/redash-5ddb586ff5-vsz4d -c server -- /app/bin/docker-entrypoint create_db
> kubectl port-forward --address 0.0.0.0 service/redash 5000

Summary
