Presto on FlashBlade S3

Configuration

Component 1: RDBMS for Metastore

docker volume create --driver=pure -o size=1TiB \
-o volume_label_selector=”purestorage.com/backend=block” \
metastore-vol
docker run --name metastoremaria -d --rm \
-v metastore-vol:/var/lib/mysql \
-e MYSQL_ROOT_PASSWORD=mypass \
-p 13306:3306 \
mariadb/server:latest
docker volume create metastore-vol

Component 2: Hive Metastore Service

  • javax.jdo.option.ConnectionDriverName
  • javax.jdo.option.ConnectionURL
  • javax.jdo.option.ConnectionUserName
  • javax.jdo.option.ConnectionPassword
  • fs.s3a.access.key
  • fs.s3a.secret.key
  • fs.s3a.connection.ssl.enabled = false (recommended)
  • fs.s3a.endpoint = {FlashBlade data VIP}
docker run -it --rm --name initschema hivemetastore \
/opt/hive-metastore/bin/schematool --verbose -initSchema \
-dbType mysql -userName root -passWord mypass \
-url jdbc:mysql://$HOSTNAME:13306/metastore_db?createDatabaseIfNotExist=true
docker run -d --rm --name=hivemeta -p 9083:9083 hivemetastore \
/opt/hive-metastore/bin/start-metastore

Component 3: Presto Servers

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://$MASTERHOST:8080
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://$MASTERHOST:8080
node.environment=test
node.id=nodeXX
node.data-dir=/var/presto/data
for i in {01..20}; do
NODEIP=$(dig +short myhost$i.dev.mydomain.com)
cp node.properties.template node.properties.$NODEIP
echo “node.id=node$i” >> node.properties.$NODEIP
done
connector.name=hive-hadoop2
hive.metastore.uri=thrift://$METASTORE:9083
hive.s3.aws-access-key=XXXXXXXXX
hive.s3.aws-secret-key=YYYYYYYYYYYYYY
hive.s3.endpoint=$DATAVIP
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
docker run -d --rm --net=host \
--name=presto \
-v ${PWD}/node.properties.$(hostname -i):/opt/presto-server/etc/node.properties \
-v ${PWD}/config.properties.coordinator:/opt/presto-server/etc/config.properties \
$IMGNAME \
/opt/presto-server/bin/launcher run
presto-cli --server localhost:8080 --catalog hive --schema default

Experimental Setup

> CREATE TABLE pagecounts (code varchar, title varchar, count bigint, pgsize bigint) WITH (format = ‘textfile’, external_location = ‘s3a://BUCKETNAME/wiki_pagecounts/2016–08/’);
for i in pagecounts*; do sed -i ‘s/ /\x01/g’ $i; done
presto:default> SELECT * FROM system.runtime.nodes;
node_id | http_uri | node_version | coordinator | state
— — — — -+ — — — — — — — — — — — — — -+ — — — — — — — + — — — — — — -+ — — — —
node23 | http://10.62.185.120:8080 | 0.219 | false | active
node24 | http://10.62.185.121:8080 | 0.219 | false | active
node1 | http://10.62.205.205:8080 | 0.219 | true | active
node25 | http://10.62.185.122:8080 | 0.219 | false | active
(4 rows)

Testbed Configurations

+---------------------------------------------------------------+
| | Environment | Compute Node | Storage |
+---------------------------------------------------------------+
| on-prem/FB | on-premises | 32 core, 60GiB mem | FlashBlade S3|
| c3.8xlarge | public cloud | 32 core, 60GiB mem | AWS S3 |
| m3.2xlarge | public cloud | 16 core, 30GiB mem | AWS S3 |
+---------------------------------------------------------------+
presto:default> SELECT COUNT(*) from pagecounts;

Performance Results

In-Memory Tables

connector.name=memory
memory.max-data-per-node=8GB
> CREATE TABLE memory.default.pagecounts AS SELECT * FROM pagecounts;
> SELECT AVG(pgsize) FROM memory.default.pagecounts;

Summary

--

--

--

Data science, software engineering, hacking

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Introduction to Tilemap in Unity

SVG on the Web

Making Agile At Scale A Reality

Public Deploy of Test Network Nodes has been Through the First Phase; 21 Quality Nodes are Selected

Why Cloud Native is a must for Agile Development

Kotlin RxJava Retrofit MVVM Unit Test Cases

ways to lookup table in Spark scala

8 iOS App Development Principles to Know

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Joshua Robinson

Joshua Robinson

Data science, software engineering, hacking

More from Medium

The journey of million(s) devices on AWS: Kafka, Streaming, and Scaling

Building Event Driven Applications with Apache Flink, Apache Kafka and Amazon EMR — Part 2

Three scenarios and five optimizations of Apache DolphinScheduler in XWBank for real-time, offline…

Discover, Catalog and Share your Streams on Amazon MSK as Data Products