Apache Spark on FlashBlade Part 1

Introduction

Running Spark against shared, disaggregated storage like FlashBlade makes it possible to:

  • Simplify cluster management and maintenance by separating compute and storage.
  • Quickly scale either tier independently by adding just the compute or storage resources needed.
  • Upgrade applications easily and independently, with the ability to run and test multiple software versions against the same storage and dataset.
  • Consolidate multiple applications (not just Spark) onto the same storage infrastructure instead of maintaining separate silos of data.

Goals

  • Configuration required to run Spark Standalone with NFS and S3
  • Examples of automating common Spark cluster tasks
  • Performance testing with a simple machine learning workload

Spark without Hadoop!

Configuring Spark Access to Storage

Configuring NFS

Step 1: Create filesystem

pureuser@flashblade:~$ purefs create --size 100T datahub

pureuser@flashblade:~$ purefs add --protocol nfs datahub
Name     Size  Used  Created                  Protocols
datahub  100T  0.00  2019-01-03 03:38:53 PDT  nfs

Step 2: Mount filesystem

Add the following line to /etc/fstab on each Spark node:

flashblade01.foo.com:/datahub /datahub nfs rw,nfsvers=3,intr,_netdev 0 0
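
Then create the mountpoint and mount it on every node (a minimal sketch; the path matches the fstab entry above):

sudo mkdir -p /datahub
sudo mount /datahub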

Step 3: Access from Spark

val input = sc.textFile("file:/datahub/input-dir/")
// With no URI scheme, "file:/" is inferred.
outputRDD.saveAsSequenceFile("/datahub/output-dir")
ir@joshua:~$ ls -l /datahub/Spark-Output/
total 73549081
-rw-r--r-- 1 ir ir 1506358621 Sep 12 13:30 part-00000
-rw-r--r-- 1 ir ir 1505915945 Sep 12 13:30 part-00001
-rw-r--r-- 1 ir ir 1506723117 Sep 12 13:30 part-00002
-rw-r--r-- 1 ir ir 1506507232 Sep 12 13:30 part-00003
-rw-r--r-- 1 ir ir 1505983628 Sep 12 13:30 part-00004
...
-rw-r--r-- 1 ir ir 1506270906 Sep 12 13:30 part-00048
-rw-r--r-- 1 ir ir 1506032424 Sep 12 13:30 part-00049
-rw-r--r-- 1 ir ir          0 Sep 12 13:30 _SUCCESS
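
As a fuller sketch, a simple word count can read and write entirely through the NFS mount (the paths and RDD pipeline here are illustrative):

val counts = sc.textFile("file:/datahub/input-dir/")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
counts.saveAsSequenceFile("/datahub/output-dir")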

Configuring S3

Step 1: Add AWS S3A Jars to Classpath

> wget -P /tmp https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xf /tmp/hadoop-2.7.3.tar.gz \
    hadoop-2.7.3/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar \
    hadoop-2.7.3/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar
> mv hadoop-2.7.3/share/hadoop/tools/lib/*.jar spark-$VERSION/jars/
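
Alternatively, if the nodes can reach Maven Central at launch time, the same jars can be pulled automatically instead of copied into place (a sketch; hadoop-aws 2.7.3 brings in the matching aws-java-sdk 1.7.4 as a transitive dependency):

> ./bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3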

Step 2: Configure FlashBlade Object Store

pureuser@irp210-c01-ch1-fm2> pureobjaccount create datascience
Name         Created
datascience  2018-06-29 06:55:43 PDT

pureuser@irp210-c01-ch1-fm2> pureobjuser create datascience/spark-user
Name                    Access Key ID  Created
datascience/spark-user  -              2018-06-29 06:56:44 PDT

pureuser@irp210-c01-ch1-fm2> pureobjuser access-key create --user datascience/spark-user
Access Key ID     Enabled  Secret Access Key                   User
PSFBIAZFDDAAGEFK  True     254300033/a3e2/DF81E2...HLNAGMLEHO  datascience/spark-user

pureuser@irp210-c01-ch1-fm2> purebucket create --account datascience datahub
Name     Account      Used  Created                  Time Remaining
datahub  datascience  0.00  2019-01-09 05:42:06 PST  -

Step 3: Configure S3A

Option 1: Config File

These properties go in conf/spark-defaults.conf; the spark.hadoop. prefix passes each setting through to the underlying S3A filesystem client:

spark.hadoop.fs.s3a.access.key=YYYYYYYY
spark.hadoop.fs.s3a.secret.key=XXXXXXXX
spark.hadoop.fs.s3a.endpoint=DATA_VIP
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
spark.hadoop.fs.s3a.fast.upload=true

Option 2: Environment Variables

The S3A credential chain also picks up standard AWS credentials from the environment; the endpoint and other settings above still need to be configured separately:

export AWS_ACCESS_KEY_ID=XXXXXXXXX
export AWS_SECRET_ACCESS_KEY=YYYYYYYYYY

Option 3: Command-line arguments

./bin/spark-submit XXXYYYZZZ \
-Dfs.s3a.endpoint=10.61.59.208 \
-Dfs.s3a.connection.ssl.enabled=false \
-Dmapreduce.fileoutputcommitter.algorithm.version=2 \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.access.key=XXXXXXXXX \
-Dfs.s3a.secret.key=YYYYYYYYYYYYYYY \
s3a://bucketname/input \
s3a://bucketname/output
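
Equivalently, the same settings can be passed as Spark configuration with --conf, avoiding Java system properties entirely (a sketch; the application jar and its arguments are elided):

./bin/spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=10.61.59.208 \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.hadoop.fs.s3a.access.key=XXXXXXXXX \
  --conf spark.hadoop.fs.s3a.secret.key=YYYYYYYYYYYYYYY \
  ...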

Step 4: Access from Spark

val input = sc.textFile("s3a://bucketname/input-dir/")
outputRDD.saveAsSequenceFile("s3a://bucketname/output-dir")
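
The same URIs work with the DataFrame API (a hypothetical example; the file formats and paths are illustrative):

val df = spark.read.json("s3a://bucketname/input-dir/")
df.write.parquet("s3a://bucketname/output-parquet/")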

Node Local Storage (Optional)

Case 1: Intermediate Results

Partitions persisted with StorageLevel.DISK_ONLY are written to the worker's local scratch space (the directories set by spark.local.dir or SPARK_LOCAL_DIRS), so fast node-local media still pays off for cached intermediate results:

scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> interm_result.persist(StorageLevel.DISK_ONLY)
res14: interm_result.type = MapPartitionsRDD[2] at map at <console>:26

Case 2: Centralized Logging

Pointing the worker log directories at the shared NFS mount gathers logs from every node in one place; pass these environment variables to each Spark worker container:

-e SPARK_LOG_DIR=/datahub/sparklogs \
-e SPARK_WORKER_DIR=/datahub/sparklogs
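
Put together with the worker launch shown in the next section, the full invocation might look like this (a sketch; $SPARK_IMAGE and $MASTER are assumed to be set, and /datahub is the NFS mount from earlier):

docker run -d --net=host \
  -v /datahub:/datahub \
  -e SPARK_LOG_DIR=/datahub/sparklogs \
  -e SPARK_WORKER_DIR=/datahub/sparklogs \
  $SPARK_IMAGE /opt/spark/sbin/start-slave.sh spark://$MASTER:7077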

Automation Tip: Using the Pure Docker Volume Plugin

docker volume create --driver=pure -o size=1TiB \
  -o volume_label_selector="purestorage.com/backend=file" \
  sparkscratchvol
docker run -d --net=host \
  -v sparkscratchvol:/local \
  -e SPARK_LOCAL_DIRS=/local \
  $SPARK_IMAGE /opt/spark/sbin/start-slave.sh spark://$MASTER:7077
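
When a worker is retired, the scratch volume can be reclaimed along with the container (hypothetical container ID):

docker rm -f <worker-container-id>
docker volume rm sparkscratchvol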
