Apache Spark on FlashBlade Part 1


  • Simplify the management and maintenance of the cluster by separating compute and storage.
  • Quickly scale either tier independently by adding just the compute or storage resources needed.
  • Upgrade applications easily and independently with the ability to run and test multiple software versions on the same storage and dataset.
  • Consolidate multiple applications (not just Spark) onto the same storage infrastructure instead of keeping separate silos of data.


  • Configuration required to run Spark Standalone with NFS and S3
  • Examples of automating common Spark cluster tasks
  • Performance testing with a simple machine learning workload

Spark without Hadoop!

Configuring Spark Access to Storage

Configuring NFS

Step 1: Create filesystem

pureuser@flashblade:~$ purefs create --size 100T datahub

pureuser@flashblade:~$ purefs add --protocol nfs datahub
Name     Size  Used  Created                  Protocols
datahub  100T  0.00  2019-01-03 03:38:53 PDT  nfs

Step 2: Mount filesystem

flashblade01.foo.com:/datahub /datahub nfs rw,nfsvers=3,intr,_netdev 0 0
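
The mount entry above can be generated and checked from a script, which helps when many Spark nodes need the same mount. The following is a minimal sketch, not the author's tooling: the function names `fstab_entry` and `has_fstab_entry` are hypothetical, and the mount options are simply copied from the entry shown above.

```shell
# Print an /etc/fstab line for an NFSv3 mount.
# The option string is copied from the fstab entry shown above.
fstab_entry() {
  local server="$1" export_path="$2" mountpoint="$3"
  printf '%s:%s %s nfs rw,nfsvers=3,intr,_netdev 0 0\n' \
    "$server" "$export_path" "$mountpoint"
}

# Succeed only if the mountpoint already appears as the second field
# of an fstab-style file (defaults to /etc/fstab).
has_fstab_entry() {
  local mountpoint="$1" fstab="${2:-/etc/fstab}"
  awk -v mp="$mountpoint" '$2 == mp { found = 1 } END { exit !found }' "$fstab"
}

# Example: print the entry used in this post.
fstab_entry flashblade01.foo.com /datahub /datahub
```

On a real node, one possible use is to append the line only when `has_fstab_entry /datahub` fails, then run `mount /datahub` as root.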

Step 3: Access from Spark

val input = sc.textFile("file:/datahub/input-dir/")
// With no URI, "file:/" is inferred.

ir@joshua:~$ ls -l /datahub/Spark-Output/
total 73549081
-rw-r--r-- 1 ir ir 1506358621 Sep 12 13:30 part-00000
-rw-r--r-- 1 ir ir 1505915945 Sep 12 13:30 part-00001
-rw-r--r-- 1 ir ir 1506723117 Sep 12 13:30 part-00002
-rw-r--r-- 1 ir ir 1506507232 Sep 12 13:30 part-00003
-rw-r--r-- 1 ir ir 1505983628 Sep 12 13:30 part-00004
-rw-r--r-- 1 ir ir 1506270906 Sep 12 13:30 part-00048
-rw-r--r-- 1 ir ir 1506032424 Sep 12 13:30 part-00049
-rw-r--r-- 1 ir ir 0 Sep 12 13:30 _SUCCESS
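
Because the output lands on a shared filesystem, any node with the mount can check whether a job finished. The sketch below relies on the empty `_SUCCESS` marker visible in the listing above, which the output committer writes when a job commits. The function name `check_spark_output` is hypothetical.

```shell
# Report whether a Spark output directory holds a completed job.
# Returns non-zero when the _SUCCESS marker is absent.
check_spark_output() {
  local dir="$1" count
  if [ ! -e "$dir/_SUCCESS" ]; then
    echo "incomplete: no _SUCCESS marker in $dir"
    return 1
  fi
  # Count only the part files directly inside the directory.
  count=$(find "$dir" -maxdepth 1 -type f -name 'part-*' | wc -l | tr -d ' ')
  echo "complete: $count part files in $dir"
}
```

For the directory listed above, this would be called as `check_spark_output /datahub/Spark-Output`.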

Configuring S3

Step 1: Add AWS S3A Jars to Classpath

> wget -P /tmp \
    https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xf /tmp/hadoop-2.7.3.tar.gz \
    hadoop-2.7.3/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar \
    hadoop-2.7.3/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar
> mv hadoop-2.7.3/share/hadoop/tools/lib/*.jar spark-$VERSION/jars/
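
A missing jar usually surfaces only at job run time as a class-not-found error, so it can be worth checking the jars directory right after the copy. This is a small sketch with a hypothetical function name, `missing_s3a_jars`. The two file names are the ones extracted by the commands above.

```shell
# List any S3A jars that are missing from a Spark jars directory.
# Returns non-zero if at least one jar is absent.
missing_s3a_jars() {
  local jars_dir="$1" jar status=0
  for jar in hadoop-aws-2.7.3.jar aws-java-sdk-1.7.4.jar; do
    if [ ! -f "$jars_dir/$jar" ]; then
      echo "missing: $jar"
      status=1
    fi
  done
  return "$status"
}
```

A typical call would be `missing_s3a_jars spark-$VERSION/jars`, which prints nothing and returns zero when both jars are in place.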

Step 2: Configure FlashBlade Object Store

pureuser@irp210-c01-ch1-fm2> pureobjaccount create datascience
Name Created
ambari-user 2018-06-29 06:55:43 PDT
pureuser@irp210-c01-ch1-fm2> pureobjuser create datascience/spark-user
Name Access Key ID Created
datascience/spark-user - 2018-06-29 06:56:44 PDT
pureuser@irp210-c01-ch1-fm2> pureobjuser access-key create --user datascience/spark-user
Access Key ID Enabled Secret Access Key User
PSFBIAZFDDAAGEFK True 254300033/a3e2/DF81E2……HLNAGMLEHO datascience/spark-user
pureuser@irp210-c01-ch1-fm2> purebucket create --account ambari-user datahub
Name Account Used Created Time Remaining
datahub default 0.00 2019-01-09 05:42:06 PST -

Step 3: Configure S3A

Option 1: Config File
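
One way to keep these settings in a file is Spark's own `spark-defaults.conf`: Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configuration that S3A reads. The sketch below is an assumption about how such a file could look, not the listing from the original post. The function name `write_s3a_defaults` and the endpoint hostname are hypothetical; the property names match the ones used in the command-line option.

```shell
# Append S3A settings to a spark-defaults.conf style file.
# Arguments: file path, S3 endpoint, access key, secret key.
write_s3a_defaults() {
  local conf_file="$1" endpoint="$2" access_key="$3" secret_key="$4"
  cat >> "$conf_file" <<EOF
spark.hadoop.fs.s3a.endpoint ${endpoint}
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.fast.upload true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.fs.s3a.access.key ${access_key}
spark.hadoop.fs.s3a.secret.key ${secret_key}
EOF
}

# Demo against a temporary file; the hostname and keys are placeholders.
conf_file=$(mktemp)
write_s3a_defaults "$conf_file" flashblade-data-vip.example.com XXXXXXXXX YYYYYYYYYYYYYYY
cat "$conf_file"
```

In a real deployment the target would be `conf/spark-defaults.conf` under the Spark directory. Since it holds the secret key, the file should have restrictive permissions.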


Option 2: Environment Variables


Option 3: Command-line arguments

./bin/spark-submit XXXYYYZZZ \
-Dfs.s3a.endpoint= \
-Dfs.s3a.connection.ssl.enabled=false \
-Dmapreduce.fileoutputcommitter.algorithm.version=2 \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.access.key=XXXXXXXXX \
-Dfs.s3a.secret.key=YYYYYYYYYYYYYYY \
s3a://bucketname/input \
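
The `-D` arguments above follow the application name, so they are handed to the application itself and take effect only if it parses Hadoop-style options. A possible alternative, sketched here as an assumption rather than the author's method, is to pass the same properties to `spark-submit` directly through `--conf` with the `spark.hadoop.` prefix. The helper name `s3a_conf_flags` is hypothetical.

```shell
# Turn Hadoop "key=value" pairs into spark-submit flags, one per line.
# Spark copies any "spark.hadoop.*" property into the Hadoop configuration.
s3a_conf_flags() {
  local kv
  for kv in "$@"; do
    printf '%s spark.hadoop.%s\n' --conf "$kv"
  done
}

# Print the flags for the non-secret settings used in this post.
s3a_conf_flags \
  fs.s3a.connection.ssl.enabled=false \
  fs.s3a.fast.upload=true \
  mapreduce.fileoutputcommitter.algorithm.version=2
```

The output can be spliced into a command line with `./bin/spark-submit $(s3a_conf_flags ...) ...`, which works as long as no value contains whitespace. Credentials are left out of the example because command-line arguments are visible in the process list.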

Step 4: Access from Spark

val input = sc.textFile("s3a://bucketname/input-dir/")

Node Local Storage (Optional):

Case 1: Intermediate Results

scala> import org.apache.spark.storage.StorageLevel
scala> interm_result.persist(StorageLevel.DISK_ONLY)
res14: interm_result.type = MapPartitionsRDD[2] at map at <console>:26

Case 2: Centralized Logging

-e SPARK_LOG_DIR=/datahub/sparklogs \
-e SPARK_WORKER_DIR=/datahub/sparklogs

Automation Tip: Using the Pure Docker Volume Plugin

docker volume create --driver=pure -o size=1TiB \
  -o volume_label_selector="purestorage.com/backend=file" \
  sparkscratchvol
docker run -d --net=host \
  -v sparkscratchvol:/local \
  -e SPARK_LOCAL_DIRS=/local \
  $SPARK_IMAGE /opt/spark/sbin/start-slave.sh spark://$MASTER:7077




