Building a Scala executable package for Spark

This tutorial will help you compile a Scala application, deploy it to HDFS, and run it on Spark.

Dependencies

  1. A configured k3s cluster
  2. A working Hadoop + Spark setup

Configure your environment

# Install scala-sbt (https://www.scala-sbt.org/download.html)
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list

curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add -

sudo apt-get update
sudo apt-get install sbt
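
sbt builds are driven by a build.sbt file at the project root; the demo package in the next section ships with its own. Purely as a sketch, a minimal build.sbt for a Spark application looks roughly like this (the Spark version here is illustrative; the real build definition lives in images/hibench/pkg):

// Minimal sketch -- the actual build definition in images/hibench/pkg may differ.
// name, version and scalaVersion together yield the jar name
// automl-tunner_2.12-1.0.jar that shows up later in this tutorial.
name := "automl-tunner"
version := "1.0"
scalaVersion := "2.12.15"

// Spark is "provided": the cluster supplies it at runtime, so sbt compiles
// against it but does not bundle it into the package.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2" % "provided"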

Compile the demo sbt package and deploy it to HDFS

# The folder images/hibench/pkg contains a Scala package source with multiple examples
cd images/hibench/pkg

# Build it
./pkg_build.sh

# Deploy it to HDFS. Once inside HDFS, this package will be available to every pod
./pkg_deploy.sh

If you open the HDFS explorer, you will see the file "automl-tunner_2.12-1.0.jar" there.

Run the Spark application

# First, open a shell in a pod with access to Spark and Hadoop
sudo kubectl exec -it spark-primary -- bash

# From here you can also list the contents of HDFS
hadoop fs -ls /

# Now tell Spark to run our package
/usr/bin/time -v spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://spark-primary:7077 \
    --deploy-mode client \
    --conf spark.yarn.submit.waitAppCompletion=true \
    --conf spark.driver.host=$(hostname -I) \
    --num-executors 1 \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    hdfs://hadoop-primary:9000/automl-tunner_2.12-1.0.jar \
        1000
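
The --class flag selects the entry point inside the jar. SparkPi here is essentially the classic Pi estimator from Spark's bundled examples; a lightly abridged Scala sketch of it follows, so you can see where the output below comes from (the trailing argument, 1000, sets the number of sampling partitions):

package org.apache.spark.examples

import scala.math.random
import org.apache.spark.sql.SparkSession

// Estimate Pi by sampling random points in the unit square and counting
// the fraction that lands inside the unit circle.
object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Spark Pi").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2  // 1000 in the run above
    val n = math.min(100000L * slices, Int.MaxValue).toInt  // avoid int overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}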

Some details:

The output of a successful run will contain a line with the estimate of Pi, like

...
Pi is roughly 3.1415902341588566
...