Configuring a Hadoop + Spark setup
This tutorial explains how to install Hadoop and Spark on the local cluster formed by the Host machine and its CSD devices.
Dependencies
Prepare your nodes
HDFS requires a local folder on each CSD to store its data. By default this folder is /media/storage, which is the mount point of the second partition on every CSD. The Host must also have this folder, but there it is just a regular local directory. All commands must be executed on the Host machine (the one connected to the CSD devices).
# This will create /media/storage on every CSD and on the Host
storage_format.sh
# This will erase the contents of each folder created above. Useful for a clean install
storage_clear.sh
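As a quick check, you can confirm the folder exists on the Host and see which filesystem backs it (a minimal sketch; the exact mount layout may differ on your setup).
# The folder should exist and be writable
ls -ld /media/storage
# Shows the filesystem backing the folder
df -h /media/storage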
Deploy Hadoop and Spark
# Access the scripts folder for this task
cd kubernetes/bigdata2
# Choose the deploy mode
./deploy.sh hybrid # Configure the Host and all CSDs as Hadoop data nodes and Spark worker nodes
./deploy.sh csd # Configure only the CSDs as Hadoop data nodes and Spark worker nodes
./deploy.sh host # Configure only the Host as a Hadoop data node and Spark worker node
The operation above will take some time, as each CSD and the Host need to download the image from Docker Hub. Once it completes, you can do a sanity check by accessing the following URLs (or by listing the pods from the command line, as shown after this list).
- Spark interface - Lists every Spark task and every worker node.
- HDFS interface - Lists every data node and its capacity.
- HDFS explorer - Lists the files stored in HDFS.
- Hadoop interface - Lists Hadoop and MapReduce tasks.
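You can also list the pods from the Host; every pod should eventually reach the Running state.
# List all pods and the node each one runs on
sudo kubectl get pods -o wide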
Accessing the interactive shells: spark-shell and pyspark
The interactive shells are available inside the pods, which are reachable through DNS names such as spark-primary and hadoop-primary. To access them, use one of the commands below.
# Method 1: Open a bash shell inside the pod and then start one of the interactive consoles
sudo kubectl exec -it spark-primary -- bash # To access the container
pyspark # For the python interface
spark-shell # For the scala interface
PYSPARK_PYTHON=python3 pyspark # For the python interface using python3
# Method 2: Run an interactive console directly from the Host
sudo kubectl exec -it spark-primary -- spark-shell # For the scala interface
sudo kubectl exec -it spark-primary -- pyspark # For the python interface
sudo kubectl exec -it spark-primary -- bash -c "PYSPARK_PYTHON=python3 pyspark" # For the python3 interface
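As a quick smoke test, the sketch below pipes a single line of Python into pyspark on the spark-primary pod; it should print 4950 somewhere in the shell output.
# Sums the numbers 0..99 on the cluster and prints the result
sudo kubectl exec spark-primary -- bash -c "echo 'print(sc.parallelize(range(100)).sum())' | pyspark"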
Accessing HDFS from the command line
The following is a list of Hadoop commands that may help during development.
# Access the pod
sudo kubectl exec -it spark-primary -- bash
# List files
hadoop fs -ls / # Default path
hadoop fs -ls hdfs://bigdata2-primary:9000/ # Full path
# Touch file
hadoop fs -touch /test
# Create a directory
hadoop fs -mkdir /diego
hadoop fs -mkdir -p /diego/fonseca/pereira/de/souza
# Remove directory
hadoop fs -rm -r -f /diego
hadoop fs -rmdir --ignore-fail-on-non-empty /diego
hadoop fs -rmdir /diego
# Send file to hdfs
hadoop fs -put gut.txt /gut2.txt
hadoop fs -copyFromLocal gut.txt /gut2.txt
# Get file from hdfs
hadoop fs -get /gut2.txt gut3.txt
hadoop fs -copyToLocal /gut2.txt gut5.txt
# Copy files inside hdfs
hadoop fs -cp /diego /diego2
# Move files inside hdfs
hadoop fs -mv /diego /diego2
# Free space
hadoop fs -df
# Used space
hadoop fs -du /diego
# Append data to remote file
hadoop fs -appendToFile ./local /kitty
# Remote cat
hadoop fs -cat /kitty
# Remote Checksum
hadoop fs -checksum /kitty
# Count files
hadoop fs -count /diego
# Find files
hadoop fs -find /diego -name '*seca'
# Show first 1K bytes in file
hadoop fs -head /kitty
# Show last 1K bytes in file
hadoop fs -tail /kitty
hadoop fs -tail -f /kitty # Follow modifications
# Set replication number for an individual file
hdfs dfs -setrep -w 5 /test.snappy.parquet
# Display file stats
hadoop fs -stat /kitty
# Test if the path exists (result returned in the exit code $?)
hadoop fs -test -e /kitty
# Test if it is a directory (result returned in the exit code $?)
hadoop fs -test -d /kitty
# Test if it is a file (result returned in the exit code $?)
hadoop fs -test -f /kitty
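The -test pattern is handy in scripts; a small sketch combining it with the touch command from above:
# Create /kitty only if it does not exist yet
hadoop fs -test -e /kitty || hadoop fs -touch /kitty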
Other examples
Try running some basic examples to check that everything is working.
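For instance, the stock SparkPi and MapReduce pi examples can be run from inside the spark-primary pod. This is only a sketch: it assumes the image ships the standard example jars, that SPARK_HOME and HADOOP_HOME are set, and that spark-submit picks up the cluster master from the image's default configuration.
# Access the pod
sudo kubectl exec -it spark-primary -- bash
# Approximate pi with Spark using the bundled example jar
spark-submit --class org.apache.spark.examples.SparkPi "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100
# Approximate pi with a MapReduce job using the bundled examples jar
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100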
Undeploy Hadoop and Spark
# Access the scripts folder for this task
cd kubernetes/bigdata2
# Undo the operations above
./undeploy.sh
Common issues
Worker nodes are not showing in the Spark interface
This is usually caused by the firewall in Ubuntu. You can temporarily disable it with the following command.
# To stop the firewall
sudo service ufw stop
After disabling it, undeploy and deploy again.
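For reference, the redeploy uses the same scripts as before.
# Access the scripts folder
cd kubernetes/bigdata2
# Undeploy and deploy again (use the same mode you chose originally)
./undeploy.sh
./deploy.sh hybrid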
Data nodes are not showing in the HDFS interface
This is usually caused by the data nodes being unable to access the folder /media/storage, either because it doesn't exist or because its permissions are wrong. Running storage_format.sh and storage_clear.sh on the Host usually fixes it.
# Format, create and set permissions in /media/storage on every node
storage_format.sh
# Erase the content in every /media/storage
storage_clear.sh
Workers are not returning after rebooting the machine
This is a known bug in this setup and one of the reasons it is currently recommended only for experimentation. Redeploying Hadoop and Spark fixes the issue.
# Access the scripts folder
cd kubernetes/bigdata2
# Undeploy them
./undeploy.sh
# Deploy them again
./deploy.sh hybrid