Installing Spark on Linux

Here we’ll learn how to install Spark 3.0.3 on Linux. We tested it on Ubuntu 20.04 (and on WSL), but it should work on other Linux distros as well.

Installing Java

Download OpenJDK 11 or Oracle JDK 11 (it’s important that the version is 11: Spark requires Java 8 or 11).

We’ll use OpenJDK.

Download it (e.g. to ~/spark):

wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz

Unpack it:

tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz

Define JAVA_HOME and add it to PATH:

export JAVA_HOME="${HOME}/spark/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"

We add ${JAVA_HOME}/bin to PATH so that the java command is available in the shell.

Check that it works:

java --version

Output:

openjdk 11.0.2 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

Remove the archive:

rm openjdk-11.0.2_linux-x64_bin.tar.gz

Installing Spark

Download Spark. Use version 3.0.3:

wget https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz

Unpack:

tar xzfv spark-3.0.3-bin-hadoop3.2.tgz

Remove the archive:

rm spark-3.0.3-bin-hadoop3.2.tgz

Add it to PATH:

export SPARK_HOME="${HOME}/spark/spark-3.0.3-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin:${PATH}"
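To confirm that the Spark binaries are now on PATH, you can check where spark-shell resolves (a quick, optional sanity check):

which spark-shell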

Testing Spark

Execute spark-shell and run the following:

val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
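If everything is set up correctly, the last command should return an array containing the numbers 1 through 9 (the only elements smaller than 10).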

If we don’t want to set these variables every time we log in, we can add the export lines above to our .bashrc file (e.g. with nano). Run source ~/.bashrc to reload the file in the current session.
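For example, assuming the same ~/spark locations used above, the end of ~/.bashrc could look like this (the PYTHONPATH lines from the PySpark section below can be added there as well):

export JAVA_HOME="${HOME}/spark/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"
export SPARK_HOME="${HOME}/spark/spark-3.0.3-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin:${PATH}"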

PySpark

This section assumes you already have Python installed.

To run PySpark, we first need to add it to PYTHONPATH:

export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH"

Make sure that the py4j version in the filename under ${SPARK_HOME}/python/lib/ matches the one in the export above, or you will get ModuleNotFoundError: No module named 'py4j' when executing import pyspark.
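To see which py4j archive your Spark distribution ships with, list the directory:

ls "${SPARK_HOME}/python/lib/"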

For example, if the file under ${SPARK_HOME}/python/lib/ is py4j-0.10.9.3-src.zip, then the export PYTHONPATH statement above should be changed to:

export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"
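Once PYTHONPATH points at the right archive, a quick sanity check is to import pyspark from the command line (assuming python3 is the interpreter you’ll be using):

python3 -c "import pyspark; print(pyspark.__version__)"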

Now you can run Jupyter or IPython to test if things work. Go to some other directory, e.g. ~/tmp.

Download a CSV file that we’ll use for testing:

wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
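Optionally, take a quick look at the first few rows to make sure the download worked:

head -n 5 taxi+_zone_lookup.csv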

Now let’s run ipython (or jupyter notebook) and execute:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

df = spark.read \
    .option("header", "true") \
    .csv('taxi+_zone_lookup.csv')

df.show()

Test that writing works as well:

df.write.parquet('zones')
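To double-check the result, you can read the Parquet files back in the same session ('zones' is just the output folder used above) and display them:

spark.read.parquet('zones').show()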