Installing Spark on Linux
Here we’ll learn how to install Spark 3.0.3 on Linux. We tested it on Ubuntu 20.04 (also WSL), but it should work for other Linux distros as well.
Installing Java
Download OpenJDK 11 or Oracle JDK 11 (it’s important that the version is 11 - Spark requires Java 8 or 11).
We’ll use OpenJDK.
Download it (e.g. to ~/spark):
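For example, with the OpenJDK 11.0.2 build from jdk.java.net (the exact URL is just one option; any JDK 11 build for Linux x64 works):

```bash
# create the install directory (if needed) and download into it
mkdir -p ~/spark && cd ~/spark

# OpenJDK 11.0.2 build from jdk.java.net
wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
```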
Unpack it:
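Assuming the archive name from the previous step:

```bash
tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz
```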
Define JAVA_HOME and add it to PATH:
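Assuming the JDK was unpacked into ~/spark (adjust the directory name if your build differs):

```bash
export JAVA_HOME="${HOME}/spark/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"
```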
We need to add JAVA_HOME to our PATH to be able to use Java from the shell.
Check that it works:
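For example, print the version and confirm the binary resolves to the one we just installed:

```bash
which java
java --version
```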
Output (exact build details will vary; for OpenJDK 11.0.2 it should look roughly like this):
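```
openjdk 11.0.2 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
```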
Remove the archive:
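Again assuming the archive name used above:

```bash
rm openjdk-11.0.2_linux-x64_bin.tar.gz
```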
Installing Spark
Download Spark. Use version 3.0.3:
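One option is the pre-built package for Hadoop 3.2 from the Apache archive (a mirror or a different Hadoop build works too):

```bash
wget https://archive.apache.org/dist/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
```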
Unpack:
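Assuming the file name from the download step:

```bash
tar xzfv spark-3.0.3-bin-hadoop3.2.tgz
```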
Remove the archive:
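Assuming the same file name:

```bash
rm spark-3.0.3-bin-hadoop3.2.tgz
```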
Add it to PATH:
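Assuming Spark was unpacked under ~/spark as above:

```bash
export SPARK_HOME="${HOME}/spark/spark-3.0.3-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin:${PATH}"
```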
Testing Spark
Execute spark-shell and run the following:
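Any small job will do for a smoke test; for example, parallelize a range and filter it (this snippet is just an illustration, the names are arbitrary):

```scala
// sc (the SparkContext) is already defined inside spark-shell
val data = 1 to 10000
val distData = sc.parallelize(data)

// should return the numbers below 10
distData.filter(_ < 10).collect()
```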
If we don’t want to set these variables every time we log in, we can add the export lines above to our .bashrc file (using nano, for example) and then run `source .bashrc` to reload it.
PySpark
This document assumes you already have Python installed.
To run PySpark, we first need to add it to PYTHONPATH:
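A typical setup looks like this; the py4j file name below is an assumption, check it against the note that follows:

```bash
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH"
```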
Make sure that the py4j version in the export above matches the file name under ${SPARK_HOME}/python/lib/, or you will encounter ModuleNotFoundError: No module named 'py4j' when executing import pyspark.
For example, if the file under ${SPARK_HOME}/python/lib/ is py4j-0.10.9.3-src.zip, then the export PYTHONPATH statement above should be changed to:
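```bash
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"
```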
Now you can run Jupyter or IPython to test if things work. Go to some other directory, e.g. ~/tmp.
Download a CSV file that we’ll use for testing:
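Any small CSV will do; as an example, the NYC TLC taxi zone lookup table (the URL is an assumption and may have moved, so substitute any CSV you have at hand):

```bash
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
```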
Now let’s run ipython (or jupyter notebook) and execute:
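A minimal check, assuming the CSV from the previous step was saved as taxi+_zone_lookup.csv (adjust the file name to whatever you downloaded):

```python
import pyspark
from pyspark.sql import SparkSession

# start a local Spark session using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()

# read the CSV downloaded earlier
df = spark.read \
    .option("header", "true") \
    .csv("taxi+_zone_lookup.csv")

df.show()
```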
Test that writing works as well:
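For example, write the same dataframe back out as Parquet (the output directory name here is arbitrary):

```python
df.write.parquet("zones")
```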