What is Apache Spark?
What is PySpark?
Since Spark 2.2.0, PySpark is also available as a Python package on PyPI, which can be installed using pip. In Spark 2.1 it was available as a Python package too, but not on PyPI, so one had to install it manually by executing the setup.py in <spark-directory>/python and, once installed, add the path to the PySpark lib to the PATH. Python (2.6 or higher) and Apache Spark are the requirements for PySpark.
Apache Spark Standalone Cluster Setup
Requirements
- Java (JRE) 7+
- Python 2.6 or higher
Step 0: Preliminaries
Run yum update
Stop the firewall
Disable SELinux
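On a CentOS/RHEL-style system (yum is already assumed by the first step) these preliminaries typically translate to something like the following; the exact firewall service depends on the distribution:
yum -y update                  # refresh installed packages
systemctl stop firewalld       # stop the firewall (service iptables stop on older releases)
setenforce 0                   # put SELinux into permissive mode until the next reboot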
The above command disables SELinux only for the current session, i.e. until the next reboot; to permanently disable it, set SELINUX=disabled in the /etc/selinux/config file.
Step 1: Install Java
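Assuming the same yum-based system as in Step 0, OpenJDK 8 can be installed roughly as follows; running java -version afterwards should report something like the output below:
yum install -y java-1.8.0-openjdk
java -version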
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)
Step 2: Install Apache Spark
Download the Spark binaries
wget http://www-eu.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
Untar the binary
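For the archive downloaded above, something along these lines; extracting into /opt is an assumption here, chosen to match the /opt/spark path used later in this guide:
tar -xzf spark-2.2.1-bin-hadoop2.7.tgz -C /opt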

Create a symlink
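A symlink keeps the Spark path version-independent; with the layout assumed above:
ln -s /opt/spark-2.2.1-bin-hadoop2.7 /opt/spark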
Step 3: Launch standalone Spark cluster
Option 1: Starting the Cluster manually
Start the master
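With Spark installed under /opt/spark as above, the master is started with the bundled script; by default it accepts worker connections on port 7077 and serves a web UI on port 8080:
/opt/spark/sbin/start-master.sh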

Start the slave
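Each worker (slave) is then pointed at the master's URL; <master-host> below is a placeholder for the master's hostname or IP:
/opt/spark/sbin/start-slave.sh spark://<master-host>:7077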


Option 2: Start cluster using the launch scripts
sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
sbin/start-all.sh - Starts both a master and a number of slaves as described above.
sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh - Stops both the master and the slaves as described above.
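For example, to bring up the whole cluster from the master node, list the worker hostnames (placeholders below) in conf/slaves and run start-all.sh; the launch scripts reach the workers over SSH, so password-less SSH access from the master is expected:
# /opt/spark/conf/slaves -- one worker hostname per line
worker1.example.com
worker2.example.com

/opt/spark/sbin/start-all.sh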
Add SPARK_HOME
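Assuming the /opt/spark symlink created in Step 2, point SPARK_HOME at it (typically in ~/.bashrc or /etc/profile.d/ so it persists across sessions):
export SPARK_HOME=/opt/spark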
export PATH=$SPARK_HOME/bin:$PATH
You can optionally configure the Spark cluster further by setting various environment variables in the conf/spark-env.sh file. To get started, use the conf/spark-env.sh.template to create your env file, and copy it to all your worker machines for the settings to take effect.
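For instance, something like the following; the variable names come from the template, while the values and hostname are only illustrative:
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

# conf/spark-env.sh
SPARK_MASTER_HOST=master.example.com    # bind the master to this hostname/IP
SPARK_WORKER_CORES=4                    # cores each worker may use
SPARK_WORKER_MEMORY=4g                  # memory each worker may use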
Step 4: Install PySpark
PySpark Shell
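The interactive shell ships with Spark itself, so nothing extra needs to be installed; with $SPARK_HOME/bin on the PATH it can be launched locally or against the standalone master (placeholder hostname below):
pyspark                                        # local mode
pyspark --master spark://<master-host>:7077    # attach to the standalone cluster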

PySpark from PyPI
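As noted in the introduction, since Spark 2.2.0 PySpark can also be installed as a regular Python package with pip:
pip install pyspark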
Step 5: An example
Directly using the PySpark shell
>>> seqs = sc.parallelize([u'ATATCCCCGGGAT', u'ATCGATCGATAT'])   # one way to create the RDD collected below
>>> seqs.collect()
[u'ATATCCCCGGGAT', u'ATCGATCGATAT']
>>>
>>> ones = seqs.flatMap(lambda x : [(c, 1) for c in list(x)])
>>> ones.collect()
[(u'A', 1), (u'T', 1), (u'A', 1), (u'T', 1), (u'C', 1), (u'C', 1), (u'C', 1), (u'C', 1), (u'G', 1), (u'G', 1), (u'G', 1), (u'A', 1), (u'T', 1), (u'A', 1), (u'T', 1), (u'C', 1), (u'G', 1), (u'A', 1), (u'T', 1), (u'C', 1), (u'G', 1), (u'A', 1), (u'T', 1), (u'A', 1), (u'T', 1)]
>>>
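The session above stops at the (base, 1) pairs; a natural next step, sketched here, is to sum them per base with reduceByKey:
>>> counts = ones.reduceByKey(lambda a, b: a + b)
>>> counts.collect()   # expected counts for the data above: A=7, T=7, C=6, G=5 (pair order may vary)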
You can also execute a Python script directly, using the Python interpreter provided out of the box (a run example follows the listing):
from pyspark import SparkContext
logFile = "/opt/spark/README.md" # or some file on your system
sc = SparkContext("local", "Example App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)

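Assuming the script above is saved as example_app.py (a name chosen only for illustration), it can be submitted to Spark, or run with plain python when PySpark was installed from PyPI:
$SPARK_HOME/bin/spark-submit example_app.py
python example_app.py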