Apache Spark and PySpark on CentOS/RHEL 7.x

What is Apache Spark

You may have noticed that wherever big data is discussed, the name Apache Spark eventually comes up. In the simplest terms, it is a large-scale data processing engine: a fast framework that provides APIs for connecting to data sources and performing big data processing. Spark is one of the largest open-source data processing projects and has been adopted by major companies; Yahoo, eBay, and Netflix run massive Spark deployments, processing multiple petabytes of data on clusters of over 8,000 nodes.
Apache Spark can run as a standalone cluster (which is what we will do in this tutorial), or with Mesos or YARN as the cluster manager. Spark can work with data from a variety of sources: AWS S3, HDFS, Cassandra, Hive (structured data), HBase, or any other Hadoop data source. Above all, what keeps Spark in high demand are its included libraries, MLlib, SQL and DataFrames, GraphX, and Spark Streaming, which cover the main data processing use cases and can all be combined within the same application.
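As a quick illustration of combining those libraries, here is a minimal PySpark sketch that loads a file with the DataFrame API and then queries the same data with Spark SQL in one application. It assumes PySpark is installed and a local or standalone master is reachable; the file path and column names are hypothetical.

    # Minimal PySpark sketch: DataFrame API and Spark SQL in the same application.
    # Assumes PySpark is installed; the input path and columns are hypothetical.
    from pyspark.sql import SparkSession

    # Create a session ("local[*]" here; point .master() at a standalone cluster URL if you have one)
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("spark-intro-example") \
        .getOrCreate()

    # Load structured data with the DataFrame API
    df = spark.read.csv("/tmp/events.csv", header=True, inferSchema=True)

    # Query the same data with Spark SQL
    df.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type").show()

    spark.stop()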
