Apache Spark and PySpark on CentOS/RHEL 7.x

What is Apache Spark

You may have noticed that wherever big data comes up, the name Apache Spark eventually follows. In the simplest terms, it is a large-scale data processing engine: a fast framework that provides APIs for connecting to data sources and performing big data processing. Spark is one of the most widely adopted open-source data processing engines and has been taken up by large companies – Yahoo, eBay, and Netflix all run massive-scale Spark deployments, processing multiple petabytes of data on clusters of over 8,000 nodes.
Apache Spark can run as a standalone cluster (which is what we'll set up in this tutorial), or under Mesos or YARN as the cluster manager. Spark can work with data from various sources: AWS S3, HDFS, Cassandra, Hive (structured data), HBase, or any other Hadoop data source. Above all, what keeps Spark in high demand are its bundled libraries – MLlib, SQL and DataFrames, GraphX, and Spark Streaming – which cover the main data processing use cases and can all be combined within the same application.
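Just to make the standalone-cluster idea concrete, here is a rough sketch of what bringing one up looks like with the scripts shipped in the Spark distribution – the /opt/spark path and the <master-host> placeholder are assumptions, and on Spark 2.x the worker script is called start-slave.sh rather than start-worker.sh:

Start the master (its URL, spark://<master-host>:7077 by default, shows up in the master log and on the web UI at port 8080):
$ /opt/spark/sbin/start-master.sh

Start a worker on each node, pointing it at the master:
$ /opt/spark/sbin/start-worker.sh spark://<master-host>:7077

Open an interactive PySpark shell against the cluster:
$ pyspark --master spark://<master-host>:7077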


RabbitMQ Cluster on CentOS 7

Step 0. Preliminaries

Run yum update

$ sudo yum update

Stop the firewall

To cluster, the nodes need ports 5672 (AMQP), 4369 (epmd, used for peer discovery), and 25672 (inter-node and CLI tool communication) to be reachable between them. For this tutorial we'll simply stop the firewall:
$ sudo systemctl stop firewalld
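If you'd rather leave firewalld running, an alternative (not covered in the original steps) is to open just those ports – roughly like this, adjusting the zone if you don't use the default one:

$ sudo firewall-cmd --permanent --add-port=5672/tcp
$ sudo firewall-cmd --permanent --add-port=4369/tcp
$ sudo firewall-cmd --permanent --add-port=25672/tcp
$ sudo firewall-cmd --reload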

Disable SELinux

$ sudo setenforce 0
The above command puts SELinux into permissive mode for the current session only, i.e. until the next reboot – to permanently disable it, set SELINUX=disabled in the /etc/selinux/config file.
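One way to make that permanent change – a sketch assuming the file still has the stock SELINUX=enforcing line – is:

$ sudo sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config
$ grep ^SELINUX= /etc/selinux/config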

Set host names

Set the host names before installing RabbitMQ – changing them later will most likely break the installation.
We’ll name the nodes/VMs rabbit1 (192.168.40.192) and rabbit2 (192.168.40.193) – the default is localhost. One way to set them is sketched below.
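A sketch of one way to do this on CentOS 7 – run the matching command on each node, then add both entries to /etc/hosts on both machines so the nodes can resolve each other by name:

On rabbit1:
$ sudo hostnamectl set-hostname rabbit1

On rabbit2:
$ sudo hostnamectl set-hostname rabbit2

Append to /etc/hosts on both nodes:
192.168.40.192 rabbit1
192.168.40.193 rabbit2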
