I don’t know about other programming languages, but if you use Python or Django, you have probably heard about Celery quite a few times; if not, it is worth looking into. As stated on the Celery project website:
“Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.”
In the case of a web service (the most common use case), asynchronous task queues are utilities that push time-consuming tasks into the background while the response to a user request is sent back promptly. These delegated tasks can be anything from sending a few notifications or dispatching emails to updating system logs or an internal ERP. Running such tasks in line with request processing can delay the response to the user considerably.
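The idea of handing slow work to the background can be sketched with nothing but the Python standard library. This is not Celery's API, only an in-process illustration of the pattern; `send_email`, `handle_request`, and the address used are hypothetical names made up for the example.

```python
import queue
import threading
import time

task_queue = queue.Queue()
results = []


def worker():
    """Pull tasks off the queue and run them until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            task_queue.task_done()
            break
        func, args = task
        results.append(func(*args))
        task_queue.task_done()


def send_email(address):
    """Hypothetical slow task; the sleep stands in for network I/O."""
    time.sleep(0.1)
    return f"sent to {address}"


def handle_request(address):
    """Enqueue the slow work and respond without waiting for it."""
    task_queue.put((send_email, (address,)))
    return "202 Accepted"


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()

response = handle_request("user@example.com")

# Only for this demo: wait for the background work, then stop the worker.
task_queue.join()
task_queue.put(None)
worker_thread.join()
```

A real task queue such as Celery moves the queue into a separate message broker and runs the workers as separate processes, possibly on other machines, but the request handler still only enqueues and returns.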
If you’re going through interviews for a Python developer position, preparing for one, or are just a curious developer, you had better be clear on the concept of decorators in Python.
I won’t be delving into what design patterns are or why you should use them whenever possible. This post is merely about understanding and writing decorators in Python. You can find a plethora of posts about Python decorators; my motivation is that everyone has their own way of explaining things, especially technical concepts.
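As a minimal sketch of what a decorator looks like, the example below wraps a function so that each call is timed. The names `timed`, `last_elapsed`, and `add` are made up for illustration and do not come from any library.

```python
import functools
import time


def timed(func):
    """Decorator that records how long the most recent call to func took."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    wrapper.last_elapsed = None  # no call has been timed yet
    return wrapper


@timed
def add(a, b):
    """Return the sum of a and b."""
    return a + b


total = add(2, 3)
```

Writing `@timed` above `add` is shorthand for `add = timed(add)`: the decorator takes a function and returns a replacement that adds behaviour around the original call.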
Step 0 – Update and upgrade
We are using Ubuntu 16.04 LTS for this tutorial.
apt-get update updates the list of available packages and their versions, but it does not install or upgrade any packages. After the lists are updated, the package manager knows about available updates for the software you have installed, and apt-get upgrade then installs the newer versions of those packages.
$ sudo apt-get update
$ sudo apt-get upgrade
A regular Ubuntu release comes with 9 months of support, except for the LTS (Long Term Support) versions. Ubuntu 14.04 and 16.04, both LTS releases, are still widely used in production. As a Python developer, the first thing I need to do on a fresh Ubuntu 14.04 or 16.04 machine is update Python. Ubuntu 14.04 ships with Python 3.4 and 16.04 comes with Python 3.5. This blog post is about installing Python 3.6 on Ubuntu 14.04 or 16.04 LTS.
What is Apache Spark
You may have noticed that wherever there is talk about big data, the name Apache Spark eventually comes up. In the simplest words, it’s a large-scale data processing engine. Apache Spark is a fast data processing framework that provides APIs to connect to data sources and perform big data processing. As the largest open-source data processing engine, Spark has been adopted by large companies: Yahoo, eBay, and Netflix have massive-scale Spark deployments, processing multiple petabytes of data on clusters of over 8,000 nodes.
Apache Spark can be started as a standalone cluster (which is what we’ll do in this tutorial), or with Mesos or YARN as the cluster manager. Spark can work with data from various sources: AWS S3, HDFS, Cassandra, Hive (structured data), HBase, or any other Hadoop data source. Above all, what puts Spark in such high demand is its bundled libraries, MLlib, SQL and DataFrames, GraphX, and Spark Streaming, which cover the main data processing use cases and can be combined in the same application.
What is Jupyter Notebook
If you’re a Python developer, or someone who has to interact with Python, you have probably been hearing or seeing the term Jupyter Notebook quite a lot while reading articles or looking for solutions online.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
What is Anaconda
Anaconda is a free, open-source distribution of Python (and of the R programming language), intended for large-scale data processing and analysis, and for scientific computing. The Anaconda Python distribution is managed and developed by Continuum Analytics.
Anaconda (“Anaconda Distribution”) is a free, easy-to-install package manager, environment manager, Python distribution, and collection of over 720 open source packages with free community support. Hundreds more open source packages and their dependencies can be installed with a simple “conda install [packagename]”. It’s platform-agnostic and can be used on Windows, OS X and Linux.
I decided to write this post because when I first tried conda (the package manager for the Anaconda Python distribution), my first question was in what ways conda is better than pip, and so why one should prefer conda over the de facto pip. Here I have put together a comprehensive post about getting started with conda, i.e. what extra conda can offer.
A short comparison
pip
- Can only be used for Python packages.
- The package manager supported by the Python foundation, hence widely used.
conda
- Handles library dependencies even outside Python, i.e. packages for C libraries, R packages, or really anything.
- Supports virtual environments out of the box.
- Developed for use with the Anaconda Python distribution; it can be used with the standard Python distribution, but that is strongly discouraged.
“In Unix-based computer operating systems, init (short for initialization) is the first process started during booting of the computer system. Init is a daemon process that continues running until the system is shut down.”
Elasticsearch is a distributed storage and real-time search engine.
- Distributed storage: you just need to set up and add Elasticsearch nodes, and it will keep the data distributed across the cluster nodes. This distribution also makes the data durable and highly available.
- Real-time search engine: you can query the data the moment it has been written.
Because of these two attributes, you have probably been hearing and reading about Elasticsearch wherever there’s a discussion of real-time data analysis. It would not be an overstatement to say that technologies like Elasticsearch set the foundation for any efficient and reliable search engine.
Elasticsearch itself is implemented in Java, but it provides a good RESTful API, which makes it possible to use it from any programming language.
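To make the point about the REST interface concrete, here is a rough sketch that builds a search request using only Python's standard library. The node address `http://localhost:9200`, the index name `articles`, and the helper `build_search_request` are assumptions made for the example; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build a POST request for the _search endpoint with a match query."""
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values: a local node on port 9200 and an index called "articles".
request = build_search_request("http://localhost:9200", "articles", "title", "celery")
```

With a node actually running at that address, passing `request` to `urllib.request.urlopen` would return the search results as JSON. Since this is plain HTTP, the same call can be made from any language with an HTTP client.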