Recent Tutorials and Articles
    Getting Started with Apache Spark
    Published on: 21st July 2016
    Posted By: Amit Kumar

    This tutorial introduces you to Apache Spark along with its built-in libraries by discussing its technical architecture and use cases.

    What is Apache Spark?

    Apache Spark is a distributed, scalable and lightning-fast big data framework for batch processing. It is often compared with Hadoop MapReduce in this capacity. However, it provides the following benefits over Hadoop MapReduce -

    1. Distributed in-memory data processing, making it up to 100 times faster than Hadoop MapReduce
    2. An advanced DAG (Directed Acyclic Graph) based optimized execution engine, in contrast to the simple MapReduce programming model, making it up to 10 times faster even on disk. The MapReduce programming model is generally not flexible enough for modeling complex business problems and requires chaining of MapReduce jobs, making it even slower.
    3. Applications can be written in many programming languages such as Java, Scala, Python and R. Hadoop MapReduce, on the other hand, mainly focuses on Java
    4. Runs on a variety of cluster management frameworks such as Hadoop YARN, Mesos and the Spark standalone scheduler
    5. Built-in libraries for stream processing, machine learning and interactive data analytics

    While Apache Spark has many advantages over Hadoop MapReduce, it is not as stable as Hadoop and hence updates are released frequently. This generally poses a challenge for companies to keep their infrastructure up to date.
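    The DAG-based execution mentioned above works by recording transformations lazily and only computing when a result is actually requested, which lets the engine optimize the whole pipeline at once. Here is a minimal illustrative sketch in plain Python (not the actual Spark API) of that lazy-evaluation idea; the `LazyPipeline` class and its methods are invented for illustration only:

    ```python
    # Illustrative sketch (plain Python, NOT the Spark API): transformations are
    # only recorded; nothing runs until an action (collect) is called, similar in
    # spirit to how Spark builds a DAG of transformations before executing it.

    class LazyPipeline:
        def __init__(self, data):
            self.data = data
            self.stages = []  # recorded transformations (the "DAG")

        def map(self, fn):
            self.stages.append(("map", fn))
            return self

        def filter(self, pred):
            self.stages.append(("filter", pred))
            return self

        def collect(self):
            """Action: only now is the recorded pipeline actually executed."""
            result = list(self.data)
            for kind, fn in self.stages:
                if kind == "map":
                    result = [fn(x) for x in result]
                else:
                    result = [x for x in result if fn(x)]
            return result

    pipeline = LazyPipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(pipeline.collect())  # [0, 4, 16, 36, 64]
    ```

    Because the full chain of stages is known before execution, an engine like Spark can fuse and re-order them, whereas chained MapReduce jobs must each materialize their output to disk.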


    Technical Architecture of Apache Spark

    Spark is a distributed framework that allows data computation on a cluster of machines. Here is the high-level technical architecture of Apache Spark -

    Apache Spark Technical Architecture

    The following components are shown in the above architecture diagram -

    1. Cluster Manager: Apache Spark can be configured to run with a generic cluster manager such as Hadoop YARN, Apache Mesos or Spark Standalone. By default, Spark runs with its own standalone cluster manager. Apache Spark relies on the cluster manager to manage cluster resources (such as CPU and memory). It uses the cluster manager to allocate resources for an application and releases the resources back to the cluster manager once computation is complete.
    2. Worker Node: A worker node is responsible for running worker processes. A worker process uses an executor runner to launch executor processes.
      • Executor: Executor processes are responsible for executing multiple tasks of an application. It is important to note that an executor does NOT run tasks from multiple applications; this scheme helps achieve isolation between applications. An executor uses multiple threads to execute tasks, and hence application code needs to be thread-safe.
      • Cache: Each executor also contains a local cache that is used to store the outputs of tasks. This increases throughput and performance by reusing cached task outputs instead of re-calculating them, making Apache Spark a better choice than Hadoop MapReduce for iterative algorithms.
    3. Driver Program: This is the starting point of an application and is responsible for creating the SparkContext. It negotiates resources with the cluster manager and submits tasks to the allocated executors for execution. Results of all the tasks are sent back to the driver program for aggregation or storage. It also monitors the status of tasks and re-schedules them in case of failures. This program plays the same role as the Application Master in Hadoop YARN.
    4. SparkContext: This object is responsible for the execution and co-ordination of all the tasks of an application. It encapsulates the application code (such as JAR or Python files) and configuration (such as input and output location information) and passes these to the executors responsible for executing the tasks. Additionally, the driver program creates RDDs (Resilient Distributed Datasets) using the SparkContext. An RDD is the basic abstraction for an immutable, partitioned collection of elements that can be operated on in parallel. RDDs support basic operations on data, such as map, filter, reduce and persist.
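    To make the RDD idea above concrete, here is a conceptual sketch in plain Python (NOT the PySpark API; the `MiniRDD` class is invented for illustration). It models an RDD as an immutable, partitioned collection: map and filter run independently per partition (as executors would run tasks in parallel), and reduce combines per-partition results the way a driver aggregates executor outputs:

    ```python
    from functools import reduce

    # Conceptual sketch of an RDD: an immutable collection split into partitions.
    # Transformations return a NEW MiniRDD; the original is never modified.

    class MiniRDD:
        def __init__(self, partitions):
            self.partitions = [list(p) for p in partitions]  # defensive copies

        def map(self, fn):
            # Applied per partition, as independent tasks would be on executors.
            return MiniRDD([[fn(x) for x in p] for p in self.partitions])

        def filter(self, pred):
            return MiniRDD([[x for x in p if pred(x)] for p in self.partitions])

        def reduce(self, fn):
            # Reduce within each partition first, then combine the partial
            # results, as a driver would aggregate executor outputs.
            partials = [reduce(fn, p) for p in self.partitions if p]
            return reduce(fn, partials)

    rdd = MiniRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
    total = rdd.map(lambda x: x * 2).filter(lambda x: x > 4).reduce(lambda a, b: a + b)
    print(total)  # 36
    ```

    The real Spark API has the same shape (transformations build new RDDs, actions like reduce trigger computation), but distributes the partitions across executor processes on the cluster.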


    Apache Spark Built-in Libraries

    Apache Spark comes with the following built-in libraries for performing various functions in an application -

    1. Spark Streaming: It is an adapter over the core Apache Spark API. It enables high-throughput, fault-tolerant, scalable stream processing of live data streams, such as tweets flowing from the Twitter stream. This library can be used to ingest data from Apache Kafka, Apache Flume, Twitter, Amazon Kinesis or TCP sockets. The data can then be processed using operations such as map, reduce and join. After processing, the output can be stored into filesystems, databases and client dashboards.
    2. DataFrames, Datasets and SQL: Spark SQL can be used to execute SQL (or HiveQL) queries on data stored in Apache Spark. A DataFrame, in turn, is a distributed collection of data organised into named columns, like a table in a relational database or a data frame in R/Python, but with much richer optimization under the hood. The Dataset API was added in Spark 1.6 to combine the benefits of RDDs with those of Spark SQL's optimized execution engine. Essentially, it is like using RDDs on top of Spark SQL to query data interactively.
    3. Spark MLlib: This library is used for scalable and easy machine learning. It consists of common learning algorithms such as classification, regression, clustering, collaborative filtering, dimensionality reduction and higher level pipeline APIs.
    4. Spark GraphX: This library is used for graph computation. It extends the Apache Spark RDD to introduce a new Graph abstraction - a directed multigraph with properties attached to each vertex and edge. It includes a collection of graph algorithms and builders to simplify graph analytics tasks.
    5. SparkR: SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 1.6.2, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.
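    The "data organised into named columns" idea behind DataFrames and Spark SQL can be illustrated with a few lines of plain Python (NOT the Spark SQL API; the `select`, `where` and `avg` helpers and the sample rows are invented for illustration). Each row is a mapping from column name to value, and queries are built from select, filter and aggregate operations:

    ```python
    # Conceptual sketch of DataFrame-style operations over named columns.
    rows = [
        {"name": "alice", "dept": "eng", "salary": 100},
        {"name": "bob",   "dept": "eng", "salary": 90},
        {"name": "carol", "dept": "ops", "salary": 80},
    ]

    def select(rows, *cols):
        # Keep only the named columns, like SELECT name FROM ...
        return [{c: r[c] for c in cols} for r in rows]

    def where(rows, pred):
        # Keep only matching rows, like a WHERE clause.
        return [r for r in rows if pred(r)]

    def avg(rows, col):
        # Aggregate a single column, like AVG(salary).
        return sum(r[col] for r in rows) / len(rows)

    eng = where(rows, lambda r: r["dept"] == "eng")
    print(select(eng, "name"))  # [{'name': 'alice'}, {'name': 'bob'}]
    print(avg(eng, "salary"))   # 95.0
    ```

    Spark SQL performs the same kinds of operations, but over distributed data and with a query optimizer that plans how the work is split across executors.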


    Use Cases of Apache Spark

    As we have seen, Apache Spark comes with many useful built-in libraries, so some of its common use cases are as follows -

    1. Iterative Computations
    2. High throughput stream data processing
    3. Machine learning
    4. Interactive data analysis
    5. Graph processing



    Thank you for reading through the tutorial. In case of any feedback, questions or concerns, you can share them with us through your comments and we shall get back to you as soon as possible.
