This article introduces you to Apache Storm, a real-time distributed processing / computing framework for big data, by providing details of its technical architecture along with the use cases it could be utilized in.
Apache Storm, in simple terms, is a distributed framework for real time processing of Big Data like Apache Hadoop is a distributed framework for batch processing. Apache Storm works on task parallelism principle where in the same code is executed on multiple nodes with different input data.
Apache Storm does not have any state managing capabilities. It instead utilizes Apache ZooKeeper to manage its cluster state such as message acknowledgements, processing status etc. This enables Storm to start right from where it left even after the restart.
Since Storm's master node (called Nimbus) is a Thrift service, one can create and submit processing logic graph (called topology) in any programming language. Moreover, It is scalable, fault-tolerant and guarantees that input data will be processed.
Here is the architecture diagram depicting the technical architecture of Apache Storm -
There are following two types of nodes services shown in above diagram -
- Nimbus Service on Master Node - Nimbus is a daemon that runs on the master node of Storm cluster. It is responsible for distributing the code among the worker nodes, assigning input data sets to machines for processing and monitoring for failures.
Nimbus service is an Apache Thrift service enabling you to submit the code in any programming language. This way, you can always utilize the language that you are proficient in, without the need of learning a new language to utilize Apache Storm.
Nimbus service relies on Apache ZooKeeper service to monitor the message processing tasks as all the worker nodes update their tasks status in Apache ZooKeeper service.
- Supervisor Service on Worker Node - All the workers nodes in Storm cluster run a daemon called Supervisor. Supervisor service receives the work assigned to a machine by Nimbus service. Supervisor manages worker processes to complete the tasks assigned by Nimbus. Each of these worker processes executes a subset of topology that we will talk about next.
And here are the importand high level components that we have in each Supervisor node.
- Topology - Topology, in simple terms, is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes. A Topology typically runs distributively on multiple workers processes on multiple worker nodes.
- Spout - A Topology starts with a spout, source of streams. A stream is made of unbounded sequence of tuples. A spout may read tuples off a messaging framework and emit them as stream of messages or it may connect to twitter API and emit a stream of tweets.
In the above technical architecture diagram, a topology is shown with two spouts (source of streams).
- Bolt - A Bolt represents a node in a topology. It defines smallest processing logic within a topology. Output of a bolt can be fed into another bolt as input in a topology.
In the above technical architecture diagram, a topology is shown with five bolts to process the data coming from two spouts.
Apache Storm comes with the following benefits to provide your application with a robust real time processing engine -
- Scalable - Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm's usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
- Guarantees no data loss - A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.
- Extremely robust - Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
- Fault-tolerant - If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
- Programming language agnostic - Robust and scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
Apache Storm, being a real-time processing framework, is suitable for extremely broad set of use cases some of which are as follows -
- Stream processing - Processing messages and updating databases
- Continuous computation - Doing a continuous query on data streams and streaming the results into clients
- Distributed RPC - Parallelizing an intense query like a search query on the fly
- Real-time analytics
- Online machine learning
You can find the information about which all companies are using Apache Storm here
Thank you for reading through the tutorial. In case of any feedback/questions/concerns, you can communicate same to us through your comments and we shall get back to you as soon as possible.