Data is being generated in truly massive amounts each day. Recent estimates claim that 2.7 zettabytes of data exist in the digital universe today (big-data-interesting-facts). This data is produced by a whole host of emerging mobile technologies, Internet of Things (IoT) sensors, social media, and web activity histories. Making sense of and using this data has become a key objective for even the most modest business, helping it make critical decisions. In today's business world, what was once secondary data has become the primary data driving business growth.
Business and big data streams
But the real challenge of big data is not storing data or analyzing it with machine learning; it is channeling all of this unstructured data from its source to multiple downstream analysis engines. To handle the large data throughput, big data software pipelines must run on distributed cluster platforms, adding complexity to an already difficult problem.
Traditional ETL (extract, transform, and load) database solutions do not scale easily and are not well suited to handling unstructured, real-time streaming data. This brief article looks at the key challenges that limit data streaming and at how modern streaming tools were designed to tackle them.
Key challenges of modern big data
There are four fundamental challenges that limit big data streaming:
- Reliability and operability: this refers to whether the data can be trusted to be accurate, non-redundant, and of sufficient quality. Systems must intelligently move large amounts of data between endpoints without error.
- Scalability and performance: data volumes will only increase, and rapidly. Big data infrastructures are based on distributed computing, so any solution must scale naturally and handle issues such as fault tolerance and load balancing across the cluster.
- Maintainability: since data comes from multiple sources and different types of devices, the challenge is to handle both present complexity and unknown future sources.
- Temporal persistence: data arrives sequentially and asynchronously and must be persisted, so that events can also be accessed at arbitrary points in time; a corollary is automatic throttling, i.e., adapting dynamically to handle peak data loads.
Several data streaming solutions, such as Google Cloud Dataflow, Storm, Samza, and Spark, have emerged to handle these problems. They are loosely based on an old idea: asynchronous event messaging.
Messaging-based systems for stream processing
Event messaging is a natural way to deal with sequential stream processing. Traditional message queues, such as ActiveMQ or RabbitMQ, can provide reliability and scalability; however, they lack temporal persistence and are not easily maintained. Log aggregation systems (Flume, Scribe) have commonly been used for the temporal persistence of events.
Inspired by messaging and logging systems, Kafka is a low-latency distributed messaging system that acts as an event ledger, designed specifically for distributed platforms. It is based on a publish/subscribe metaphor and can handle near real-time asynchronous data streaming. Because it is a pub/sub system, it supports back-pressure and reactive programming: it essentially acts as a buffer for incoming data streams, solving the temporal persistence problem by inherently throttling the data until consumers can pull it for downstream processing.
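As a minimal sketch of the publish side of this model, using Kafka's Java producer client (the broker address localhost:9092 and the page-views topic are placeholder assumptions for illustration):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the "page-views" topic. The broker appends
            // it to a persisted log, so the producer never waits for consumers;
            // this decoupling is what gives Kafka its buffering behavior.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
        }
    }
}
```

The producer only talks to the broker, never to consumers; the persisted log is what absorbs bursts when downstream processing falls behind.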
A typical use case for a big data streaming service is processing website activity. For example, whenever a page is loaded, a view event is sent to the messaging system and then processed through multiple downstream channels: storing the message for future analysis, triggering alerts, sending email notifications, processing user profile information, etc. A practical use can be found here.
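A sketch of one such downstream channel, again using Kafka's Java client (the analytics group name and topic are assumptions carried over from the producer example above):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each downstream channel (analytics, alerting, email, ...) subscribes
        // under its own group.id, so each one independently receives the full stream.
        props.put("group.id", "analytics");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // poll() pulls events at the consumer's own pace; because the
                // broker retains the log, a slow consumer simply lags behind
                // rather than losing data.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("user=%s page=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Running several such consumers with different group.id values fans the same event stream out to every downstream channel without any coordination between them.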
The emerging ecosystem
The traditional boundaries between messaging, log aggregation, and streaming data platforms are increasingly blurred. A mature and exciting ecosystem of scalable distributed software platforms has emerged that will push big data into the future.