Fast Data Architectures for Streaming Applications, 2nd Edition

Fast Data Architectures for Streaming Applications, Second Edition is a free report co-published by Lightbend and O'Reilly on the architectural characteristics of highly available, resilient, scalable, and responsive systems for data stream processing at scale. The first edition was published in October 2016 and the second edition in October 2018.

The book provides an overview of the common requirements for reliable streaming systems, based on common use cases for streaming, such as serving machine learning models in a streaming context. Other requirements include the need to handle potential data loss, duplicated data, and late-arriving data. Standard system concerns, the so-called reactive principles, also matter for such systems, where services and applications must run reliably for weeks, months, even years. If you run anything long enough, it will encounter every rare anomaly: hardware failures, network partitions, traffic spikes, and so on. Hence, streaming systems have more demanding operational requirements than shorter-lived batch processes.
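As a rough illustration of two of these requirements, the sketch below classifies incoming events as duplicates, late arrivals, or accepted records. It is a minimal example under stated assumptions, not a technique taken from the report: the `Event` fields, the `DedupLateFilter` class, and the `allowed_lateness` parameter are all hypothetical names, and a production system would also need to expire old IDs rather than keep them forever.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str    # hypothetical unique ID assigned by the producer
    event_time: int  # when the event happened, in seconds
    value: float


@dataclass
class DedupLateFilter:
    allowed_lateness: int  # seconds an event may trail the newest event seen
    seen_ids: set = field(default_factory=set)  # grows without bound in this sketch
    max_event_time: int = 0  # newest event time observed so far

    def classify(self, event: Event) -> str:
        """Return 'duplicate', 'late', or 'accepted' for one incoming event."""
        if event.event_id in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(event.event_id)
        # Events older than this threshold are treated as too late to process.
        watermark = self.max_event_time - self.allowed_lateness
        if event.event_time < watermark:
            return "late"
        self.max_event_time = max(self.max_event_time, event.event_time)
        return "accepted"


stream = [
    Event("a", 100, 1.0),
    Event("b", 105, 2.0),
    Event("a", 100, 1.0),  # redelivered copy of "a"
    Event("c", 90, 3.0),   # more than 10 s behind the newest event
    Event("d", 98, 4.0),   # out of order, but within the allowed lateness
]

late_filter = DedupLateFilter(allowed_lateness=10)
results = [(e.event_id, late_filter.classify(e)) for e in stream]
print(results)
```

What to do with a late event (drop it, route it to a side channel, or reopen an earlier result) is an application decision that this sketch leaves open.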

Apache Kafka is the messaging backbone of these architectures, providing high scalability and reliability for ingesting data, which it organizes into topics whose partitions hold ordered records, similar to conventional message queues.
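Kafka topics are split into partitions, and records keep their order within each partition. The toy model below sketches that idea in memory. It is not the Kafka client API: `ToyLog`, `send`, and `read` are hypothetical names, and CRC32 is used only as a simple stand-in for a key-hashing partitioner.

```python
import zlib
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a topic-based log; not the Kafka API."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        # topic name -> list of partitions, each an append-only list of records
        self.topics = defaultdict(lambda: [[] for _ in range(num_partitions)])

    def send(self, topic: str, key: str, value: str) -> tuple:
        """Append a record and return its (partition, offset)."""
        # Records with the same key always hash to the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        records = self.topics[topic][partition]
        records.append((key, value))
        return partition, len(records) - 1

    def read(self, topic: str, partition: int, offset: int = 0) -> list:
        """Return records from `offset` onward; reading does not remove them."""
        return self.topics[topic][partition][offset:]


log = ToyLog(num_partitions=3)
first = log.send("clicks", "user-1", "page-a")
second = log.send("clicks", "user-1", "page-b")

# Both records share a key, so they share a partition and keep their order.
print(log.read("clicks", first[0]))
```

Unlike a conventional queue, reading here does not consume the records, so several consumers can each track their own offset into the same partition.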

The data can then be processed by one or more stream processors, including the following four, which decompose into two groups:

Streaming Services:
  • Apache Spark, for a rich variety of processing options, including SQL, using a minibatch processing model.
  • Apache Flink, which offers low-latency (vs. minibatch) processing with rich semantics for reliable processing.
Streaming Libraries for Microservices:
  • Akka Streams, which offers very low-latency event processing with rich integrations with other systems.
  • Kafka Streams, for low-latency processing of Kafka topics.
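
The difference between minibatch and record-at-a-time processing mentioned in the list above can be sketched with two small functions that compute the same running total. This is only an illustration under simplifying assumptions: the function names are hypothetical, none of the listed tools' APIs are used, and batches here are formed by record count, whereas real minibatch engines typically form them by time interval.

```python
from typing import Iterable, Iterator, List


def per_record(events: Iterable[int]) -> Iterator[int]:
    """Emit an updated running total after every single event."""
    total = 0
    for event in events:
        total += event
        yield total


def minibatch(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Emit an updated running total once per batch of `batch_size` events."""
    total = 0
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            total += sum(batch)
            batch.clear()
            yield total
    if batch:  # flush the final partial batch
        total += sum(batch)
        yield total


events = [1, 2, 3, 4, 5]
per_record_totals = list(per_record(events))
minibatch_totals = list(minibatch(events, batch_size=3))
print(per_record_totals)  # one result per event
print(minibatch_totals)   # one result per batch
```

Both approaches reach the same final total; the minibatch version simply reports results less often, which is the latency cost paid for processing records in groups.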

Services like Spark and Flink do a lot of heavy lifting for you, such as automatic data partitioning and task management across the cluster, but they impose some overhead and require that your application fit their programming and execution model. Libraries like Akka Streams and Kafka Streams provide much more flexibility, including lower latency, but they don't provide the automatic partitioning and task management that Spark and Flink do.

Rounding out the picture are tools for building other microservices, such as the Lightbend Platform, which includes management and monitoring tools.
