Reactive Microservices — 3
This is an 8 part series on reactive microservices. [1 , 2 , 3, 4, 5, 6, 7, 8]
Part 2 of this series introduced reactive microservices patterns. In this part, we continue with the patterns.
Streams & Stream Processing
Streams are simply a flow of data objects from one function to the next, within the same JVM or over the network. Data streams in reactive microservices can be bounded or unbounded, depending on the source of the stream.
Bounded Streams
Within a JVM, streams are the flow of objects from one async process to another in a pipeline (also referred to as the collection pipeline pattern — https://martinfowler.com/articles/collection-pipeline/). A bounded stream has a fixed size and unchanging data, and thus predictable time and execution. This kind of data stream results from a collection, a database query, or a data-pump that sends a finite dataset to a service. In the context of reactive microservices, the use of parallel processing in bounded streams is relevant. Stream processing is coupled with functional-style features like lambda functions, enabling stream data to be processed sequentially or in parallel, depending on the nature of the task.
Example: Java 8 java.util.stream.Stream
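As a minimal sketch, a bounded dataset can be run through a collection pipeline sequentially or in parallel with the Java 8 Stream API (the class and method names below are illustrative, not from the original):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Minimal sketch: a bounded stream from a fixed collection, processed
// through a collection pipeline, sequentially or in parallel.
public class BoundedStreamExample {

    // Sum of squares of the even numbers in a fixed, bounded dataset.
    static int sumOfEvenSquares(List<Integer> data) {
        return data.stream()
                   .filter(n -> n % 2 == 0)   // intermediate op
                   .mapToInt(n -> n * n)      // intermediate op
                   .sum();                    // terminal op
    }

    // The same pipeline run in parallel; safe because the ops are stateless.
    static int sumOfEvenSquaresParallel(List<Integer> data) {
        return data.parallelStream()
                   .filter(n -> n % 2 == 0)
                   .mapToInt(n -> n * n)
                   .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());
        System.out.println(sumOfEvenSquares(data));          // 4+16+36+64+100 = 220
        System.out.println(sumOfEvenSquaresParallel(data));  // same result, computed in parallel
    }
}
```

Because the source is finite and fixed, the pipeline has a predictable end, and switching between sequential and parallel execution is a one-call change.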
Unbounded Streams
An unbounded stream is unpredictable and infinite: data creation is a never-ending cycle. Such streams do not terminate; they provide data as and when it is generated and thus must be processed continuously. Depending on the generation semantics, the data that lands in the stream can be unpredictable. Unbounded stream processing doesn’t wait for all data to arrive; data is processed as a reaction to its generation.
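As a minimal sketch of the unbounded case, Java's Stream.generate produces a conceptually infinite source (the "readings" source here is hypothetical); the pipeline reacts element by element and must bound the computation itself:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Minimal sketch: an unbounded (infinite) stream. Stream.generate never
// terminates on its own; consumers process elements as they are produced
// and must bound the computation themselves (here with limit()).
public class UnboundedStreamExample {

    // A conceptually infinite source of sensor-style readings (hypothetical).
    static Stream<Integer> readings() {
        AtomicInteger seq = new AtomicInteger();
        return Stream.generate(seq::incrementAndGet); // never-ending source
    }

    // Processing reacts to each element; limit() makes this run finite.
    static long countOfFirstHundredEvens() {
        return readings()
                .filter(n -> n % 2 == 0)
                .limit(100)   // without this, the pipeline would never finish
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countOfFirstHundredEvens()); // 100
    }
}
```

The key contrast with the bounded case: termination is a decision the consumer makes, not a property of the source.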
Event-driven microservices use events as first-class objects, both for modelling the domain and for loosely coupled integration. One way to handle these events is via a message queue (e.g. JMS MQ, RabbitMQ, AMQP), where the producer PUTs the message and a listener GETs it. Streaming designs provide this integration as well, but add the ability to see data as a stream that has a continuum and can be traversed back and forth. Instead of treating data as messages to be picked up for individual computation, streaming platforms enable complex operations across multiple input streams and records, like joins and aggregations.

A messaging system deletes messages after they are successfully acknowledged by all subscribing consumers. If the MQ broker doesn’t receive the acknowledgement due to a failure, it attempts to redeliver the message, but there is no ordering guarantee and messages can be delivered out of order. While this may be acceptable for queue scenarios where ordering doesn’t matter, for an event stream that describes state-change facts in an application, establishing the temporal ordering of events is vital.
MQ is a PUSH model (MQ container service → listener). Streaming platforms simply maintain a traversable stream of data ordered by time and keep appending new data at the end of the stream. Interested consumers poll the stream for data and use their individual offsets to traverse it. Streaming is thus a PULL model (polling consumer → stream partition). A newly added consumer will not see old MQ messages, but it can traverse the message stream and see the complete history. Streams allow more complex applications. Stream processors run on the stream to work on its data; they are usually the APIs that poll the stream middleware and provide the producer and consumer streaming pipeline. The stream consumer is also a subscriber, as in MQ-based pub-sub, to a specific set of streaming “topics”, but the subscription is inverted, i.e. the streaming middleware doesn’t know anything about the subscribing consumers.
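The PULL model above can be sketched in a few lines of plain Java (this is a toy in-memory log, not a real broker): the log only appends and knows nothing about its consumers, while each consumer owns its offset and can traverse the full history.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (not a real broker) of the PULL model: an append-only,
// time-ordered log that knows nothing about its consumers; each consumer
// tracks its own offset and polls the log to traverse history.
public class PullStreamSketch {

    static class Log {
        private final List<String> entries = new ArrayList<>();

        void append(String event) { entries.add(event); }  // producers only append at the end

        List<String> poll(int offset, int max) {           // consumers pull from their offset
            if (offset >= entries.size()) return List.of();
            return entries.subList(offset, Math.min(entries.size(), offset + max));
        }
    }

    static class Consumer {
        private int offset = 0;                            // owned by the consumer, not the log

        List<String> pollFrom(Log log) {
            List<String> batch = log.poll(offset, 10);
            offset += batch.size();
            return batch;
        }
    }

    public static void main(String[] args) {
        Log log = new Log();
        log.append("OrderCreated");
        log.append("PaymentReceived");

        Consumer early = new Consumer();
        System.out.println(early.pollFrom(log)); // [OrderCreated, PaymentReceived]

        // A consumer added later still sees the complete history.
        Consumer late = new Consumer();
        System.out.println(late.pollFrom(log));  // [OrderCreated, PaymentReceived]
    }
}
```

Note the inversion: the log never pushes and holds no subscriber list; retention and ordering live in the log, while position lives in the consumer.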
To standardize the streaming pattern in reactive systems, the Reactive Streams initiative (http://www.reactive-streams.org/) provides that standard. The standard also defines non-blocking backpressure to enable self-healing in stream processors (subscribers) when they cannot process messages at the same rate as the publisher. The process of restricting the number of items that a subscriber is willing to accept (as decided by the subscriber itself) is called backpressure and is essential in preventing the subscriber from being overloaded (pushed more items than it can handle).
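A minimal sketch of subscriber-driven backpressure, assuming Java 9's Flow API (which mirrors the Reactive Streams interfaces): the subscriber calls request(1) only after finishing the previous item, so it sets the rate.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of non-blocking backpressure with Java 9's Flow API:
// the subscriber decides the rate by calling request(1) only after it
// has finished processing the previous item.
public class BackpressureSketch {

    static int consumeAll(int itemCount) {
        CountDownLatch done = new CountDownLatch(1);
        AtomicInteger processed = new AtomicInteger();

        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<Integer>() {
                private Flow.Subscription subscription;

                public void onSubscribe(Flow.Subscription s) {
                    subscription = s;
                    s.request(1);                 // ask for ONE item: backpressure
                }
                public void onNext(Integer item) {
                    processed.incrementAndGet();  // "slow" processing would go here
                    subscription.request(1);      // pull the next item only when ready
                }
                public void onError(Throwable t) { done.countDown(); }
                public void onComplete() { done.countDown(); }
            });

            for (int i = 0; i < itemCount; i++) {
                publisher.submit(i);              // blocks if the buffer is saturated
            }
        } // close() signals onComplete once all items are delivered

        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(consumeAll(50)); // 50
    }
}
```

The publisher never outruns the subscriber: demand flows upstream via request(n), and SubmissionPublisher applies its own flow control when its buffer fills.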
Many frameworks support the reactive streams specification and can be used interchangeably without depending on a specific implementation.
Some stream processing frameworks:
- Java 9: https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/Flow.html. Java 9’s Reactive Streams, aka the Flow API, is a set of interfaces implemented by various reactive streams libraries such as RxJava 2, Akka Streams, and Vert.x. They allow these reactive libraries to interconnect while preserving the all-important backpressure.
- Vert.x: Vert.x is an Eclipse Foundation project. It is a polyglot event-driven application framework for the JVM. https://vertx.io/docs/vertx-reactive-streams/java/
- ReactiveX (RXJava): https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0 | https://github.com/reactive-streams/reactive-streams-jvm/blob/v1.0.2/README.md
- Reactor (Pivotal): The reactive functionality found in Spring Framework 5 is built upon Reactor 3.0. https://projectreactor.io/
- Akka Streams: Library on top of the Akka toolkit and Reactive Streams. Akka Streams fully implements the Reactive Streams specification. Akka uses Actors to deal with streaming data. https://doc.akka.io/docs/akka/2.5/stream/stream-introduction.html
- Spring Cloud Stream (https://spring.io/projects/spring-cloud-stream) doesn’t implement the reactive streams specification but allows binder implementations for many MQ and streaming middleware options like RabbitMQ and Kafka (https://kafka.apache.org/documentation/streams/).
- Kafka Streams: Kafka Streams is a client library for building applications and microservices where the input and output data are stored in Kafka clusters. It doesn’t implement the reactive streams specification yet.
- Alpakka Kafka Connector: Connects Apache Kafka to Akka Streams, and is reactive streams compliant (https://github.com/akka/alpakka-kafka).
- Confluent KSQL (https://www.confluent.io/product/ksql/): the streaming SQL engine that enables real-time data processing against Apache Kafka.
- Axon: https://docs.axoniq.io/reference-guide/v/3.0/part-iii-infrastructure-components/event-processing
For both bounded and unbounded streams, stream processing APIs provide operations with functional interfaces to apply async computation to stream data. Parallelization and stream processing are powerful features that reactive microservices can leverage. Typical single-stream operations are: split, merge, ordering, windowing, filter, etc. These operations can be “pipelined” together.
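Several of these operations can be pipelined in one chain; a minimal sketch using plain Java streams (windowing is omitted, as the core Stream API has no built-in windowing operation):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch of "pipelined" stream operations with plain Java streams:
// merge two streams, filter, order, then split the result into two groups.
// Each step is a functional-interface-driven operation chained onto the last.
public class PipelineOpsSketch {

    static Map<Boolean, List<Integer>> mergeOrderAndSplit(List<Integer> a, List<Integer> b) {
        return Stream.concat(a.stream(), b.stream())         // merge
                     .filter(n -> n > 0)                     // filter
                     .sorted()                               // ordering
                     .collect(Collectors.partitioningBy(     // split into two groups
                             n -> n % 2 == 0));
    }

    public static void main(String[] args) {
        Map<Boolean, List<Integer>> result =
                mergeOrderAndSplit(List.of(3, -1, 2), List.of(5, 4));
        System.out.println(result.get(true));   // evens, ordered: [2, 4]
        System.out.println(result.get(false));  // odds, ordered:  [3, 5]
    }
}
```

The same merge/filter/order/split vocabulary carries over to unbounded-stream frameworks, which additionally offer time- or count-based windowing.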
Stream Processing Cases
Control Flow — Simple Event Consumption
Event-driven microservices consume events as a matter of control flow, i.e. to initiate the processing of the next applicable service to fulfil a user journey. The events replace synchronous calls between services.
Materialized Views on Streams
In event-driven systems, the event log becomes a source of truth and the system of record for events from multiple topics. While individual microservices listen to events from certain topics, a microservice can create a local materialized view of a topic, or “join” multiple topics to create a local materialized view of the combined streams, and run analytics, reporting, and searches on these. The sink of this materialized view can be another topic as well.
Stream("stream_1").join("stream_2").transform(() -> { /* join the two */ }).collect(() -> { /* sink it */ });
Event Enrichment
Depending on the event model, a coarse-grained event will not always carry all the data needed to process it. For example, in an e-Comm portal, when a user checks out from the cart and makes a payment, the completion page usually shows a list of recommended products built on a relevance model for that user. The relevance model is user-specific and carries a rich set of data that may include history, related items, wish-list, life-goal, etc. The payment event doesn’t need to carry all this user recommendation data in the “PaymentSuccessForOrderEvent”. So, the event needs to be enriched with user relevance data to emit something like “PaymentSuccessForOrderEventWithUserRelevanceData”. There are two choices:
- Query the user-profile service for each event received, over a synchronous REST call. (The option to query the user-profile tables directly is not even considered here.) This would mean a speed mismatch between a stream source and a network call, resulting in latency and contention.
- Build a local user-profile view and subscribe to user-profile change events. This STREAM-TABLE join is much faster than a service call. The “PaymentSuccessForOrderEvent” would carry a user identifier, which can be used to enrich the event with data from the local user-profile view.
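The second choice can be sketched in plain Java (the event names follow the hypothetical ones above; the streaming middleware is abstracted away): a local map plays the role of the TABLE, kept current by profile-change events, and each payment event is enriched from it without a network call.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of STREAM-TABLE join enrichment: a local user-profile
// view is kept up to date from profile-change events, and each payment
// event is enriched from that local view, not from a remote service.
public class EventEnrichmentSketch {

    // Local materialized view: userId -> relevance data (the "TABLE").
    private final Map<String, String> userRelevanceView = new HashMap<>();

    // Invoked for every user-profile change event on its topic.
    void onUserProfileChanged(String userId, String relevanceData) {
        userRelevanceView.put(userId, relevanceData);
    }

    // Invoked for every PaymentSuccessForOrderEvent (the "STREAM"):
    // enrich with the local view; no network call on the hot path.
    String onPaymentSuccess(String orderId, String userId) {
        String relevance = userRelevanceView.getOrDefault(userId, "none");
        return "PaymentSuccessForOrderEventWithUserRelevanceData{order=" + orderId
                + ", user=" + userId + ", relevance=" + relevance + "}";
    }

    public static void main(String[] args) {
        EventEnrichmentSketch service = new EventEnrichmentSketch();
        service.onUserProfileChanged("u1", "wishlist:books");
        System.out.println(service.onPaymentSuccess("o42", "u1"));
    }
}
```

In a real deployment the map would be an embedded store rebuilt from the event log on restart, as described in the partitioning discussion below.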
Both “Materialized View” and “Event Enrichment” can be optimized even further (for example, via Kafka partitions): each producer (the user service in this case) writes to a single partition and a consumer group reads from the same partition, so the local state for a subset of users is built in the consumer JVM. The user-profile data for a given user always resides in the same partition, so consumers can use an embedded database (like LevelDB, RocksDB, Apache Geode, or Pivotal GemFire) to maintain the local state. This state can be rebuilt from the event log at any time if a consumer goes down.
CDC (Data Pump)
A legacy database’s transaction log can be converted into a stream for consumption. CDC (Change Data Capture) enables streaming every single event from a database transaction log into a streaming platform. A few examples:
- https://debezium.io/
- https://www.oracle.com/middleware/data-integration/goldengate/big-data/
- https://docs.confluent.io/current/connect/index.html (the Kafka Connect JDBC source connector polls data sources over JDBC instead of reading transaction logs)
Messaging System Requirements
The messaging system that drives event streams can be chosen from a variety of providers. From a reactive microservices perspective, the messaging system needs to meet the following requirements:
Well-Defined Delivery Semantics
How does the system guarantee delivery of messages: At-Least-Once, Exactly-Once, At-Most-Once, Essentially-Once? Depending on these semantics, the microservices may need to handle messages in a certain way. For example, if the delivery semantics are At-Least-Once, then a microservice may receive the same message more than once and needs to be idempotent (https://medium.com/airbnb-engineering/avoiding-double-payments-in-a-distributed-payments-system-2981f6b070bb). With At-Most-Once, the services would have to deal with lost messages.
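A minimal sketch of an idempotent consumer for At-Least-Once delivery (in-memory only; a production version would keep the processed-id set in a durable store): every message carries a unique id, and redeliveries of an already-seen id are skipped, so the side effect happens once.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of an idempotent consumer under At-Least-Once delivery:
// each message carries a unique id; redeliveries of an already-processed
// id are detected and skipped, so the side effect happens exactly once.
public class IdempotentConsumerSketch {

    private final Set<String> processedIds = new HashSet<>(); // in production: a durable store
    private final AtomicInteger paymentsTaken = new AtomicInteger();

    // Returns true only when the message actually triggers the side effect.
    boolean handle(String messageId) {
        if (!processedIds.add(messageId)) {
            return false;                 // duplicate delivery: ignore
        }
        paymentsTaken.incrementAndGet();  // the real side effect would go here
        return true;
    }

    int paymentsTaken() { return paymentsTaken.get(); }

    public static void main(String[] args) {
        IdempotentConsumerSketch consumer = new IdempotentConsumerSketch();
        consumer.handle("msg-1");
        consumer.handle("msg-1");         // broker redelivered the same message
        System.out.println(consumer.paymentsTaken()); // 1
    }
}
```

The deduplication check and the side effect should be atomic in a real system (e.g. a unique-key insert in the same transaction as the business write), otherwise a crash between the two reintroduces duplicates.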
High-Throughput (Realtime)
The producer and consumer pipeline should be non-blocking.
High-Availability
Needs to provide an HA architecture so that the messaging system keeps working through partial failures. Multiple replicas of a topic distributed across the network ensure availability in the event of a network partition.
Durable
In the event of a failure, messages should be persisted and remain readable.
Scalable
The system should be able to grow its capacity over time; capacity includes both data and processing.
Backpressure
A slow consumer should be able to handle data from a fast producer. The consumer should not fail under the weight of messages coming from the producer.