
Event Stream Processing with Apache Kafka

As soon as you step out the door, you notice that the world never stands still. A car passes by, the traffic light turns red, and an alarm blares in the background. These are small and large events that occur thousands of times throughout the day. Online, too, there are continuous activities generating a significant amount of data. By using Apache Kafka for your event streaming, you not only gain visibility into all these events but can also detect risks and opportunities.

As a company, you typically don't use just one application, but a range of digital products and systems. Event streaming is a convenient way to synchronize your applications and business processes, so that all of those tools work from the same, consistent view of what happened. Think of an online shop, where every event is handled in order without losing any information along the way.

Event Streaming

The data captured in event streaming can come from various sources, such as databases, sensors, mobile devices, cloud services, and software applications.

You can process this stream of events in real time or after the fact, transforming it, aggregating it, or reacting to it. The stream can then be forwarded to a destination, such as a database, a data warehouse, a message queue, or another software application. Because the events can be persisted, you can also replay them later; it is advisable to store them in a lightweight format.
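Replaying persisted events is something the Kafka consumer API supports directly. The sketch below, with a hypothetical topic name and a local broker address as placeholders, rewinds a consumer to the beginning of its assigned partitions and reads the retained events again:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));       // hypothetical topic
            consumer.poll(Duration.ofSeconds(1));            // join the group so partitions get assigned
            consumer.seekToBeginning(consumer.assignment()); // rewind to the oldest retained event

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```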

Event streaming provides a continuous flow of data that can be interpreted in multiple ways, ensuring that information arrives at the right place at the right time.

What is event streaming used for?

Financial transactions, such as those on the stock market, in banks, or in insurance, often need to be processed in real time. Event streaming is a good fit here, allowing data to be handled securely and immediately.

Another relevant example is monitoring traffic by processing data in real-time, providing up-to-date insights into the state of traffic. The throughput of this data can be very high, as you are tracking thousands or millions of vehicles regularly transmitting their position and other data. You can also perform ad-hoc analyses on the data, creating insights after the fact.

Machines and sensors connected to the internet also provide an interesting source of data. You can analyze this data both ad-hoc and in real-time. For example, you could track the number of people in the rooms of a building and dispatch staff to improve flow where needed. You can also perform ad-hoc analyses to, for example, rearrange the building or place directional signs more effectively if people often follow the wrong route.

In webshops or physical stores, you can collect customer interactions and orders and potentially respond to them. For example, if a customer visits your webshop, adds something to their cart, and then leaves the site, you could send an email reminding the customer to complete their purchase. Also, when an order is placed, when payment is made, or when the package is shipped, event stream processing allows you to send an email.

Another example? Monitoring patients and predicting changes. Based on images and data, you can use event streaming to predict if there is an emergency situation and whether intervention is necessary. Because you can monitor almost in real-time with event streaming, this can be crucial for the patient's well-being.

Bringing together different departments of your company is made possible by connecting, storing, and making data available. With event streaming, you can let your data live on a data platform. The advantage is that you no longer have to produce your data multiple times and send it to different entities. You can simply produce it on your data platform and consume it multiple times.


Event Streaming Platforms

To successfully perform end-to-end event streaming, you need three key components:

  • The ability to publish (write) and subscribe to (read) streams of events, including continuous import and export of data from other systems.
  • Durable and reliable storage of event streams, for as long as you need them.
  • The ability to process streams of events as they occur, or retrospectively.

It is crucial that these components are implemented in a distributed, scalable, elastic, fault-tolerant, and secure manner. An example of such a platform is Apache Kafka, which has been in existence for over 10 years.

Kafka can be deployed on bare-metal hardware, virtual machines, and containers. It can run both on-premise and in the cloud. There are both self-managed and fully managed services available. At Axxes, we have experience with AWS MSK (Amazon Managed Streaming for Apache Kafka), among others.

With Kafka, you can seamlessly perform event streaming. Kafka is built around the three capabilities of an event streaming platform described above, so you can process large volumes of data without much extra plumbing. Furthermore, Kafka is decoupled from your applications, which can consume from and produce to the platform independently.

Key Concepts and Terminology

An event records the fact that "something happened" in the world or in your business. It is also called a record or message. Conceptually, an event has a key, value, timestamp, and optional metadata headers.
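In the Java client, these parts map onto a record you hand to the producer. The topic name, key, and value below are purely illustrative; only the structure matters:

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.internals.RecordHeader;

import java.nio.charset.StandardCharsets;
import java.util.List;

public class EventAnatomy {
    public static void main(String[] args) {
        // An event (record) has a key, a value, a timestamp and optional headers.
        ProducerRecord<String, String> event = new ProducerRecord<>(
                "payments",                     // topic
                null,                           // partition: left null so Kafka picks one based on the key
                System.currentTimeMillis(),     // timestamp
                "alice",                        // key, e.g. a customer id
                "Alice paid 200 euros to Bob",  // value: a description of what happened
                List.of(new RecordHeader("source", "webshop".getBytes(StandardCharsets.UTF_8))));

        System.out.println(event);
    }
}
```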

You will often hear the terms "producer" and "consumer" when it comes to event streaming. A producer is a client application that publishes events to Kafka, while a consumer is a client application that subscribes to these events. In Kafka, producers and consumers are completely decoupled from each other. This is a core principle of the platform, making it highly scalable.

Producers do not need to wait for consumers, and can therefore produce as much data as they want. Producers and consumers can also be scaled separately. Kafka provides various guarantees, such as the ability to process events exactly once.
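A minimal producer sketch is shown below; the broker address and topic name are placeholders. Enabling idempotence illustrates the delivery guarantees mentioned above (full end-to-end exactly-once additionally involves transactions and consumers reading committed data only). Note that the producer neither knows nor cares whether any consumer is currently reading:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotent delivery: retries cannot introduce duplicates on the broker side.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "order placed"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("written to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
            producer.flush();
        }
    }
}
```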

A topic is where events are stored and processed. You could compare a topic to a folder in a file system, with events being the files in that specific folder. Topics are always multi-producer and multi-subscriber. Events can be read from topics as often as necessary, as unlike traditional messaging systems, events are not deleted after consumption.

In Kafka, you define how long you want to retain your data, also known as retention, through per-topic configuration. This can be done by setting a time limit or a limit on the size of the data allowed on the topic. Kafka's performance remains constant even when storing data for a long time.
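Retention is set per topic, for example when the topic is created. The sketch below uses the Java AdminClient; the topic name, partition count, replication factor, and limits are illustrative, and the broker address is a placeholder:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: a typical production-style setup.
            NewTopic orders = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000),  // keep 7 days
                            TopicConfig.RETENTION_BYTES_CONFIG, String.valueOf(1_073_741_824L)));       // or cap at ~1 GiB per partition
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```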

Partitions are a component of a topic. A topic is partitioned, meaning it is spread across a number of "buckets" that reside on Kafka brokers. This distribution is important for scalability, allowing client applications to read from and write to multiple brokers simultaneously. When a new event is published, it is appended to one of the topic's partitions. Events with the same key are sent to the same partition, ensuring that any consumer of a topic partition will always read events with that specific key in the order they were written.
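The small sketch below, again with a placeholder broker and an illustrative topic, sends two records with the same key and reads back which partition each landed on. Because the default partitioner hashes the key to pick a partition, both end up on the same one:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SameKeySamePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records carry the key "vehicle-17", so the default partitioner routes them
            // to the same partition and consumers see them in the order they were produced.
            RecordMetadata first = producer.send(
                    new ProducerRecord<>("vehicle-positions", "vehicle-17", "lat=51.05,lon=3.72")).get();
            RecordMetadata second = producer.send(
                    new ProducerRecord<>("vehicle-positions", "vehicle-17", "lat=51.06,lon=3.73")).get();

            System.out.printf("partition %d and partition %d are the same%n",
                    first.partition(), second.partition());
        }
    }
}
```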

To make your data fault-tolerant and highly available, each topic can be replicated. This can be done across geo-regions or datacenters, ensuring that multiple brokers have a copy of your data in case something goes wrong or maintenance needs to be performed on the brokers. A typical production setting is a replication factor of three, meaning three copies of your data exist simultaneously. This replication occurs at the level of a topic partition.

How does Kafka work?

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.

The platform is deployed as a cluster of one or more servers that can span across multiple data centers or cloud regions. Some of the servers form the storage layer and are also known as brokers. Other servers are deployed as Kafka Connect, allowing them to continuously import and export data as event streams. Additionally, they can integrate Kafka with existing systems such as relational or NoSQL databases or other messaging systems, as well as other Kafka clusters.

Clients enable you to write distributed applications and microservices that read, write, and process event streams in a parallel, scalable, and fault-tolerant manner. Even in the event of a network problem or machine failure, Kafka guarantees that events are handled correctly.
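For that kind of continuous, client-side processing, Kafka ships the Kafka Streams library. Below is a minimal topology as a sketch: the topic names, the application id, and the filtering predicate are all illustrative, and the broker address is a placeholder. It reads every order event, keeps only the ones matching a condition, and writes them to an alerts topic:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class LargeOrderAlerts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-order-alerts"); // illustrative application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((customerId, order) -> order.contains("amount=high")) // illustrative predicate
              .to("large-order-alerts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The library handles parallelism (one stream task per input partition), fault tolerance, and rebalancing for you, which is exactly the kind of guarantee described above.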

In short, Kafka is an open-source distributed streaming system. It is easy to integrate with other applications and technologies, and thanks to Kafka's high throughput and scalability, it serves as an ideal backbone for your platform. The fact that large companies like Airbnb, LinkedIn, Netflix, The New York Times, and Microsoft rely on Kafka demonstrates that it is a robust and reliable system.


Jorgi Leus

Axxes