Best practices in (Kafka) event streaming consumers

A 7-minute story written in Dec 2022 by Adrian B.G.


Apache Kafka is the gold standard for building real-time data pipelines and streaming apps. Scalable, fault-tolerant and blazingly fast, Kafka gives countless organizations a unified, high-throughput, low-latency platform for processing massive data feeds.

The downside? While the platform solves many big data issues, it can be complex to install, set up and maintain. As such, it’s important for developers to consider not just what the technology can do when the platform is running well, but how the code will behave during failures.

In this post, we focus on solving client/consumer failures within Apache Kafka. These best practices will help engineers build code that maintains performance when consuming from a high-throughput pipeline and handles failures without creating bottlenecks or losing data.

This article was published on CrowdStrike’s Engineering Blog and is based on the company’s collective internal knowledge and the contributions of multiple CrowdStrike Cloud Engineers.

The best practices described here apply to any event streaming consumer, such as RabbitMQ or AWS SQS, and to other types of consumer patterns (not only distributed real-time topics).

Achieving a Resilient and Fault-tolerant Processing Pipeline During a Kafka Consumer Implementation

At CrowdStrike, our platform must ingest trillions of events per day, in real time and without data loss. To increase the throughput of the Kafka processing pipelines, CrowdStrike leverages the power of Go. This language allows us to build performant, lightweight microservices that can scale horizontally and use multi-core instances efficiently. The built-in concurrency mechanism that Go provides is particularly useful when the event processing is I/O bound.

Each of our consumers has a series of concurrent workers. To maximize efficiency and avoid “hot spotting,” or overloading one partition with events, we typically use a “round-robin” assignment approach, which distributes events to a pool of workers. Kafka partitions allow us to run multiple instances of the consumer (one per microservice instance), so when a batch of events is received from Kafka, the consumer “deals” events to the workers concurrently; each worker receives only one event at a time.
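
For illustration only (this is not CrowdStrike’s internal wrapper), here is a minimal Go sketch of that dealing pattern, assuming a hypothetical Event type and handle function:

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical placeholder for a decoded Kafka message.
type Event struct{ ID int }

// handle is a hypothetical per-event handler; its scope is a single message.
func handle(e Event) error {
	fmt.Println("processed", e.ID)
	return nil
}

// dealBatch distributes one batch of events to a fixed pool of workers.
// Each worker pulls exactly one event at a time from the shared channel.
func dealBatch(batch []Event, workers int) {
	events := make(chan Event)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for e := range events {
				// A failure here affects only this one event; the rest
				// of the batch keeps flowing.
				_ = handle(e)
			}
		}()
	}

	for _, e := range batch {
		events <- e
	}
	close(events)
	wg.Wait() // the whole batch is done before offsets are advanced
}

func main() {
	batch := []Event{{1}, {2}, {3}, {4}, {5}}
	dealBatch(batch, 3)
}
```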

Using this approach, we achieve two basic advantages:

  • Code simplicity: the code in the event handler is scoped to only one message, so the logic is straightforward and easy to test;
  • Granular error handling: the worker fails only one event (which the consumer will automatically retry/redrive) and the system can continue processing.

At CrowdStrike, we use Kafka to manage more than 2 trillion events a day that come from sensors all over the world. That equates to a processing time of just a few milliseconds per event. Our approach to concurrency and error handling, which helps us avoid mass failures, is incredibly important to the overall performance of our system and our customers’ security.

A Deep Dive Into Failure Processing in Kafka

For the purposes of this article, we will consider a failure to be any unsuccessful attempt to process a specific event. The reasons may vary, from dependency failures, such as database outages or timeouts, to malformed payloads. The solution that we choose for these issues will depend on the type of failure we are experiencing, as well as the goal of the system.

It is important to note that not all solutions apply to all use cases. For example, a system may sacrifice its accuracy by allowing a small portion of events to be lost in order to achieve a higher throughput. Another may need to process the events in a specific order, sacrificing throughput and also making it incompatible with a redrive (retry) system, which we will cover later in this article.

When considering these best practices, it is important to take into account the goals of the code and ensure that the solution does not exacerbate the very problem the team is trying to solve. Generally speaking, our team experiences two basic failure types: partial failures and outages. What follows is an overview of our most common recovery techniques.

Partial failures

Partial failures occur when the system fails to process only a small portion of events. When writing the code for our internal systems that require high throughput, we created a set of rules that keep the events moving despite the massive flow.

One rule we created is a timeout. This is when we isolate individual events that are taking too long to process. When this happens, the system is designed to set the event aside and pick up the next message. The idea is that it is more important to maintain the performance of the overall system than to process every single event. Timeouts can be applied at the batch level and/or to each individual event.
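
One way to implement such a per-event timeout in Go is to wrap the handler in a context deadline and move on when it expires. This is a sketch under assumed names (Event, handler), not the exact mechanism in our wrapper:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Event is a hypothetical placeholder for a decoded Kafka message.
type Event struct{ ID int }

// handleWithTimeout runs a handler with a per-event deadline. If the handler
// is too slow, the event is set aside as failed and the worker moves on.
func handleWithTimeout(parent context.Context, e Event, timeout time.Duration,
	handler func(context.Context, Event) error) error {

	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- handler(ctx, e) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// Mark the event as failed (to be retried/redriven later) instead
		// of letting it hold up the rest of the batch.
		return fmt.Errorf("event %d timed out: %w", e.ID, ctx.Err())
	}
}

func main() {
	slow := func(ctx context.Context, e Event) error {
		select {
		case <-time.After(2 * time.Second): // simulate a rogue event
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	err := handleWithTimeout(context.Background(), Event{ID: 1}, 100*time.Millisecond, slow)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}
```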

The side effect of not having “rogue events” is predictable latency. This leads to a more deterministic throughput, which is an important property of any distributed system.

CrowdStrike’s microservices leverage the Golang driver built by Confluent (confluent-kafka-go and librdkafka) through our in-house wrapper, which implements all of the techniques mentioned in this article.

Another feature of our Kafka wrapper is the ability to retry specific messages that failed for any reason, including timeouts. The dispatcher that assigns events received from Kafka to the worker pool keeps a record of each message’s processing result and the number of retries that were attempted.

In our system, there are three levels of retries:

  • one is at “runtime,” which is done while the batch is in progress;
  • the second is using a “redrive system,” which identifies specific messages to be reprocessed later; and
  • the third is from cold storage.

In our system, the retries can be combined. For example, we allow three runtime retries for each redrive and a maximum of five redrives, which means we allow 15 retries in total. In practice, the runtime retries cover 99% of the failures. In the unlikely scenario of an event failing all 15 retries, it is removed from the pipeline altogether and set aside in a “dead letter queue” (which in our case is cold persistent storage), where it is subject to further retries and manual investigation.
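
To make the combined retry budget concrete, here is a hedged sketch of how a dispatcher might route a failed event; the limits (3 runtime retries, 5 redrives, so 3 × 5 = 15 attempts before the dead letter queue) mirror the example above, while the names are illustrative:

```go
package main

import "fmt"

// Hypothetical limits; the real wrapper’s configuration is not public.
const (
	maxRuntimeRetries = 3 // per redrive cycle, while the batch is in progress
	maxRedrives       = 5 // reprocessed later via the redrive system
)

type destination string

const (
	retryInline    destination = "runtime-retry"    // retry while the batch is in progress
	redriveTopic   destination = "redrive-topic"    // requeue for later reprocessing
	deadLetterCold destination = "dead-letter-cold" // cold storage + manual investigation
)

// routeFailure decides what happens to an event that just failed, based on
// how many runtime retries and redrives it has already consumed.
func routeFailure(runtimeRetries, redrives int) destination {
	switch {
	case runtimeRetries < maxRuntimeRetries:
		return retryInline
	case redrives < maxRedrives:
		return redriveTopic
	default:
		return deadLetterCold
	}
}

func main() {
	fmt.Println(routeFailure(1, 0)) // runtime-retry
	fmt.Println(routeFailure(3, 2)) // redrive-topic
	fmt.Println(routeFailure(3, 5)) // dead-letter-cold
}
```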

We call a “redrive system” a message queue that stores the same messages as the topic it serves. In CrowdStrike’s case, the redrive is usually a secondary Kafka topic. When a message fails, it is requeued in this other topic, and a resulting advantage is that the consumer can prioritize which events are more important (the new or the failed ones). To implement this kind of system, you can also use the same topic, or another messaging queue (like SQS) altogether. The redrive comes with two caveats: it adds code complexity (the consumer has to listen to two topics), and it is not compatible with consumers that need to process events in order.
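
A minimal sketch of requeueing a failed message to a secondary Kafka topic with confluent-kafka-go could look like the following; the topic name and the redrive-attempt header are assumptions, not the wrapper’s actual contract:

```go
package main

import (
	"log"
	"strconv"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// redrive republishes a failed message to a secondary topic so the consumer
// group can advance its offsets on the primary topic.
func redrive(p *kafka.Producer, failed *kafka.Message, attempt int) error {
	redriveTopic := "events-redrive" // assumed topic name

	return p.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &redriveTopic, Partition: kafka.PartitionAny},
		Key:            failed.Key,
		Value:          failed.Value,
		Headers: append(failed.Headers, kafka.Header{
			Key:   "redrive-attempt", // assumed header used to count redrives
			Value: []byte(strconv.Itoa(attempt)),
		}),
	}, nil)
}

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	// Example: a message that just failed processing is sent to the redrive topic.
	failed := &kafka.Message{Value: []byte(`{"id":42}`)}
	if err := redrive(p, failed, 1); err != nil {
		log.Printf("redrive failed: %v", err)
	}
	p.Flush(5000) // wait up to 5s for delivery
}
```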

The first type of retry is meant to fix “glitches” in the system, such as failed requests or timeouts. Retrying with exponential backoff should take just a few milliseconds. The second type of retry, which uses the redrive system, usually fixes temporary malfunctions of the consumer’s dependencies. The best example would be a database rejecting requests for several minutes.
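
A sketch of such a runtime retry with exponential backoff (the attempt limit and base delay are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff retries a handler up to maxAttempts times, doubling the
// delay between attempts (for example 1ms, 2ms, 4ms, ...).
func retryWithBackoff(maxAttempts int, baseDelay time.Duration, handler func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = handler(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			time.Sleep(baseDelay << attempt) // exponential backoff
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(3, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient glitch") // e.g. a failed request
		}
		return nil
	})
	fmt.Println(err, "after", calls, "calls") // <nil> after 3 calls
}
```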

For readers who do not know how Kafka offsets work, here is a brief overview of the most common scenario: a consumer group does not have the ability to acknowledge or retry a specific message from a Kafka topic. It can only advance its offsets, communicating to Kafka that all of those messages were processed. This means that the consumer group is always stuck at the oldest unprocessed message. By “redriving” the few failed events, we can acknowledge the messages from the topic, because they will return later through the redrive.

Pro tip: do not treat malformed events as failures, as this can lead to unwanted side effects, especially when throttling, which leads to our next topic: Outages.
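
The sketch below shows that interplay with confluent-kafka-go: read a small batch, send the failures to the redrive producer (stubbed here), then commit the offset of the last message so the group does not stay stuck behind failed events. The group id, topic name, batch size and helper functions are assumptions:

```go
package main

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// process and redriveToSecondaryTopic are hypothetical stand-ins for the real
// event handler and redrive producer.
func process(m *kafka.Message) error                 { return nil }
func redriveToSecondaryTopic(m *kafka.Message) error { return nil }

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092",
		"group.id":           "example-consumer", // assumed group id
		"enable.auto.commit": false,              // offsets are committed manually, per batch
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"events"}, nil); err != nil { // assumed topic
		log.Fatal(err)
	}

	for {
		var last *kafka.Message
		for i := 0; i < 100; i++ { // read up to 100 messages per batch (illustrative)
			msg, err := c.ReadMessage(100 * time.Millisecond)
			if err != nil {
				break // timeout or transient error: process what we already have
			}
			if procErr := process(msg); procErr != nil {
				// Failed events go to the redrive topic instead of blocking
				// the offset; they will come back later.
				_ = redriveToSecondaryTopic(msg)
			}
			last = msg
		}
		if last != nil {
			// Committing the last message advances the offset on its partition,
			// acknowledging the failed events that were redriven above.
			if _, err := c.CommitMessage(last); err != nil {
				log.Printf("commit failed: %v", err)
			}
		}
	}
}
```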

Outages

Continue reading and find out how to deal with consumer downstream outages on CrowdStrike’s Engineering Blog.

More articles

You may be interested to find out how CrowdStrike stores over 40 petabytes of data and handles trillions of events per day while routinely serving upward of 70 million requests per second.

Thanks! 🤝

Please share the article, subscribe or send me your feedback so I can improve the following posts!
