Operating Kafka in Kubernetes with Strimzi
Mar 7, 2023
Kubernetes is not the first platform that comes to mind for running Apache Kafka clusters. Indeed, Kafka’s strong dependency on storage can clash with the way Kubernetes handles persistent storage. Kafka brokers are unique and stateful: how can we implement this in Kubernetes?
Let’s go through the basics of Strimzi, a Kafka operator for Kubernetes curated by Red Hat, and see what problems it solves.
A special focus is put on how to plug additional Kafka tools into a Strimzi installation.
We also compare Strimzi with other Kafka operators, listing their pros and cons.
Strimzi
Strimzi is a Kubernetes Operator that aims to reduce the cost of deploying Apache Kafka clusters on cloud native infrastructures.
As an operator, Strimzi extends the Kubernetes API with custom resources to natively manage Kafka components, including:
- Kafka clusters
- Kafka topics
- Kafka users
- Kafka MirrorMaker2 instances
- Kafka Connect instances
The project is currently at the “Sandbox” stage at the Cloud Native Computing Foundation.
Note: The CNCF website defines a “sandbox” project as “Experimental projects not yet widely tested in production on the bleeding edge of technology.”
With Strimzi, deploying a 3-broker, TLS-encrypted cluster is as simple as applying the following YAML file:
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    version: 3.2.3
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.2"
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
        - id: 1
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}
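Assuming the manifest above is saved as kafka.yaml and the Strimzi operator is already watching the kafka namespace, deploying the cluster and waiting for it to become ready takes two commands:

# Create the Kafka cluster and block until Strimzi reports it Ready
kubectl apply -f kafka.yaml -n kafka
kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n kafka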
A topic looks like this:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 1
  replicas: 1
  config:
    retention.ms: 7200000
    segment.bytes: 1073741824
Both of these examples are from the examples directory of the Strimzi operator. This directory includes many more examples covering all of Strimzi’s capabilities.
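Users follow the same declarative pattern. As a minimal sketch, a KafkaUser with SCRAM-SHA-512 credentials managed by the User Operator could look like this (the user name and ACL are illustrative):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: my-user
  labels:
    strimzi.io/cluster: my-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Illustrative ACL: allow reading the my-topic topic
      - resource:
          type: topic
          name: my-topic
          patternType: literal
        operation: Read
        host: "*"

The User Operator then creates a Kubernetes Secret named after the user, containing the generated password.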
Security
An interesting feature of Strimzi is its out-of-the-box security. By default, inter-broker communication is encrypted with TLS, while communication with ZooKeeper is both authenticated and encrypted with mTLS.
The Apache ZooKeeper ensembles backing the Kafka instances are not exposed outside of the Kubernetes cluster, providing additional security.
These configurations are impossible to override, though it is possible to access ZooKeeper using a tweak project by scholzj.
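While these internal settings cannot be changed, client-facing listeners are fully configurable. For instance, the SCRAM-SHA-512 authentication used later in this article only requires an authentication block on a listener; here is a minimal sketch to merge into the Kafka spec shown earlier (listener name and port are arbitrary):

listeners:
  # Additional internal listener with SCRAM-SHA-512 authentication
  - name: scram
    port: 9094
    type: internal
    tls: false
    authentication:
      type: scram-sha-512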
Strimzi PodSets
Kubernetes comes with its own solution for managing distributed stateful applications: StatefulSets.
The official documentation states:
(StatefulSets) manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
While StatefulSets have the benefit of being Kubernetes-native resources, they have some limitations.
Here are a few examples:
- Scaling up and down is linear. If you have a StatefulSet with three pods (pod-1, pod-2, pod-3), scaling up creates pod-4 and scaling down can only delete pod-4. This is an issue when you want to eliminate a specific pod of your deployment. Applied to Kafka, a bad topic can make a broker unstable; with StatefulSets you cannot delete that particular broker and spin up a fresh one in its place.
- All the pods share the same specs (CPU, memory, number of PVCs, etc.)
- Critical node failure requires manual intervention
These limitations were addressed by the Strimzi team by developing their own resource: the StrimziPodSet, introduced in Strimzi 0.29.0.
The benefits of using StrimziPodSets include:
- Scaling up and down is more flexible
- Per broker configuration
- Opens the gate for broker specialization once ZooKeeper-less Kafka is GA (KIP-500, more on this topic later in the article)
A drawback of using StrimziPodSets is that the Strimzi Operator instance becomes critical.
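Note that StrimziPodSets started out behind a feature gate. Depending on your Strimzi version, it may need to be enabled explicitly (recent releases enable it by default) through the Cluster Operator’s environment:

# Excerpt from the Cluster Operator Deployment
env:
  - name: STRIMZI_FEATURE_GATES
    value: "+UseStrimziPodSets"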
If you want to hear more about the Strimzi PodSets, feel free to watch the StrimziPodSets - What it is and why should you care? video by Jakub Scholz.
Deploying Strimzi
Strimzi’s Quickstart documentation is complete and functional, so there is no need to repeat it here.
The rest of this article focuses on practical concerns that the Strimzi documentation does not cover.
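For reference, installing the operator boils down to a couple of commands (here targeting a kafka namespace):

# Install the latest Strimzi Cluster Operator, watching the kafka namespace
kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka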
Kafka UI on top of Strimzi
Strimzi brings a lot of comfort to users when it comes to managing Kafka resources in Kubernetes. We wanted to bring something to the table by showing how to deploy a Kafka UI on top of a Strimzi cluster as a native Kubernetes resource.
There are multiple open source Kafka UI projects on GitHub.
Let’s go with Kafka UI, which has the cleanest interface (in our opinion) among the competition.
The project provides official Docker images as we can see in the documentation. We will leverage this image and deploy a Kafka UI instance as a Kubernetes deployment.
The following YAML is an example of a Kafka UI instance configured against a SCRAM-SHA-512 authenticated Strimzi Kafka cluster. The UI itself authenticates its users against an OpenLDAP server via ldaps.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-kafka-ui
  namespace: kafka
spec:
  selector:
    matchLabels:
      app: cluster-kafka-ui
  template:
    metadata:
      labels:
        app: cluster-kafka-ui
    spec:
      containers:
        - image: provectuslabs/kafka-ui:v0.4.0
          name: kafka-ui
          ports:
            - containerPort: 8080
          env:
            - name: KAFKA_CLUSTERS_0_NAME
              value: "cluster"
            - name: KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS
              value: "cluster-kafka-bootstrap:9092"
            - name: KAFKA_CLUSTERS_0_PROPERTIES_SECURITY_PROTOCOL
              value: SASL_PLAINTEXT
            - name: KAFKA_CLUSTERS_0_PROPERTIES_SASL_MECHANISM
              value: SCRAM-SHA-512
            - name: KAFKA_CLUSTERS_0_PROPERTIES_SASL_JAAS_CONFIG
              value: 'org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="XSnBiq6pkFNp";'
            # LDAP auth
            - name: AUTH_TYPE
              value: LDAP
            - name: SPRING_LDAP_URLS
              value: ldaps://myldapinstance.company:636
            - name: SPRING_LDAP_DN_PATTERN
              value: uid={0},ou=People,dc=company
            - name: SPRING_LDAP_ADMINUSER
              value: uid=admin,ou=Apps,dc=company
            - name: SPRING_LDAP_ADMINPASSWORD
              value: Adm1nP@ssw0rd!
            # Custom truststore for ldaps
            - name: JAVA_OPTS
              value: "-Djdk.tls.client.cipherSuites=TLS_RSA_WITH_AES_128_GCM_SHA256 -Djavax.net.ssl.trustStore=/etc/kafka-ui/ssl/truststore.jks"
          volumeMounts:
            - name: truststore
              mountPath: /etc/kafka-ui/ssl
              readOnly: true
      volumes:
        - name: truststore
          secret:
            secretName: myldap-truststore
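The Deployment alone is not reachable from outside the pod network. A minimal Service matching the labels above exposes the UI:

apiVersion: v1
kind: Service
metadata:
  name: cluster-kafka-ui
  namespace: kafka
spec:
  selector:
    app: cluster-kafka-ui
  ports:
    # Kafka UI listens on 8080 by default
    - port: 8080
      targetPort: 8080

For a quick test, kubectl port-forward svc/cluster-kafka-ui 8080:8080 -n kafka then makes the UI available on localhost:8080.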
Note: By leveraging a PLAINTEXT internal listener on port 9092, we do not need to provide a KAFKA_CLUSTERS_0_PROPERTIES_SSL_TRUSTSTORE_LOCATION configuration.
With this configuration, users need to authenticate via LDAP to access the Kafka UI. Once they are logged in, every interaction with the Kafka cluster is performed as the admin user defined in KAFKA_CLUSTERS_0_PROPERTIES_SASL_JAAS_CONFIG. Role-based access control was recently introduced with this issue.
Schema Registry with Strimzi
We had a functional requirement to deploy a Schema Registry instance for our Kafka clusters running in Kubernetes.
While Strimzi goes the extra mile by managing additional tools like Kafka Connect or MirrorMaker instances, it is not yet capable of deploying a Schema Registry.
To mitigate this issue, the Rubin Observatory Science Quality and Reliability Engineering team worked on the strimzi-registry-operator.
The configuration we used is the one showcased in the example section of the README.
The only issue we encountered is that the operator is not yet capable of deploying a Schema Registry backed by a SCRAM-SHA-512 secured cluster.
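For reference, the custom resource from the README looks roughly like the following; the exact schema may have evolved, so check the project’s documentation (fields abridged):

apiVersion: roundtable.lsst.codes/v1beta1
kind: StrimziSchemaRegistry
metadata:
  name: confluent-schema-registry
spec:
  # Name of the Strimzi listener the registry connects through
  listener: tls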
What about ZooKeeper-less Kafka?
After many years of work on KIP-500, the Apache Kafka team announced that running Kafka in KRaft mode (without ZooKeeper) is production ready. The announcement was made as part of the Kafka 3.3 release.
The Strimzi team began working on KRaft mode in Strimzi 0.29.0. As stated in the Strimzi documentation, the feature is still experimental, at both the Kafka and Strimzi levels.
Strimzi’s primary contributor, Jakub Scholz, has commented the following on the matter:
I think calling it production ready for new clusters is a bit strange. It means that we would need to maintain two parallel code paths with guaranteed upgrades etc. for possibly a long time. So, TBH, I hoped we would have much more progress at this point in time and be more prepared for ZooKeeper removal. But as a my personal opinion - I would be probably very reluctant to call anything at this stage production ready anyway.
Judging by these comments, ZooKeeper-less Kafka is unlikely to become the default configuration in Strimzi’s next release (0.34.0 at the time of writing), but it will definitely happen at some point.
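For the curious, Strimzi’s experimental KRaft support also sits behind a feature gate which, at the time of writing, has to be enabled together with StrimziPodSets on the Cluster Operator:

# Excerpt from the Cluster Operator Deployment
env:
  - name: STRIMZI_FEATURE_GATES
    value: "+UseKRaft,+UseStrimziPodSets"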
What about storage?
Storage is often a pain point with bare metal Kubernetes clusters, and Kafka is no exception.
The community consensus for provisioning storage on Kubernetes is Ceph via Rook, though other solutions exist (Longhorn or OpenEBS on the open source side, Portworx or Linstor as proprietary solutions).
Comparing storage engines for bare metal Kubernetes clusters is too big a topic for this article, but feel free to check out our previous article “Ceph object storage within a Kubernetes cluster with Rook” for more on Rook.
We did have the opportunity to compare a 3-broker Kafka installation backed by Strimzi and Rook Ceph against a 3-broker Kafka cluster running on the same machines with direct disk access.
Here are the specs and results of the benchmark:
Specs
Kubernetes environment:
- Kafka version 3.2.0 on Kubernetes through Strimzi
- 3 brokers (one pod per node)
- 6 RBD devices per broker (provisioned by the Rook Ceph StorageClass)
- Java default Xms (2g)
- Java default Xmx (29g)
Bare metal environment:
- Kafka version 3.2.0 as a JVM process with the Apache release
- 3 brokers (one JVM per node)
- 6 disks per broker (JBOD with ext4 formatting)
- Java default Xms (2g)
- Java default Xmx (29g)
Notes: The benchmarks were run on the same machines (HP Gen 7 with 192 GB of RAM and 6 x 2 TB disks) running RHEL 7.9. Kubernetes was not running when Kafka ran as a JVM process, and vice versa. The write benchmark relied on the kafka-producer-perf-test tool shipped with Kafka:
kafka-producer-perf-test \
--topic my-topic-benchmark \
--record-size 1000 \
--throughput -1 \
--producer.config /mnt/kafka.properties \
--num-records 50000000
Note: The topic my-topic-benchmark has 100 partitions and 1 replica.
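On the Strimzi side, this topic can be declared like any other KafkaTopic; a minimal sketch mirroring the note above:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic-benchmark
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 100
  replicas: 1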
Results
We ran the previous benchmark 10 times on each configuration and averaged the results:
| Metric | JBOD bare metal | Ceph RBD | Performance difference |
|---|---|---|---|
| Records/sec | 75223 | 65207 | -13.3 % |
| Avg latency (ms) | 1.45 | 1.28 | +11.1 % |
The results are interesting: while write throughput was higher with direct JBOD access, average latency was lower with Ceph.
Strimzi alternatives
There are two main alternatives to Strimzi when it comes to operating Kafka on Kubernetes:
- Confluent for Kubernetes
- Koperator (previously known as “Banzai Cloud Kafka Operator”)
We did not test Koperator thoroughly so it would be unfair to compare it to Strimzi in this article.
As for the Confluent operator, it provides many features that we don’t have with Strimzi. Here are a few that we deemed interesting:
- Schema Registry integration
- ksqlDB integration
- LDAP authentication support
- Out-of-the-box UI (Confluent Control Center) for both admins and developers
- Alerting
- Tiered storage
All of these come at the cost (literally) of buying a commercial license from Confluent. Note that the operator and Control Center can be tested during a 30-day trial period.