Apache Kafka

Apache Kafka is an open source platform for stream processing. The software was originally developed by LinkedIn and written in the programming languages Scala and Java. In 2011, Kafka became part of the Apache foundation. The software was named after the author Franz Kafka because it represents a system optimized for writing.

The aim of the project is to provide a unified platform with high throughput and low latency for processing real-time data feeds. Kafka can connect to external systems and, with Kafka Streams, offers stream processing in Java.

Kafka is widely used in real-time streaming data architectures to provide real-time analytics. It is defined for:

Publication and subscription to data streams
Effective storage of data streams
Process streams in real time

As the software is a fast, scalable and fault tolerant publish-subscribe messaging system, Kafka is used in use cases where the message systems Java Message Service (JMS), RabbitMQ and AMQP may not be considered due to the volume and responsiveness. Kafka offers higher throughput and reliability properties and is therefore suitable for high data volumes with which conventional Message Oriented Middleware (MOM) may be overwhelmed.

Learn more: Official website
Related tags: NATS; Streaming

State of the Hadoop open-source ecosystem in early 2013

Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science

Hadoop is already a large ecosystem and my guess is that 2013 will be the year where it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, hbase, Hive, Pig, Flume…

By David WORMS

Jul 8, 2013

Apache Apex: next gen Big Data analytics

Categories: Data Science, Events, Tech Radar | Tags: Apex, Storm, Tools, Flink, Hadoop, Kafka, Data Science, Machine Learning

Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallel…

By César BEREZOWSKI

Jul 17, 2016

Exposing Kafka on two different networks

Categories: Infrastructure | Tags: Cyber Security, VLAN, Kafka, Cloudera, CDH, Network

A Big Data setup usually requires you to have multiple networking interface, let’s see how to set up Kafka on more than one of them. Kafka is a open-source stream processing software platform system…

By César BEREZOWSKI

Jul 22, 2017

Apache Metron in the Real World

Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, NiFi, Solr, Storm, pcap, RDBMS, HDFS, Kafka, Metron, Spark, Data Science, Elasticsearch, SQL

Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…

By Michael HATOUM

May 29, 2018

Curing the Kafka blindness with the UI manager

Categories: Big Data | Tags: Ambari, Hortonworks, HDF, JMX, UI, Kafka, Ranger, HDP

Today it’s really difficult for developers, operators and managers to visualize and monitor what happens in a Kafka cluster. This articles covers a new graphical interface to oversee Kafka. It was…

By Lucas BAKALIAN

Jun 20, 2018

Lando: Deep Learning used to summarize conversations

Categories: Data Science, Learning | Tags: Micro Services, Open API, Deep Learning, Internship, Kubernetes, Neural Network, Node.js

Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning technics to summarize contents. It allows users to…

By Yliess HATI

Sep 18, 2018

Deploying a secured Flink cluster on Kubernetes

Categories: Big Data | Tags: Encryption, Flink, HDFS, Kafka, Elasticsearch, Kerberos, SSL/TLS

When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…

By David WORMS

Oct 8, 2018

Running Apache Hive 3, new features and tips and tricks

Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: JDBC, LLAP, Druid, Hadoop, Hive, Kafka, Release and features

Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It is available since…

By Gauthier LEONARD

Jul 25, 2019

Machine Learning model deployment

Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema

“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…

By Oskar RYNKIEWICZ

Sep 30, 2019

Internship Data Science & Data Engineer - ML in production and streaming data ingestion

Categories: Data Engineering, Data Science | Tags: DevOps, Flink, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python

Context The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…

By David WORMS

Nov 26, 2019

InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS

Categories: Big Data, Containers Orchestration | Tags: DevOps, LXD, Hadoop, Kafka, Spark, Ceph, Internship, Kubernetes, NoSQL

Context The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PAAS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…

By David WORMS

Nov 26, 2019

Should you move your Big Data and Data Lake to the Cloud

Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Azure, Cloud, CDP, Databricks, GCP

Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New-York, a general focus was put on moving customer’s Big…

By Joris RUMMENS

Dec 9, 2019

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…

By Oskar RYNKIEWICZ

Apr 18, 2019

Policy enforcing with Open Policy Agent

Categories: Cyber Security, Data Governance | Tags: Kafka, Ranger, Authorization, Cloud, Kubernetes, REST, SSL/TLS

Open Policy Agent is an open-source multi-purpose policy engine. Its main goal is to unify policy enforcement across the cloud native stack. The project was created by Styra and it is currently…

By Leo SCHOUKROUN

Jan 22, 2020

Comparison of different file formats in Big Data

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers

In data processing, there are different types of files formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…

By Aida NGOM

Jul 23, 2020

Internship in Data Engineering

Categories: Front End, Learning | Tags: Metrics, Monitoring, Hive, Kafka, Delta Lake, Elasticsearch, IaC, Internship, Kubernetes, Streaming

Job Description Data is a valuable business asset. Some call it the new oil. The data engineer collects, transform and refine raw data into information that can be used by business analysts and data…

By David WORMS

Oct 25, 2021

Spring 2022 internship - building a Data Lab

Categories: Data Science, Learning | Tags: MongoDB, Spark, Argo CD, Elasticsearch, Internship, Keycloak, Kubernetes, OpenID Connect, PostgreSQL

Job Description Over the last few years, we developed the ability to use computers to process large amounts of data. The ecosystem evolved over a large offering of tools and libraries and the creation…

By David WORMS

Nov 24, 2021

Operating Kafka in Kubernetes with Strimzi

Categories: Big Data, Containers Orchestration, Infrastructure | Tags: Kafka, Big Data, Kubernetes, Open source, Streaming

Kubernetes is not the first platform that comes to mind to run Apache Kafka clusters. Indeed, Kafka’s strong dependency on storage might be a pain point regarding Kubernetes’ way of doing things when…

By Leo SCHOUKROUN

Mar 7, 2023

Apache Kafka

Related articles