Apache Iceberg
Apache Iceberg is an open table format for huge analytic datasets. Developed at Netflix, Iceberg was designed as an open community standard: a table format specification that guarantees compatibility across multiple languages and implementations. Since it was open sourced, organizations such as Apple have actively contributed to its development.
Between 2016 and 2018, Iceberg, along with Delta Lake and Apache Hudi, emerged to challenge the Apache Hive table format in use since 2010. Besides acting as a query engine for large batch jobs, Hive provides a metadata catalog and a table format used by query engines such as Spark and Presto. Hive's main issue was handling data changes over large datasets while coordinating multiple applications without corrupting the data. Solving this required atomic transactions.
According to its creators, Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to work with the same tables simultaneously and safely. It is written in Java and offers a Scala API. At the center of its architecture sits a catalog that supports atomically updating the current metadata pointer, which is what makes transactions atomic.
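As an illustration of how an engine reaches Iceberg tables through a catalog, here is a minimal Spark (Scala) sketch. The catalog name `demo`, the warehouse path, and the table name are hypothetical placeholders, and it assumes the matching `iceberg-spark-runtime` package is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object IcebergQuickstart {
  def main(args: Array[String]): Unit = {
    // Register a Hadoop-backed Iceberg catalog named "demo" (name and path are placeholders).
    val spark = SparkSession.builder()
      .appName("iceberg-quickstart")
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")
      .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg/warehouse")
      .getOrCreate()

    // Create an Iceberg table through the catalog.
    spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, name STRING) USING iceberg")

    // Each write commits a new snapshot by atomically swapping the current metadata pointer.
    spark.sql("INSERT INTO demo.db.events VALUES (1, 'created'), (2, 'updated')")

    // Any engine configured with the same catalog sees the last committed snapshot.
    spark.sql("SELECT * FROM demo.db.events").show()

    spark.stop()
  }
}
```

Because commits go through the catalog's atomic pointer swap, concurrent readers and writers from different engines either see the previous snapshot or the new one, never a partially written table.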
Iceberg is still in active development and is being adopted by multiple organizations such as AWS, Adobe, Apple, Netflix, Dremio, LinkedIn, and Expedia.
- Learn more
- Official website
Related articles
CDP part 6: end-to-end data lakehouse ingestion pipeline with CDP
Categories: Big Data, Data Engineering, Learning | Tags: NiFi, Business intelligence, Data Engineering, Iceberg, Spark, Big Data, Cloudera, CDP, Data Analytics, Data Lake, Data Warehouse
In this hands-on lab session we demonstrate how to build an end-to-end big data solution with Cloudera Data Platform (CDP) Public Cloud, using the infrastructure we have deployed and configured over…
Jul 24, 2023
CDP part 1: introduction to end-to-end data lakehouse architecture with CDP
Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Data Engineering, Hortonworks, Iceberg, AWS, Azure, Big Data, Cloud, Cloudera, CDP, Cloudera Manager, Data Warehouse
Cloudera Data Platform (CDP) is a hybrid data platform for big data transformation, machine learning and data analytics. In this series we describe how to build and use an end-to-end big data…
By Stephan BAUM
Jun 8, 2023
Data platform requirements and expectations
Categories: Big Data, Infrastructure | Tags: Data Engineering, Data Governance, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Science
A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources. It is composed of several…
By David WORMS
Mar 23, 2023
Adaltas Summit 2022 Morzine
Categories: Big Data, Adaltas Summit 2022 | Tags: Data Engineering, Infrastructure, Iceberg, Container, Data lakehouse, Docker, Kubernetes
For its third edition, the whole Adaltas crew is gathering in Morzine for a full week, with 2 days dedicated to technology on the 15th and 16th of September 2022. The speakers choose one of the…
By David WORMS
Jan 13, 2023
Comparison of database architectures: data warehouse, data lake and data lakehouse
Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format
Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…
By Gonzalo ETSE
May 17, 2022