Apache ORC
Related articles
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: Avro, Hive, NiFi, ORC, Spark, File Format, Data Governance, HDF, Operation, Protocol Buffers, Registry, Schema, Data Lake
Creating a Data Lake requires rigor and experience. Here are some good practices for data ingestion, for both batch and stream architectures, that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
What's new in Apache Spark 2.3?
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, ORC, Spark, Spark MLlib, PySpark, Docker, Kubernetes, Streaming, Tuning, pandas
Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…
May 23, 2018
Essential questions about Time Series
Categories: Big Data | Tags: Druid, Hive, ORC, Elasticsearch, Grafana, IOT, HBase
Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. We…
By David WORMS
Mar 19, 2018
HDFS and Hive storage - comparing file formats and compression methods
Categories: Big Data | Tags: Analytics, Hive, ORC, Parquet, File Format
A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…
By David WORMS
Mar 13, 2012