File Format
A file format is a way of representing and organizing data inside a file. A format is often identified by a file extension (e.g. .csv). This representation lets software that supports the format decode the information the file contains, and it enables interoperability between programs.
Choosing an appropriate file format is of paramount importance in data processing: depending on the use case, some formats are more suitable than others because of their specific characteristics. For example, CSV is easy to read and widely used despite its lack of formalism.
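As a small illustration of that lack of formalism, the sketch below parses the same text twice with Python's standard csv module, once assuming a semicolon delimiter and once assuming a comma. The sample data and the helper name parse_text are made up for the example. Nothing in the file itself states which reading is intended, so the reader has to know or guess the convention.

```python
import csv
import io


def parse_text(text, delimiter):
    """Parse CSV text with an explicit delimiter and return the rows as lists."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample: semicolon-separated fields, price written with a decimal comma.
sample = "name;price\nwidget;1,50\n"

rows_semicolon = parse_text(sample, ";")
rows_comma = parse_text(sample, ",")

print(rows_semicolon)  # [['name', 'price'], ['widget', '1,50']]
print(rows_comma)      # [['name;price'], ['widget;1', '50']]
```

Both calls succeed without error, which is the point: a self-describing format with an explicit schema would reject or disambiguate the second reading, whereas CSV leaves it to convention.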
- Learn more
- Wikipedia
Related articles
Data platform requirements and expectations
Categories: Big Data, Infrastructure | Tags: Data Engineering, Data Governance, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Science
A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources. It is composed of several…
By David WORMS
Mar 23, 2023
Comparison of database architectures: data warehouse, data lake and data lakehouse
Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format
Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…
By Gonzalo ETSE
May 17, 2022
CSV package for Node.js version 6
Categories: Node.js | Tags: Data Engineering, Refactoring, CSV, File Format, Release and features
Version 6 of the package for Node.js is released along its sub projects. Here are the latest versions: version , latest version was NPM version , latest version was NPM version , latest version…
By David WORMS
Nov 15, 2021
Storage size and generation time in popular file formats
Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)
Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…
Mar 22, 2021
Introduction to Ludwig and how to deploy a Deep Learning model via Flask
Categories: Data Science, Tech Radar | Tags: Learning and tutorial, Deep Learning, Ludwig Deep Learning Toolbox, Machine Learning, Python
Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…
Mar 2, 2020
Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop
Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Python, Streaming
Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…
May 28, 2019
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science
Hadoop is already a large ecosystem and my guess is that 2013 will be the year where it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, HBase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013
Apache Hive Essentials How-to by Darren Lee
Categories: Business Intelligence, Learning | Tags: UDF, Hadoop, Hive, File Format, SQL
Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” (edit: the second edition is now available) written by Darren Lee and published by Packt Publishing…
By David WORMS
Apr 23, 2013
Convert .flac music files to .mp3 on osx
Categories: Hack | Tags: OS X, File Format
As an OS X user for years now, one should know by now that iTunes doesn’t support the FLAC format. We are now in 2012, and I’ve been waiting for this to happen for years. Losing patience, dark…
By David WORMS
Jul 20, 2012
HDFS and Hive storage - comparing file formats and compression methods
Categories: Big Data | Tags: Business intelligence, Hive, ORC, Parquet, File Format
A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…
By David WORMS
Mar 13, 2012
Two Hive UDAF to convert an aggregation to a map
Categories: Data Engineering | Tags: Java, HBase, Hive, File Format
I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The…
By David WORMS
Mar 6, 2012
Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: CRM, timeseries, Tuning, Hadoop, HDFS, Hive, File Format
In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billiard rows…
By David WORMS
Jan 10, 2012