Batch processing

Batch processing is a data processing term describing the non-interactive, automatic, sequential, and complete processing of one or more input files. Common examples of batch processing include:

  • Accounting: Book incoming payments of a working day, leading to new account balances
  • Data migration: Convert a number of files from one format into another
  • Retail: Generate aggregate statistics from all sales in the current month

The results of batch processing are often batches themselves, for example lists of receipts, reports or changed data sets.
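To make the definition concrete, here is a minimal Python sketch of a batch job for the accounting example above: it reads every payment file of a working day, aggregates the amounts per account, and writes the new balances in a single, non-interactive pass. The file layout and column names are assumptions made for the illustration, not part of the original text.

```python
import csv
import glob
from collections import defaultdict

def run_batch(input_glob: str, output_path: str) -> None:
    """Non-interactive batch run: read the complete set of input files,
    aggregate them, then write the result in one go."""
    balances = defaultdict(float)

    # Sequentially process every matching input file in full.
    for path in sorted(glob.glob(input_glob)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):  # assumed columns: account, amount
                balances[row["account"]] += float(row["amount"])

    # The result is itself a batch: one output file per run.
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["account", "balance"])
        for account, balance in sorted(balances.items()):
            writer.writerow([account, f"{balance:.2f}"])

if __name__ == "__main__":
    # Hypothetical layout: one CSV of incoming payments per working day.
    run_batch("payments/2021-03-*.csv", "balances_2021-03.csv")
```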

Learn more
Wikipedia

Related articles

Storage size and generation time in popular file formats

Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)

Choosing an appropriate file format is essential, whether your data travels over the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…

By Barthelemy NGOM

Mar 22, 2021

Comparison of different file formats in Big Data

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers

In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending on the use cases and exists to serve one or several purposes…

By Aida NGOM

Jul 23, 2020

Apache Flink: past, present and future

Categories: Data Engineering | Tags: Flink, Pipeline, Kubernetes, Machine Learning, SQL, Streaming

Apache Flink is a little gem which deserves a lot more attention. Let's dive into Flink's past, its current state and the future it is heading toward by following the keynotes and presentations at Flink…

By César BEREZOWSKI

Nov 5, 2018

Apache Beam: a unified programming model for data processing pipelines

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Flink, Pipeline, Spark

In this article, we will review the concepts, the history and the future of Apache Beam, which may well become the new standard for defining data processing pipelines. At DataWorks Summit 2018 in…

By Gauthier LEONARD

May 24, 2018
