Storage is the capacity to retain digital information on a computer component. In practice, storage is organised as a hierarchy: hot data, which requires fast but costly access, sits closer to the CPU, while cold data sits further away on slower but persistent devices, sometimes accessed over the network. Fast but volatile storage is most often called "memory".
The main characteristics of storage include volatility, mutability, accessibility, addressability, capacity, performance, energy use and security.
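The hierarchy described above can be illustrated with a toy two-tier store. This is only a minimal sketch: the `TieredStore` class, its `hot_capacity` parameter and the least-recently-used eviction policy are assumptions made for illustration, not something the definition prescribes. A small in-memory dictionary stands in for fast volatile memory, and a second dictionary stands in for a slower persistent device.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small, fast 'hot' tier in front of a large 'cold' tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # stands in for fast, volatile memory
        self.cold = {}            # stands in for a slower, persistent device

    def put(self, key, value):
        # Write through to the cold tier so the value outlives the hot tier.
        self.cold[key] = value
        self._promote(key, value)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)  # mark as most recently used
            return self.hot[key]
        value = self.cold[key]         # slow path; raises KeyError if absent
        self._promote(key, value)
        return value

    def _promote(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict the least recently used entry


store = TieredStore(hot_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, name.upper())
print(list(store.hot))  # keys currently held in the hot tier
print(store.get("a"))   # read from the cold tier, then promoted
```

Real systems apply the same idea across many more levels (CPU caches, RAM, local disks, network storage), each trading capacity and persistence against access speed.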
Related articles

Storage and massive processing with Hadoop
Categories: Big Data | Tags: Hadoop, HDFS, Storage
Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…
By David WORMS
Nov 26, 2010

Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: CRM, timeseries, Tuning, Hadoop, HDFS, Hive, File Format
In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated to one time series table of about 7,000 billiard rows…
By David WORMS
Jan 10, 2012

Two Hive UDAF to convert an aggregation to a map
Categories: Data Engineering | Tags: Java, HBase, Hive, File Format
I am publishing two new Hive UDAF to help with maps in Apache Hive. The source code is available on GitHub in two Java classes: “UDAFToMap” and “UDAFToOrderedMap” or you can download the jar file. The…
By David WORMS
Mar 6, 2012

HDFS and Hive storage - comparing file formats and compression methods
Categories: Big Data | Tags: Business intelligence, Hive, ORC, Parquet, File Format
A few days ago, we conducted a test in order to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…
By David WORMS
Mar 13, 2012

Merging multiple files in Hadoop
Categories: Hack | Tags: File system, Hadoop, HDFS
This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility of but contrary to “getmerge”, the final…
By David WORMS
Jan 12, 2013

State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science
Hadoop is already a large ecosystem and my guess is that 2013 will be the year when it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, HBase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013

Red Hat Storage Gluster and its integration with Hadoop
Categories: Big Data | Tags: GlusterFS, Red Hat, Hadoop, HDFS, Storage
I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have compiled my notes here, at least partially. I will…
By David WORMS
Jul 3, 2015

Hive, Calcite and Druid
Categories: Big Data | Tags: Business intelligence, Database, Druid, Hadoop, Hive
BI/OLAP requires interactive visualization of complex data streams: real-time bidding events, user activity streams, voice call logs, network traffic flows, firewall events, application KPIs. Traditional…
By David WORMS
Jul 14, 2016

Kubernetes 1.8
Categories: Containers Orchestration, Open Source Summit Europe 2017 | Tags: containerd, CRD, RBAC, Kubernetes, Network, OCI, Release and features
The 1.8 release of Kubernetes brings a lot of new things. With 2500+ pull requests, 2000+ commits and 400+ committers, Kubernetes added 39 new features in this version. This is the richest release in terms…
Oct 24, 2017

Kubernetes Storage Primitives for Stateful Workloads
Categories: Cloud Computing, Containers Orchestration, Open Source Summit Europe 2017 | Tags: Container Storage Interface (CSI), PVC, Azure, Docker, GCE, Kubernetes, Storage
This article is based on the presentation “Introduction to Kubernetes Storage Primitives for Stateful Workloads” from the OSS Convention Prague 2017 by the {Code} team. So, let’s start, what is…
Oct 28, 2017

Notes after Katacoda Training on Kubernetes Container Orchestration
Categories: Containers Orchestration, Learning | Tags: Helm, Ingress, Kubeadm, CNI, Micro Services, Minikube, Kubernetes
A few weeks ago, I dedicated two days to following the tutorials available on Katacoda, the interactive learning platform for Kubernetes or any other container orchestration platform. I’m sharing my…
By David WORMS
Dec 14, 2017

YARN and GPU Distribution for Machine Learning
Categories: Data Science, DataWorks Summit 2018 | Tags: GPU, YARN, Machine Learning, Neural Network, Storage
This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…
May 30, 2018

Apache Flink: past, present and future
Categories: Data Engineering | Tags: Pipeline, Flink, Kubernetes, Machine Learning, SQL, Streaming
Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…
Nov 5, 2018

Running Apache Hive 3, new features and tips and tricks
Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: JDBC, LLAP, Druid, Hadoop, Hive, Kafka, Release and features
Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since…
Jul 25, 2019

Rook with Ceph doesn't provision my Persistent Volume Claims!
Categories: DevOps & SRE | Tags: PVC, Linux, Rook, Ubuntu, Ceph, Cluster, Internship, Kubernetes
Ceph installation inside Kubernetes can be provisioned using Rook. Currently doing an internship at Adaltas, I was in charge of participating in the setup of a Kubernetes (k8s) cluster. To avoid…
Sep 9, 2019

Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online on the 17-19th of November…
Sep 30, 2020

OAuth2 and OpenID Connect, a gentle and working introduction (Part 1)
Categories: Containers Orchestration, Cyber Security | Tags: Go Lang, JAMstack, LDAP, CNCF, Kubernetes, OAuth2, OpenID Connect
Understanding OAuth2, OpenID and OpenID Connect (OIDC), how they relate, how the communications are established, and how to architect your application with the given access, refresh and id tokens…
By David WORMS
Nov 17, 2020

Apache HBase: RegionServers co-location
Categories: Big Data, Adaltas Summit 2021, Infrastructure | Tags: Ambari, Database, Infrastructure, Tuning, Hadoop, HBase, Big Data, HDP, Storage
RegionServers are the processes that manage the storage and retrieval of data in Apache HBase, the non-relational column-oriented database in Apache Hadoop. It is through their daemons that any CRUD…
Feb 22, 2022

Architecture of object-based storage and S3 standard specifications
Categories: Big Data, Data Governance | Tags: Database, API, Amazon S3, Big Data, Data Lake, Storage
Object storage has been growing in popularity among data storage architectures. Compared to file systems and block storage, object storage faces no limitations when handling petabytes of data. By…
Jun 20, 2022

MinIO object storage within a Kubernetes cluster
Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Cluster, Data Lake, Kubernetes, Storage
MinIO is a popular object storage solution. Often recommended for its simple setup and ease of use, it is not only a great way to get started with object storage: it also provides excellent…
Jul 9, 2022

Ceph object storage within a Kubernetes cluster with Rook
Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Ceph, Cluster, Data Lake, Kubernetes, Storage
Ceph is a distributed all-in-one storage system. Reliable and mature, its first stable version was released in 2012 and has since then been the reference for open source storage. Ceph’s main perk is…
Aug 4, 2022