All our articles
Introduction to OpenLineage
Categories: Big Data, Data Governance, Infrastructure | Tags: Data Engineering, Infrastructure, Atlas, Data Lake, Data lakehouse, Data Warehouse, Data lineage
OpenLineage is an open-source specification for data lineage. The specification is complemented by Marquez, its reference implementation. Since its launch in late 2020, OpenLineage has been a presence…
Dec 19, 2023
Installation Guide to TDP, the 100% open source big data platform
Categories: Big Data, Infrastructure | Tags: Infrastructure, VirtualBox, Hadoop, Vagrant, TDP
The Trunk Data Platform (TDP) is a 100% open source big data distribution, based on Apache Hadoop and compatible with HDP 3.1. Initiated in 2021 by EDF, the DGFiP and Adaltas, the project is governed…
By Paul FARAULT
Oct 18, 2023
New TDP website launched
Categories: Big Data | Tags: Programming, Ansible, Hadoop, Python, TDP
The new TDP (Trunk Data Platform) website is online. We invite you to browse its pages to discover the platform, stay informed, and cultivate contact with the TDP community. TDP is a completely open…
By David WORMS
Oct 3, 2023
CDP part 6: end-to-end data lakehouse ingestion pipeline with CDP
Categories: Big Data, Data Engineering, Learning | Tags: NiFi, Business intelligence, Data Engineering, Iceberg, Spark, Big Data, Cloudera, CDP, Data Analytics, Data Lake, Data Warehouse
In this hands-on lab session we demonstrate how to build an end-to-end big data solution with Cloudera Data Platform (CDP) Public Cloud, using the infrastructure we have deployed and configured over…
Jul 24, 2023
CDP part 5: user permissions management on CDP Public Cloud
Categories: Big Data, Cloud Computing, Data Governance | Tags: Ranger, Cloudera, CDP, Data Warehouse
When you create a user or a group in CDP, it requires permissions to access resources and use the Data Services. This article is the fifth in a series of six: CDP part 1: introduction to end-to-end…
Jul 18, 2023
CDP part 4: user management on CDP Public Cloud with Keycloak
Categories: Big Data, Cloud Computing, Data Governance | Tags: EC2, Big Data, CDP, Docker Compose, Keycloak, SSO
Previous articles of the series cover the deployment of a CDP Public Cloud environment. All the components are ready for use and it is time to make the environment available to other users to explore…
Jul 4, 2023
CDP part 3: Data Services activation on CDP Public Cloud environment
Categories: Big Data, Cloud Computing, Infrastructure | Tags: Infrastructure, AWS, Big Data, Cloudera, CDP
One of the big selling points of Cloudera Data Platform (CDP) is their mature managed service offering. These are easy to deploy on-premises, in the public cloud or as part of a hybrid solution. The…
Jun 27, 2023
CDP part 2: CDP Public Cloud deployment on AWS
Categories: Big Data, Cloud Computing, Infrastructure | Tags: Infrastructure, AWS, Big Data, Cloud, Cloudera, CDP, Cloudera Manager
The Cloudera Data Platform (CDP) Public Cloud provides the foundation upon which full-featured data lakes are created. In a previous article, we introduced the CDP platform. This article is the second…
Jun 19, 2023
CDP part 1: introduction to end-to-end data lakehouse architecture with CDP
Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Data Engineering, Hortonworks, Iceberg, AWS, Azure, Big Data, Cloud, Cloudera, CDP, Cloudera Manager, Data Warehouse
Cloudera Data Platform (CDP) is a hybrid data platform for big data transformation, machine learning and data analytics. In this series we describe how to build and use an end-to-end big data…
By Stephan BAUM
Jun 8, 2023
Local development environments with Terraform + LXD
Categories: Containers Orchestration, DevOps & SRE | Tags: Automation, DevOps, KVM, LXD, Virtualization, VM, Terraform, Vagrant
As a Big Data Solutions Architect and InfraOps, I need development environments to install and test software. They have to be configurable, flexible, and performant. Working with distributed systems…
Jun 1, 2023
Data platform requirements and expectations
Categories: Big Data, Infrastructure | Tags: Data Engineering, Data Governance, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Science
A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources. It is composed of several…
By David WORMS
Mar 23, 2023
Keycloak deployment in EC2
Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Security, EC2, Authentication, AWS, Docker, Keycloak, SSL/TLS, SSO
Why use Keycloak? Keycloak is an open-source identity provider (IdP) using single sign-on (SSO). An IdP is a tool to create, maintain, and manage identity information for principals and to provide…
By Stephan BAUM
Mar 14, 2023
Operating Kafka in Kubernetes with Strimzi
Categories: Big Data, Containers Orchestration, Infrastructure | Tags: Kafka, Big Data, Kubernetes, Open source, Streaming
Kubernetes is not the first platform that comes to mind to run Apache Kafka clusters. Indeed, Kafka’s strong dependency on storage might be a pain point regarding Kubernetes’ way of doing things when…
Mar 7, 2023
Kubernetes: debugging with ephemeral containers
Categories: Containers Orchestration, Tech Radar | Tags: Debug, Kubernetes
Anyone who has ever had to manipulate Kubernetes has found themselves confronted with resolving pod errors. The methods provided for this purpose are efficient and make it possible to overcome the most…
Feb 7, 2023
Dive into tdp-lib, the SDK in charge of TDP cluster management
Categories: Big Data, Infrastructure | Tags: Programming, Ansible, Hadoop, Python, TDP
All the deployments are automated and Ansible plays a central role. With the growing complexity of the code base, a new system was needed to overcome Ansible’s limitations, which will enable us to…
Jan 24, 2023
Adaltas Summit 2022 Morzine
Categories: Big Data, Adaltas Summit 2022 | Tags: Data Engineering, Infrastructure, Iceberg, Container, Data lakehouse, Docker, Kubernetes
For its third edition, the whole Adaltas crew is gathering in Morzine for a whole week with 2 days dedicated to technology, the 15th and the 16th of September 2022. The speakers choose one of the…
By David WORMS
Jan 13, 2023
How to build your OCI images using Buildpacks
Categories: Containers Orchestration, DevOps & SRE | Tags: CI/CD, CNCF, Docker, Kubernetes, OCI
Docker has become the new standard for building your application. In a Docker image we place our source code, its dependencies, some configurations and our application is almost ready to be deployed…
Jan 9, 2023
Big data infrastructure internship
Categories: Big Data, Data Engineering, DevOps & SRE, Infrastructure | Tags: Infrastructure, Hadoop, Big Data, Cluster, Internship, Kubernetes, TDP
Job description: Big Data and distributed computing are at the core of Adaltas. We accompany our partners in the deployment, maintenance, and optimization of some of the largest clusters in France…
By Stephan BAUM
Dec 2, 2022
Traefik, Docker and dnsmasq to simplify container networking
Categories: Containers Orchestration, Infrastructure, Tech Radar | Tags: DNS, Gatsby, JAMstack, Linux, Docker, Network
Good tech adventures start with some frustration, a need, or a requirement. This is the story of how I simplified the management and access of my local web applications with the help of Traefik and…
By David WORMS
Nov 17, 2022
WasmEdge: WebAssembly runtimes are coming for the edge
Categories: Containers Orchestration, Adaltas Summit 2021, Infrastructure, Tech Radar | Tags: JAMstack, Linux, Docker, Rust Lang, WebAssembly
With many security challenges solved by design in its core conception, lots of projects benefit from using WebAssembly. WasmEdge runtime is an efficient Virtual Machine optimized for edge computing…
Sep 29, 2022
Ingresses and Load Balancers in Kubernetes with MetalLB and nginx-ingress
Categories: Containers Orchestration, Infrastructure, Tech Radar | Tags: Ingress, Kubeadm, Cluster, Deployment, Kubernetes
When it comes to exposing services from a Kubernetes cluster and making them accessible from outside the cluster, the recommended option is to use a load-balancer type service to redirect incoming…
Sep 8, 2022
Spark on Hadoop integration with Jupyter
Categories: Adaltas Summit 2021, Infrastructure, Tech Radar | Tags: Infrastructure, Jupyter, Spark, YARN, CDP, HDP, Notebook, TDP
For several years, Jupyter notebook has established itself as the notebook solution in the Python universe. Historically, Jupyter is the tool of choice for data scientists who mainly develop in Python…
Sep 1, 2022
Framework laptop with NixOS, a user feedback
Categories: Learning, Tech Radar | Tags: CLI, DevOps, Learning and tutorial, Linux, Packaging, NixOS, Open source
A new job comes with a new laptop. As such, I was given a Framework Laptop DIY Edition with the objective to install and configure it entirely with NixOS. I will share my first impressions after…
Aug 22, 2022
Ceph object storage within a Kubernetes cluster with Rook
Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Ceph, Cluster, Data Lake, Kubernetes, Storage
Ceph is a distributed all-in-one storage system. Reliable and mature, its first stable version was released in 2012 and has since then been the reference for open source storage. Ceph’s main perk is…
By Luka BIGOT
Aug 4, 2022
MinIO object storage within a Kubernetes cluster
Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Cluster, Data Lake, Kubernetes, Storage
MinIO is a popular object storage solution. Often recommended for its simple setup and ease of use, it is not only a great way to get started with object storage: it also provides excellent…
By Luka BIGOT
Jul 9, 2022
Architecture of object-based storage and S3 standard specifications
Categories: Big Data, Data Governance | Tags: Database, API, Amazon S3, Big Data, Data Lake, Storage
Object storage has been growing in popularity among data storage architectures. Compared to file systems and block storage, object storage faces no limitations when handling petabytes of data. By…
By Luka BIGOT
Jun 20, 2022
TDP workshop: Become a TDP power user from your terminal
Categories: Events, Learning | Tags: DevOps, Ansible, Hadoop, Open source, TDP
The TDP CLI is used to deploy and operate your TDP services. It relies on tdp-lib to provide control and flexibility at your fingertips. Some time ago, we announced the public release of TDP - Trunk…
By Paul FARAULT
Jun 17, 2022
Comparison of database architectures: data warehouse, data lake and data lakehouse
Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format
Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…
By Gonzalo ETSE
May 17, 2022
NixOS: Enabling LXD virtual machines using Flakes
Categories: Hack, Learning | Tags: Learning and tutorial, Linux, LXD, Packaging, VM, GitHub, NixOS, Open source
Nixpkgs is an ever-increasing collection of software packages for Nix and NixOS. Even with more than 80,000 packages, you can easily run into a situation where there is a functionality that is not yet…
May 13, 2022
Databricks logs collection with Azure Monitor at a Workspace Scale
Categories: Cloud Computing, Data Engineering, Adaltas Summit 2021 | Tags: Metrics, Monitoring, Spark, Azure, Databricks, Log4j
Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is crucial to ensure data quality, job performance, and security issues by limiting access to…
By Claire PLAYE
May 10, 2022
Introducing Trunk Data Platform: the Open-Source Big Data Distribution Curated by TOSIT
Categories: Big Data, DevOps & SRE, Infrastructure | Tags: DevOps, Hortonworks, Ansible, Hadoop, HBase, Knox, Ranger, Spark, Cloudera, CDP, CDH, Open source, TDP
Ever since Cloudera and Hortonworks merged, the choice of commercial Hadoop distributions for on-prem workloads essentially boils down to CDP Private Cloud. CDP can be seen as the “best of both worlds…
Apr 14, 2022
Blockchain 102: Cryptocurrencies, Wallets and DApps
Categories: Adaltas Summit 2021, Infrastructure | Tags: Cryptography, Infrastructure, Blockchain, Consensus
A lot of people own cryptocurrencies today. But holding some tokens on an exchange does not mean interacting with the blockchain. The assets you trade are only numbers stored inside the exchange’s…
Apr 12, 2022
JS monorepos in prod 7: Continuous Integration and Continuous Deployment with GitHub Actions
Categories: DevOps & SRE, Front End | Tags: CI/CD, Monorepo, Node.js, Unit tests
The value of CI/CD lies in the ability to control and coordinate changes and feature addition in multiple, iterative releases while simultaneously having multiple services being actively developed in…
Apr 6, 2022
Nix package creation: install a not yet supported font
Categories: Hack | Tags: Learning and tutorial, Linux, Packaging, GitOps, NixOS, Open source
The Nix packages collection is large with over 60 000 packages. However, chances are that sometimes the package you need is not available. You must integrate it yourself. I needed some fonts which…
By David WORMS
Mar 29, 2022
Deploy your containerized AI applications with nvidia-docker
Categories: Containers Orchestration, Data Science | Tags: containerd, DevOps, Learning and tutorial, NVIDIA, Docker, Keras, TensorFlow
More and more products and services are taking advantage of the modeling and prediction capabilities of AI. This article presents the nvidia-docker tool for integrating AI (Artificial Intelligence…
Mar 24, 2022
Ansible variables: choosing the right location
Categories: DevOps & SRE | Tags: Infrastructure, Ansible, IaC, YAML
Defining variables for your Ansible playbooks and roles can become challenging as your project grows. Browsing the Ansible documentation, the diversity of Ansible variable locations is confusing, to…
Mar 15, 2022
Apache HBase: RegionServers co-location
Categories: Big Data, Adaltas Summit 2021, Infrastructure | Tags: Ambari, Database, Infrastructure, Tuning, Hadoop, HBase, Big Data, HDP, Storage
RegionServers are the processes that manage the storage and retrieval of data in Apache HBase, the non-relational column-oriented database in Apache Hadoop. It is through their daemons that any CRUD…
Feb 22, 2022
Reliable and reproducible Linux installation with NixOS
Categories: Infrastructure, Learning | Tags: Linux, Packaging, VM, NixOS, TDP
When using an operating system, upgrading packages or installing new ones are common tasks that introduce the risk of affecting the stability of the system. NixOS is a Linux distribution that ensures…
Feb 8, 2022
Nix introduction, main concepts and commands
Categories: Infrastructure, Learning | Tags: Arch Linux, CentOS, Linux, OS X, Packaging, Ubuntu, NixOS, TDP
Nix is a functional package manager for Linux and other Unix systems, making the management of packages more reliable and easy to reproduce. With a traditional package manager, when updating a package…
Feb 1, 2022
Blockchain 101: Blockchains and Consensus Mechanisms
Categories: Adaltas Summit 2021, Infrastructure, Learning | Tags: Cryptography, Infrastructure, Blockchain, Consensus
Cryptocurrencies are booming in 2021, with a market cap moving from 750 to more than 3,000 billion dollars. Let’s face it, this is mainly due to speculation. A lot of people involved do not have a…
Jan 18, 2022
GitOps in practice, deploy Kubernetes applications with ArgoCD
Categories: Containers Orchestration, DevOps & SRE, Adaltas Summit 2021 | Tags: Argo CD, CI/CD, Git, GitOps, IaC, Kubernetes
GitOps is a set of practices to deploy applications using Git. Application definitions, configurations, and connectivity are to be stored in a version control software such as Git. Git then serves as…
Dec 16, 2021
JS monorepos in prod 6: CI/CD, continuous integration and deployment with Travis CI
Categories: DevOps & SRE, Front End | Tags: CI/CD, Monorepo, Node.js, Unit tests
Implementing continuous integration (CI) and continuous deployment (CD) on a monorepo is quite complex due to the diversity of multiple responsibilities between developers and the need to coordinate…
By David WORMS
Dec 6, 2021
Spring 2022 internship - building a Data Lab
Categories: Data Science, Learning | Tags: MongoDB, Spark, Argo CD, Elasticsearch, Internship, Keycloak, Kubernetes, OpenID Connect, PostgreSQL
Job Description: Over the last few years, we developed the ability to use computers to process large amounts of data. The ecosystem evolved over a large offering of tools and libraries and the creation…
By David WORMS
Nov 24, 2021
CSV package for Node.js version 6
Categories: Node.js | Tags: Data Engineering, Refactoring, CSV, File Format, Release and features
Version 6 of the package for Node.js is released along with its sub projects. Here are the latest versions: version , latest version was NPM version , latest version was NPM version , latest version…
By David WORMS
Nov 15, 2021
H2O in practice: a protocol combining AutoML with traditional modeling approaches
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python, XGBoost
H2O comes with a lot of functionalities. The second part of the series H2O in practice proposes a protocol to combine AutoML modeling with traditional modeling and optimization approaches. The objective…
Nov 12, 2021
Internship in Big Data infrastructure with TDP
Categories: Infrastructure, Learning | Tags: Cyber Security, DevOps, Java, Hadoop, IaC, Internship, TDP
Job Description: Big Data and distributed computing are at Adaltas’ core. We support our partners in the deployment, maintenance and optimization of some of France’s largest clusters. Adaltas is also an…
By Daniel HARTY
Oct 25, 2021
Internship in Data Engineering
Categories: Front End, Learning | Tags: Metrics, Monitoring, Hive, Kafka, Delta Lake, Elasticsearch, IaC, Internship, Kubernetes, Streaming
Job Description: Data is a valuable business asset. Some call it the new oil. The data engineer collects, transforms and refines raw data into information that can be used by business analysts and data…
By David WORMS
Oct 25, 2021
Internship in Web Technologies
Categories: Front End, Learning | Tags: DevOps, LDAP, React.js, CI/CD, Docker, GraphQL, IaC, Internship, Kubernetes, Node.js, OAuth2
Job Description: As part of its Big Data activities, Adaltas Academy is an information-sharing platform bringing together articles, training content, and a knowledge base. The users of the platform are…
By David WORMS
Oct 14, 2021
H2O in practice: a Data Scientist feedback
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Automated machine learning (AutoML) platforms are gaining popularity and becoming a new important tool in the data scientists’ toolbox. A few months ago, I introduced H2O, an open-source platform for…
Sep 29, 2021
Adaltas Summit 2021, 2nd edition in Corsica
Categories: Adaltas Summit 2021, Learning | Tags: Ansible, Hadoop, Spark, Azure, Blockchain, Deep Learning, Docker, Terraform, Kubernetes, Node.js
For its second edition, the whole Adaltas crew is gathering in Corsica for a whole week with 2 days dedicated to technology, the 23rd and the 24th of September 2021. After a year and a half of sanitary…
By David WORMS
Sep 21, 2021
Running your Travis CI builds locally with Docker
Categories: DevOps & SRE, Front End | Tags: Bash, Tools, CI/CD, Node.js, Unit tests
Setting up the environment to run the tests on a CI/CD can take a few roundtrips between your host machine and the CI/CD running remotely. For every attempt, you’ll have to commit and publish your…
By David WORMS
Sep 6, 2021
Using Cloudera Deploy to install Cloudera Data Platform (CDP) Private Cloud
Categories: Big Data, Cloud Computing | Tags: Ansible, Cloudera, CDP, Cluster, Data Warehouse, Vagrant, IaC
Following our recent Cloudera Data Platform (CDP) overview, we cover how to deploy CDP Private Cloud on your local infrastructure. It is entirely automated with the Ansible cookbooks published by…
Jul 23, 2021
An overview of Cloudera Data Platform (CDP)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: SDX, Big Data, Cloud, Cloudera, CDP, CDH, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Warehouse
Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated and multifunctional self-service tools in order to analyze and centralize data. It brings security and…
Jul 19, 2021
Modern Python part 3: run a CI pipeline & publish your package to PyPI
Categories: DevOps & SRE | Tags: CI/CD, Git, GitHub, Python, Release and features, Unit tests
To propose a well-maintained and usable Python package to the open-source community or even inside your company, you are expected to accomplish a set of critical steps. First ensure that your code is…
By Faouzi BRAZA
Jun 28, 2021
Modern Python part 2: write unit tests & enforce Git commit conventions
Categories: DevOps & SRE | Tags: Git, pandas, Python, Unit tests
Good software engineering practices always bring a lot of long-term benefits. For example, writing unit tests permits you to maintain large codebases and ensures that a specific piece of your code…
By Faouzi BRAZA
Jun 24, 2021
Modern Python part 1: start a project with pyenv & poetry
Categories: DevOps & SRE | Tags: Git, Python, Release and features, Unit tests
When learning a programming language, the focus is essentially on understanding the syntax, the code style, and the underlying concepts. With time, you become sufficiently comfortable with the…
By Faouzi BRAZA
Jun 9, 2021
Desacralizing the Linux overlay filesystem in Docker
Categories: Containers Orchestration, Infrastructure | Tags: DevOps, File system, Linux, Docker
Overlay filesystems (also called union filesystems) are a fundamental technology in Docker to create images and containers. They allow creating a union of directories to create a filesystem. Multiple…
By David WORMS
Jun 3, 2021
Self-Paced training from Databricks: a guide to self-enablement on Big Data & AI
Categories: Data Engineering, Learning | Tags: Cloud, Data Lake, Databricks, Delta Lake, MLflow
Self-paced trainings are proposed by Databricks inside their Academy program. The price is $2000 USD for unlimited access to the training courses for a period of 1 year, but also free for customers…
May 26, 2021
JS monorepos in prod 5: merging Git repositories and preserve commit history
Categories: DevOps & SRE, Node.js | Tags: Bash, DevOps, Packaging, Git, GitHub, GitOps, JavaScript, Monorepo
At Adaltas, we maintain several open-source Node.js projects organized as Git monorepos and published on NPM. We shared our experience working with Lerna monorepos in a set of articles: Part…
May 21, 2021
Find your way into data related Microsoft Azure certifications
Categories: Cloud Computing, Data Engineering | Tags: Data Governance, Azure, Data Science
Microsoft Azure has certification paths for many technical job roles such as developer, Data Engineer, Data Scientist and solution architect among others. Each of these certifications consists of…
Apr 14, 2021
Bridging the DBnomics Swagger/OpenAPI schema with GraphQL
Categories: DevOps & SRE, Front End | Tags: Data Engineering, JAMstack, GraphQL, JavaScript, Node.js, REST, Schema
While writing a long and tedious document today, I came across DBnomics, an open platform federating economic datasets. Browsing its website and APIs, I found their OpenAPI schema (aka Swagger…
By David WORMS
Apr 8, 2021
Apache Liminal: when MLOps meets GitOps
Categories: Big Data, Containers Orchestration, Data Engineering, Data Science, Tech Radar | Tags: Data Engineering, CI/CD, Data Science, Deep Learning, Deployment, Docker, GitOps, Kubernetes, Machine Learning, MLOps, Open source, Python, TensorFlow
Apache Liminal is open-source software which proposes a solution to deploy end-to-end Machine Learning pipelines. Indeed, it makes it possible to centralize all the steps needed to construct Machine Learning…
Mar 31, 2021
Storage size and generation time in popular file formats
Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)
Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…
Mar 22, 2021
TensorFlow Extended (TFX): the components and their functionalities
Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, Data Engineering, Pipeline, CI/CD, Data Science, Deep Learning, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow
Putting Machine Learning (ML) and Deep Learning (DL) models in production certainly is a difficult task. It has been recognized as more failure-prone and time consuming than the modeling itself, yet…
Mar 5, 2021
JS monorepos in prod 4: unit testing with Mocha and Should.js
Categories: DevOps & SRE, Front End | Tags: Automation, CI/CD, Git, GitOps, Monorepo, Node.js, Unit tests
Unit testing is essential for every long-term project and allows you to break down functionalities of your code into isolated testable units. Indeed, the main goal of a unit test is to verify if an…
By David WORMS
Feb 25, 2021
JS monorepos in prod 3: commit enforcement and changelog generation
Categories: DevOps & SRE, Front End | Tags: CI/CD, Git, JavaScript, Monorepo, Node.js, Release and features, Unit tests
Conventional Commits introduces a structured format for commit messages. It standardizes the messages among all the contributors. This makes them more readable and easy to automate. It simplifies the…
By David WORMS
Feb 2, 2021
JS monorepos in prod 2: project versioning and publishing
Categories: DevOps & SRE, Front End | Tags: CI/CD, Git, GitOps, JavaScript, Monorepo, Node.js, Release and features, Unit tests
One great advantage of a monorepo is to maintain coherent versions between packages and to automate the version creation and the publication of packages. This article covers the versioning and…
By David WORMS
Jan 11, 2021
JS monorepos in prod 1: project initialization
Categories: DevOps & SRE, Front End | Tags: Git, GitOps, JavaScript, Monorepo, Node.js, Release and features
Every project journey begins with the step of initialization. When your overall project is composed of multiple projects, it is tempting to create one Git repository per project. In Node.js, a project…
By David WORMS
Jan 5, 2021
Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin
Categories: Big Data, Infrastructure | Tags: Maven, Hadoop, HBase, Hive, Spark, Git, Release and features, TDP, Unit tests
The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3 compatible object storages are growing in popularity, HDFS and YARN…
Dec 18, 2020
Faster model development with H2O AutoML and Flow
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate a…
Dec 10, 2020
OAuth2 and OpenID Connect for microservices and public applications (Part 2)
Categories: Containers Orchestration, Cyber Security | Tags: LDAP, Micro Services, CNCF, JavaScript Object Notation (JSON), OAuth2, OpenID Connect
Using OAuth2 and OpenID Connect, it is important to understand how the authorization flow is taking place, who shall call the Authorization Server, how to store the tokens. Moreover, microservices and…
By David WORMS
Nov 20, 2020
OAuth2 and OpenID Connect, a gentle and working introduction (Part 1)
Categories: Containers Orchestration, Cyber Security | Tags: Go Lang, JAMstack, LDAP, CNCF, Kubernetes, OAuth2, OpenID Connect
Understanding OAuth2, OpenID and OpenID Connect (OIDC), how they relate, how the communications are established, and how to architect your application with the given access, refresh and id tokens…
By David WORMS
Nov 17, 2020
Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, Hadoop, HDFS, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2
As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
Nov 5, 2020
Rebuilding HDP Hive: patch, test and build
Categories: Big Data, Infrastructure | Tags: Maven, Java, Hive, Git, GitHub, Release and features, TDP, Unit tests
The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity…
Oct 6, 2020
Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning proposed to the Data + AI Summit (formerly known as Spark+AI) has been accepted. The summit will take place online the 17-19th November…
Sep 30, 2020
Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Introduction to Databricks Community Edition and MLflow: Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…
Sep 10, 2020
Version your datasets with Data Version Control (DVC) and Git
Categories: Data Science, DevOps & SRE | Tags: DevOps, Infrastructure, Operation, Git, GitOps, SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…
Sep 3, 2020
Plugin architecture in JavaScript and Node.js with Plug and Play
Categories: Front End, Node.js | Tags: Asynchronous, DevOps, Programming, Agile, JavaScript, Open source, Release and features
Plug and Play helps library and application authors to introduce a plugin architecture into their code. It simplifies complex code execution with well-defined interception points, also called hooks…
By David WORMS
Aug 28, 2020
Installing Hadoop from source: build, patch and run
Categories: Big Data, Infrastructure | Tags: Maven, Java, LXD, Hadoop, HDFS, Docker, TDP, Unit tests
Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more and CDH is now CDP. MapR has been acquired by HP and IBM BigInsights…
Aug 4, 2020
Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data Engineering, Data structures, Database, Hadoop, HDFS, Hive, Big Data, Data Analytics, Data Lake, Data lakehouse, Data Warehouse
Introduction: Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big data technology (Hadoop, Spark,…). This explains the explosion of the data volume and the…
By Aida NGOM
Jul 31, 2020
Comparison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers
In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…
By Aida NGOM
Jul 23, 2020
Automate a Spark routine workflow from GitLab to GCP
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Learning and tutorial, Airflow, Spark, CI/CD, GitLab, GitOps, GCP, Terraform
A workflow consists of automating a succession of tasks to be carried out without human intervention. It is an important and widespread concept which particularly applies to operational environments…
Jun 16, 2020
Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python
During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…
May 21, 2020
Introducing Apache Airflow on AWS
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: PySpark, Learning and tutorial, Airflow, Oozie, Spark, AWS, Docker, Python
Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…
May 5, 2020
Expose a Rook-based Ceph cluster outside of Kubernetes
Categories: Containers Orchestration | Tags: Debug, Rook, Ceph, Docker, Kubernetes
We recently deployed an LXD-based Hadoop cluster and we wanted to be able to apply size quotas on some filesystems (i.e., service logs, user homes). Quota is a built-in feature of the Linux kernel used…
Apr 16, 2020
Snowflake, the Data Warehouse for the Cloud, introduction and tutorial
Categories: Business Intelligence, Cloud Computing | Tags: Cloud, Data Lake, Data Science, Data Warehouse, Snowflake
Snowflake is a SaaS-based data-warehousing platform that centralizes, in the cloud, the storage and processing of structured and semi-structured data. The increasing generation of data produced over…
Apr 7, 2020
Optimization of Spark applications in Hadoop YARN
Categories: Data Engineering, Learning | Tags: Tuning, Hadoop, Spark, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
Mar 30, 2020
MLflow tutorial: an open source Machine Learning (ML) platform
Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn
Introduction and principles of MLflow With increasingly cheaper computing power and storage and at the same time increasing data collection in all walks of life, many companies integrated Data Science…
Mar 23, 2020
Introduction to Ludwig and how to deploy a Deep Learning model via Flask
Categories: Data Science, Tech Radar | Tags: Learning and tutorial, Deep Learning, Ludwig Deep Learning Toolbox, Machine Learning, Python
Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…
Mar 2, 2020
Install and debug Kubernetes inside LXD
Categories: Containers Orchestration | Tags: Debug, Linux, LXD, Docker, Kubernetes, Node
We recently deployed a Kubernetes cluster with the need to maintain cluster isolation on our bare metal nodes across our infrastructure. We knew that Virtual Machines would provide the required…
Feb 4, 2020
Policy enforcing with Open Policy Agent
Categories: Cyber Security, Data Governance | Tags: Kafka, Ranger, Authorization, Cloud, Kubernetes, REST, SSL/TLS
Open Policy Agent is an open-source multi-purpose policy engine. Its main goal is to unify policy enforcement across the cloud native stack. The project was created by Styra and it is currently…
Jan 22, 2020
Cloudera CDP and Cloud migration of your Data Warehouse
Categories: Big Data, Cloud Computing | Tags: Azure, Cloudera, Data Hub, Data Lake, Data Warehouse
While one of our customers is anticipating a move to the Cloud and with the recent announcement of Cloudera CDP availability mid-September during the Strata conference, it seems like the appropriate…
By David WORMS
Dec 16, 2019
Logstash pipelines remote configuration and self-indexing
Categories: Data Engineering, Infrastructure | Tags: Docker, Elasticsearch, Kibana, Logstash, Log4j
Logstash is a powerful data collection engine that integrates into the Elastic Stack (Elasticsearch - Logstash - Kibana). The goal of this article is to show you how to deploy a fully managed Logstash…
Dec 13, 2019
Should you move your Big Data and Data Lake to the Cloud
Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Azure, Cloud, CDP, Databricks, GCP
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New York, a general focus was put on moving customer’s Big…
Dec 9, 2019
Hadoop Ozone part 3: advanced replication strategy with Copyset
Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes, Node
Hadoop Ozone provides a way of setting a ReplicationType for every write you make on the cluster. Right now, HDFS and Ratis are supported, but more advanced replication strategies can be achieved. In this…
Dec 3, 2019
Hadoop Ozone part 2: tutorial and getting started of its features
Categories: Infrastructure | Tags: CLI, Learning and tutorial, HDFS, Ozone, Amazon S3, Cluster, REST
The releases of Hadoop Ozone come with a handy docker-compose file to try out Ozone. The instructions below provide details on how to use it. You can also use the Katacoda training sandbox which…
Dec 3, 2019
Hadoop Ozone part 1: an introduction of the new filesystem
Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes
Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…
Dec 3, 2019
InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS
Categories: Big Data, Containers Orchestration | Tags: DevOps, LXD, Hadoop, Kafka, Spark, Ceph, Internship, Kubernetes, NoSQL
Context The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PaaS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…
By David WORMS
Nov 26, 2019
Internship Data Science & Data Engineer - ML in production and streaming data ingestion
Categories: Data Engineering, Data Science | Tags: DevOps, Flink, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python
Context The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…
By David WORMS
Nov 26, 2019
Insert rows in BigQuery tables with complex columns
Categories: Cloud Computing, Data Engineering | Tags: GCP, BigQuery, Schema, SQL
Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data with several features available. Out of all those features, let’s talk about the support of Struct…
Nov 22, 2019
Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod
Categories: Data Science | Tags: GPU, Deep Learning, Horovod, Keras, TensorFlow
The Deep Learning training process can be greatly sped up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…
Nov 15, 2019
Kerberos and Spnego authentication on Windows with Firefox
Categories: Cyber Security | Tags: Firefox, HTTP, FreeIPA, Kerberos
In Greek mythology, Kerberos, also called Cerberus, guards the gates of the Underworld to prevent the dead from leaving. He is commonly described as a three-headed dog with a serpent’s tail, mane of…
By David WORMS
Nov 4, 2019
Notes on the Cloudera Open Source licensing model
Categories: Big Data | Tags: CDSW, License, Cloudera Manager, Open source
Following the publication of its Open Source licensing strategy on July 10, 2019 in an article called “our Commitment to Open Source Software”, Cloudera broadcasted a webinar yesterday October 2…
By David WORMS
Oct 25, 2019
Innovation, project vs product culture in Data Science
Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum
Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…
By David WORMS
Oct 8, 2019
Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
Sep 30, 2019
Rook with Ceph doesn't provision my Persistent Volume Claims!
Categories: DevOps & SRE | Tags: PVC, Linux, Rook, Ubuntu, Ceph, Cluster, Internship, Kubernetes
Ceph installation inside Kubernetes can be provisioned using Rook. Currently doing an internship at Adaltas, I was in charge of participating in the setup of a Kubernetes (k8s) cluster. To avoid…
Sep 9, 2019
Users and RBAC authorizations in Kubernetes
Categories: Containers Orchestration, Data Governance | Tags: Cyber Security, RBAC, Authentication, Authorization, Kubernetes, SSL/TLS
Having your Kubernetes cluster up and running is just the start of your journey and you now need to operate. To secure its access, user identities must be declared along with authentication and…
Aug 7, 2019
TensorFlow installation on Docker
Categories: Containers Orchestration, Data Science, Learning | Tags: CPU, Jupyter, Linux, AI, Deep Learning, Docker, TensorFlow
TensorFlow is an Open Source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations; edges represent N-dimensional data array…
Aug 5, 2019
Running Apache Hive 3, new features and tips and tricks
Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: JDBC, LLAP, Druid, Hadoop, Hive, Kafka, Release and features
Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It is available since…
Jul 25, 2019
Auto-scaling Druid with Kubernetes
Categories: Big Data, Business Intelligence, Containers Orchestration | Tags: Helm, Metrics, OLAP, Operation, Container Orchestration, EC2, Druid, Cloud, CNCF, Data Analytics, Kubernetes, Prometheus, Python
Apache Druid is an open-source analytics data store which could leverage the auto-scaling abilities of Kubernetes due to its distributed nature and its reliance on memory. I was inspired by the talk…
Jul 16, 2019
Mount Aladdin eToken in Firefox on Archlinux
Categories: Hack | Tags: Arch Linux, Cyber Security, Firefox, Security, Smart card, 2FA
Given you’re on Archlinux and have an Aladdin eToken, let’s see how we can mount it in Firefox for web authentication. An Aladdin eToken is a cryptographic device (token, smart card) that stores…
Jul 12, 2019
Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming
Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…
Jun 27, 2019
Google Cloud Summit Paris Notes
Categories: Events | Tags: AWS, Azure, Cloud, GCP, Kubernetes, On-premises
Google organized its yearly Summit edition 2019 in Paris on the 18th of June. This year’s event was the biggest yet in Paris, which reflects Google’s commitment to position itself in the French market…
Jun 26, 2019
Druid and Hive integration
Categories: Big Data, Business Intelligence, Tech Radar | Tags: LLAP, OLAP, Druid, Hive, Data Analytics, SQL
This article covers the integration between Hive Interactive (LLAP) and Druid. One can see it as a complement to the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description…
Jun 17, 2019
Spark Streaming part 3: DevOps, tools and tests for Spark applications
Categories: Big Data, Data Engineering, DevOps & SRE | Tags: DevOps, Learning and tutorial, Spark, Apache Spark Streaming
Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…
May 31, 2019
Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop
Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Python, Streaming
Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…
May 28, 2019
Spark Streaming part 1: build data pipelines with Spark Structured Streaming
Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming
Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…
Apr 18, 2019
Recover from an EFI failure on a dedicated server
Categories: Hack | Tags: Infrastructure, Linux, Cloud
A few weeks ago, before upgrading our Ubuntu systems, we sort of messed around with our EFI partitions and the impacted servers never came back online on system reboot after the upgrade. Provisioning…
Apr 16, 2019
First Class Functions in Python
Categories: Hack, Learning | Tags: Programming, Python
I recently watched a talk by Dave Cheney about first class functions in Go. Python supports first class functions too, so can we use them in the same ways? Absolutely. I have been using Python for a…
Apr 15, 2019
Gatsby.js, React and GraphQL for documentation websites
Categories: Adaltas Summit 2018, Front End | Tags: Gatsby, HTTP, JAMstack, React.js, SEO, API, GitOps, GraphQL, JavaScript, Markdown, Node.js
In the last few months, I have started to redesign some of our Open Source project websites. This includes the websites of the Node.js CSV project, the Node.js HBase client and the Nikita project, our…
By David WORMS
Apr 1, 2019
Publish Spark SQL DataFrame and RDD with Spark Thrift Server
Categories: Data Engineering | Tags: Thrift, JDBC, Hadoop, Hive, Spark, SQL
The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports…
Mar 25, 2019
Multihoming on Hadoop
Categories: Infrastructure | Tags: Hadoop, HDFS, Kerberos, Network
Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…
Mar 5, 2019
Introduction to Cloudera Data Science Workbench
Categories: Data Science | Tags: Azure, Cloudera, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook
Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…
Feb 28, 2019
Apache Knox made easy!
Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: LDAP, Active Directory, Knox, Ranger, Kerberos, REST
Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview Apache Knox is an application gateway for interacting in a…
Feb 4, 2019
Installing Kubernetes on CentOS 7
Categories: Containers Orchestration | Tags: CentOS, cgroups, DevOps, Infrastructure, Namespaces, Red Hat, VM, Ceph, CNCF, Docker, Kubernetes
This article explains how to install a Kubernetes cluster. I will dive into what each step does so you can build a thorough understanding of what is going on. This article is based on my talk from the…
Jan 29, 2019
Self-sovereign identities with verifiable claims
Categories: Data Governance | Tags: Authentication, Blockchain, Cloud, IAM, Ledger
Towards a trusted, personal, persistent, and portable digital identity for all. Digital identity issues Self-sovereign identities are an attempt to solve a couple of issues. The first is the…
By Nabil MELLAL
Jan 23, 2019
Applying Deep Reinforcement Learning to Poker
Categories: Data Science | Tags: Algorithm, Gaming, Q-learning, Deep Learning, Machine Learning, Neural Network, Python
We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…
Jan 9, 2019
LXD: The Missing Piece
Categories: Containers Orchestration | Tags: CPU, Linux, LXD, VM, Docker, Kubernetes
LXD stands for Linux Container Daemon. Yet another container technology. But LXD is very different. It stands apart from the pack. It is not necessarily better nor much faster nor more secure! But it…
Dec 28, 2018
Monitoring a production Hadoop cluster with Kubernetes
Categories: DevOps & SRE | Tags: Thrift, Grafana, Shinken, Hadoop, Knox, Cluster, Docker, Elasticsearch, Kubernetes, Node, Node.js, Prometheus, Python
Monitoring a production grade Hadoop cluster is a real challenge and needs to be constantly evolving. The software we use today is based on Nagios. Very efficient when it comes to the simplest…
Dec 21, 2018
CodaLab - Data Science competitions
Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python
CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…
Dec 17, 2018
Native modules for Node.js with N-API
Categories: Adaltas Summit 2018, Front End | Tags: C++, NPM, JavaScript, Kerberos, Node.js
How to create native modules for Node.js? How to use N-API, the future of native addons development? Writing C/C++ addons is a useful and powerful feature of the Node.js runtime. Let’s explore them…
Dec 12, 2018
Microsoft introduces Cloud Native Application Bundles
Categories: Containers Orchestration | Tags: CLI, Helm, Packaging, Docker, Kubernetes
At DockerCon EU 2018 in Barcelona, Matt Butcher, Principal Engineer at Microsoft and inventor of Helm, introduced CNAB, Cloud Native Application Bundles, a packaging format for distributed…
Dec 4, 2018
Jumbo, the Hadoop cluster bootstrapper
Categories: Infrastructure | Tags: Ambari, Automation, Ansible, Cluster, Vagrant, HDP, REST
Introducing Jumbo, a Hadoop cluster bootstrapper for developers. Jumbo helps you deploy development environments for Big Data technologies. It takes a few minutes to get a custom virtualized Hadoop…
Nov 29, 2018
Main advantages of GraphQL as an alternative to REST
Categories: Front End | Tags: gRPC, API, GraphQL, JavaScript Object Notation (JSON), Node.js, Registry, REST
GraphQL is based on a simple idea: moving the assembly of a request from the server to the client. The client sees the overall strongly-typed schema instead of multiple REST endpoints and he builds…
By David WORMS
Nov 27, 2018
Node.js CSV version 4 - re-writing and performance
Categories: Node.js | Tags: CLI, Data Engineering, Refactoring, CSV, Release and features
Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete rewrite of the project focusing on performance. It also comes with new functionalities as well as…
By David WORMS
Nov 19, 2018
Hadoop cluster takeover with Apache Ambari
Categories: Big Data, DevOps & SRE, Adaltas Summit 2018 | Tags: Ambari, Automation, iptables, Nikita, Systemd, Cluster, HDP, Kerberos, Node, Node.js, REST
We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari; we called this the Ambari Takeover. This is a risky process and we will detail why this…
Nov 15, 2018
Managing User Identities on Big Data Clusters
Categories: Cyber Security, Data Governance | Tags: LDAP, Active Directory, Ansible, FreeIPA, IAM, Kerberos
Securing a Big Data Cluster involves integrating or deploying specific services to store users. Some users are cluster-specific while others are available across all clusters. It is not always easy to…
By David WORMS
Nov 8, 2018
Apache Flink: past, present and future
Categories: Data Engineering | Tags: Pipeline, Flink, Kubernetes, Machine Learning, SQL, Streaming
Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…
Nov 5, 2018
One week to discuss technology in a Moroccan riad
Categories: Adaltas Summit 2018, Learning | Tags: CDSW, Gatsby, React.js, Flink, Hadoop, Knox, Data Science, Deep Learning, Kubernetes, Node.js
Adaltas is organising its first conference this year, from the 22nd to the 26th of October. On the agenda for these 5 days of conference: discuss technology in one of the most beautiful riads of Marrakech. Mix the…
By David WORMS
Oct 11, 2018
Nvidia and AI on the edge
Categories: Data Science | Tags: Caffe, GPU, NVIDIA, AI, Deep Learning, Edge computing, Keras, PyTorch, TensorFlow
In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…
By Yliess HATI
Oct 10, 2018
Deploying a secured Flink cluster on Kubernetes
Categories: Big Data | Tags: Encryption, Flink, HDFS, Kafka, Elasticsearch, Kerberos, SSL/TLS
When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…
By David WORMS
Oct 8, 2018
KVM machines for Vagrant on Archlinux
Categories: DevOps & SRE | Tags: Arch Linux, KVM, Linux, Virtualization, VM, Vagrant
Vagrant supports different providers to manage virtualization. In a Linux environment, you can dramatically improve VM performance by using the libvirt provider and the KVM hypervisor. This tutorial…
Sep 19, 2018
Lando: Deep Learning used to summarize conversations
Categories: Data Science, Learning | Tags: Micro Services, Open API, Deep Learning, Internship, Kubernetes, Neural Network, Node.js
Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning techniques to summarize contents. It allows users to…
By Yliess HATI
Sep 18, 2018
Clusters and workloads migration from Hadoop 2 to Hadoop 3
Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Rolling Upgrade, HDFS, Spark, YARN, Docker
Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…
Jul 25, 2018
Deep learning on YARN: running Tensorflow and friends on Hadoop cluster
Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost
With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…
Jul 24, 2018
Curing the Kafka blindness with the UI manager
Categories: Big Data | Tags: Ambari, Hortonworks, HDF, JMX, UI, Kafka, Ranger, HDP
Today it’s really difficult for developers, operators and managers to visualize and monitor what happens in a Kafka cluster. This article covers a new graphical interface to oversee Kafka. It was…
Jun 20, 2018
A CoreOS development cluster with Vagrant and VirtualBox
Categories: Hack, Infrastructure | Tags: Arch Linux, CoreOS, Linux, VirtualBox, etcd, Vagrant
Following CoreOS’s instructions on how to set up a development environment in VirtualBox did not work out well for me. Here are the steps I followed to get Container Linux up and running with Vagrant…
Jun 20, 2018
Guide to Keybase encrypted directories
Categories: Cyber Security, Hack | Tags: Cryptography, Encryption, File system, Keybase, PGP, Authorization
This is a guide to using Keybase’s encrypted directories to store and share files. Keybase is a group, file and chat application whose goal is to bring public key crypto based on PGP to everyone in…
Jun 18, 2018
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
Apache Hadoop YARN 3.0 - State of the union
Categories: Big Data, DataWorks Summit 2018 | Tags: GPU, Hortonworks, Hadoop, HDFS, MapReduce, YARN, Cloudera, Data Science, Docker, Release and features
This article covers the “Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the Dataworks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…
May 31, 2018
Accelerating query processing with materialized views in Apache Hive
Categories: Business Intelligence, DataWorks Summit 2018 | Tags: Calcite, OLAP, Druid, Hive, Release and features, SQL
The new materialized view feature is coming in Apache Hive 3.0. Jesus Camacho Rodriguez from Hortonworks held a talk “Accelerating query processing with materialized views in Apache Hive” about it…
May 31, 2018
YARN and GPU Distribution for Machine Learning
Categories: Data Science, DataWorks Summit 2018 | Tags: GPU, YARN, Machine Learning, Neural Network, Storage
This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…
May 30, 2018
TensorFlow on Spark 2.3: The Best of Both Worlds
Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow
The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new…
By Yliess HATI
May 29, 2018
Apache Metron in the Real World
Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, NiFi, Solr, Storm, pcap, RDBMS, HDFS, Kafka, Metron, Spark, Data Science, Elasticsearch, SQL
Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…
May 29, 2018
Running Enterprise Workloads in the Cloud with Cloudbreak
Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, Operation, Hadoop, AWS, Azure, GCP, HDP, OpenStack
This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…
May 28, 2018
Omid: Scalable and highly available transaction processing for Apache Phoenix
Categories: Big Data, DataWorks Summit 2018 | Tags: Omid, Phoenix, Transaction, ACID, HBase, SQL
Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. Credits to Ohad Shacham for his talk and his work for Apache…
May 24, 2018
Apache Beam: a unified programming model for data processing pipelines
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Pipeline, Flink, Spark
In this article, we will review the concepts, the history and the future of Apache Beam, which may well become the new standard for data processing pipelines definition. At Dataworks Summit 2018 in…
May 24, 2018
Present and future of Hadoop workflow scheduling: Oozie 5.x
Categories: Big Data, DataWorks Summit 2018 | Tags: Hadoop, Hive, Oozie, Sqoop, CDH, HDP, REST
During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…
May 23, 2018
What's new in Apache Spark 2.3?
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, PySpark, Tuning, ORC, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Streaming
Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…
May 23, 2018
Essential questions about Time Series
Categories: Big Data | Tags: Grafana, Druid, HBase, Hive, ORC, Data Science, Elasticsearch, IOT
Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles… IoT and Big Data go hand in hand. We…
By David WORMS
Mar 18, 2018
Execute Python in an Oozie workflow
Categories: Data Engineering | Tags: Oozie, Elasticsearch, Python, REST
Oozie workflows allow you to use multiple actions to execute code. However, doing so with Python can be a bit tricky. Let’s see how to do that. I’ve recently designed a workflow that would interact…
Mar 6, 2018
Publishing guidelines
Categories: DevOps & SRE | Tags: Arch Linux, KVM, VM, Vagrant, Markdown
This is as much a set of guidelines targeting everyone publishing content on the web as rules for reviewers to ensure no validation is forgotten before submitting for publication. It mostly targets…
By David WORMS
Feb 28, 2018
Ambari - How to blueprint
Categories: Big Data, DevOps & SRE | Tags: Ambari, Automation, DevOps, Operation, Ranger, REST
As infrastructure engineers at Adaltas, we deploy Hadoop clusters. A lot of them. Let’s see how to automate this process with REST requests. While really handy for deploying one or two clusters, the…
Jan 17, 2018
Notes after Katacoda Training on Kubernetes Container Orchestration
Categories: Containers Orchestration, Learning | Tags: Helm, Ingress, Kubeadm, CNI, Micro Services, Minikube, Kubernetes
A few weeks ago, I dedicated two days to follow the tutorials available on Katacoda, the interactive learning platform for Kubernetes or any other container orchestration platform. I’m sharing my…
By David WORMS
Dec 14, 2017
Scaling massive, real-time data pipelines with Go
Categories: Open Source Summit Europe 2017, Learning | Tags: Algorithm, Data structures, Go Lang, Pipeline, Protocols, Network
Last week at the Open Source Summit in Prague, Jean de Klerk held a talk called Scaling massive, real-time data pipelines with Go. This article goes over the main points of the talk, detailing the…
Nov 21, 2017
Mesos Introduction
Categories: Containers Orchestration, Open Source Summit Europe 2017 | Tags: Mesos, GPU, Container Orchestration, CUDA, Data Science, Docker
Apache Mesos is an open source cluster management project designed to implement and optimize distributed systems. Mesos enables the management and sharing of resources in a fine and dynamic way…
Nov 15, 2017
Micro Services
Categories: Cloud Computing, Containers Orchestration, Open Source Summit Europe 2017 | Tags: Mesos, DNS, Encryption, gRPC, Istio, Linkerd, Micro Services, MITM, Service Mesh, CNCF, Kubernetes, Proxy, SPOF, SSL/TLS
Back in the day, applications were monolithic and we could use an IP address to access a service. With virtual machines (VM), multiple hosts started to appear on the same machine with multiple apps…
By David WORMS
Nov 14, 2017
Lightweight containerization with Tupperware
Categories: Containers Orchestration, Open Source Summit Europe 2017, Infrastructure | Tags: Btrfs, LXD, Red Hat, Systemd, Zookeeper, Cloud, Consensus
In this article, I will present the lightweight containerization set up by Facebook, called Tupperware. What is Tupperware? Tupperware is a homemade framework written and used internally at Facebook…
Nov 3, 2017
Multi-Repo, Multi-Node Gating at Massive Scale
Categories: Cloud Computing, DevOps & SRE, Open Source Summit Europe 2017 | Tags: Infrastructure, Jenkins, Red Hat, Zuul, Ansible, CI/CD, OpenStack
This is a recap and personal review of Monty Taylor's presentation of OpenStack's Continuous Integration tool Zuul at the Open Source Summit 2017 in Prague (not to be confused with Netflix's Zuul project…
Oct 28, 2017
Apache Thrift vs REST
Categories: DevOps & SRE, Open Source Summit Europe 2017 | Tags: Thrift, gRPC, HTTP, JavaScript Object Notation (JSON), REST
Adaltas recently attended the Open Source Summit Europe 2017 in Prague. I had the opportunity to follow a presentation made by Randy Abernethy and Jens Geyer of RM-X, a cloud native consulting company…
Oct 28, 2017
Kubernetes Storage Primitives for Stateful Workloads
Categories: Cloud Computing, Containers Orchestration, Open Source Summit Europe 2017 | Tags: Container Storage Interface (CSI), PVC, Azure, Docker, GCE, Kubernetes, Storage
This article is based on the presentation "Introduction to Kubernetes Storage Primitives for Stateful Workloads" from the OSS Convention Prague 2017 by the {Code} team. So, let's start, what is…
Oct 28, 2017
Nobody* puts Java in a Container
Categories: Containers Orchestration, Open Source Summit Europe 2017, Infrastructure | Tags: cgroups, Java, JRE, JVM, Namespaces, Docker
This talk was about the issues of putting Java in a container and how, in its latest version, the JDK is now more aware of the container it is running in. The presentation is led by Joerg Schad…
Oct 28, 2017
From Dockerfile to Ansible Containers
Categories: Containers Orchestration, DevOps & SRE, Open Source Summit Europe 2017 | Tags: pip, Shell, Ansible, Docker, Docker Compose, YAML
This talk was an introduction to the Dockerfile format and to the Ansible Container tool, followed by a comparison of both. It was held by Tomas Tomecek from Red Hat's containerization team. The Dockerfile…
Oct 25, 2017
Kubernetes 1.8
Categories: Containers Orchestration, Open Source Summit Europe 2017 | Tags: containerd, CRD, RBAC, Kubernetes, Network, OCI, Release and features
The 1.8 release of Kubernetes brings a lot of new things. With 2500+ pull requests, 2000+ commits, 400+ committers, Kubernetes added 39 new features in this version. This is the richest release in terms…
Oct 24, 2017
Yahoo's Vespa Engine
Categories: Tech Radar | Tags: Database, Tools, Elasticsearch, Search Engine
Vespa is Yahoo's fully autonomous and self-sufficient big data processing and serving engine. It aims at serving results of queries on huge amounts of data in real time. An example of this would be…
Oct 16, 2017
Cloudera Sessions Paris 2017
Categories: Big Data, Events | Tags: Altus, CDSW, SDX, EC2, Azure, Cloudera, CDH, Data Science, PaaS
Adaltas was at the Cloudera Sessions on October 5, where Cloudera showcased their new products and offerings. Below you'll find a summary of what we witnessed. Note: the information was aggregated in…
Oct 16, 2017
MariaDB integration with Hadoop
Categories: Infrastructure | Tags: Database, HA, MariaDB, Hadoop, Hive
During a workshop with one of our customers, Adaltas identified a potential risk in using MariaDB's High Availability (HA) strategy. Since the customer selected Cloudera's CDH 5 distribution, the…
By David WORMS
Jul 31, 2017
Managing authorizations with Apache Sentry
Categories: Data Governance | Tags: Hue, Database, LDAP, Nikita, Sentry, Ansible, CDH, Deployment
Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster. With this article, we will show you how we are using Apache Sentry at…
By Axel JACQIN
Jul 24, 2017
Exposing Kafka on two different networks
Categories: Infrastructure | Tags: Cyber Security, VLAN, Kafka, Cloudera, CDH, Network
A Big Data setup usually requires you to have multiple network interfaces. Let's see how to set up Kafka on more than one of them. Kafka is an open-source stream processing software platform system…
Jul 22, 2017
Oracle DB synchronization to Hadoop with CDC
Categories: Data Engineering | Tags: CDC, GoldenGate, Oracle, Hive, Sqoop, Data Warehouse
This note is the result of a discussion about the synchronization of data written in a database to a warehouse stored in Hadoop. Thanks to Claude Daub from GFI who wrote it and who authorizes us to…
By David WORMS
Jul 13, 2017
Change Ambari's topbar color
Categories: Big Data, Hack | Tags: Ambari, Front-end
We recently had a client that has multiple environments (Production, Integration, Testing, …) running on HDP and managed using one Ambari instance per cluster. One of the questions that came up was…
Jul 9, 2017
MiNiFi: Data at Scales & the Values of Starting Small
Categories: Big Data, DevOps & SRE, Infrastructure | Tags: MiNiFi, NiFi, C++, HDF, Cloudera, HDP, IOT
This conference briefly presented Apache NiFi and explained where MiNiFi came from: basically it's a minimal NiFi agent to deploy on small devices to bring data to a cluster's NiFi pipeline (e.g. IoT…
Jul 8, 2017
Advanced multi-tenant Hadoop and Zookeeper protection
Categories: Big Data, Infrastructure | Tags: DoS, iptables, Operation, Scalability, Zookeeper, Clustering, Consensus
Zookeeper is a critical component to Hadoop's high availability operation. The latter protects itself by limiting the maximum number of connections (maxConns = 400). However, Zookeeper does not protect…
Jul 5, 2017
HDP cluster monitoring
Categories: Big Data, DevOps & SRE, Infrastructure | Tags: Alert, Ambari, Metrics, Monitoring, HDP, REST
With the current growth of Big Data technologies, more and more companies are building their own clusters in the hope of getting some value from their data. One main concern while building these infrastructures…
Jul 5, 2017
Hive Metastore HA with DBTokenStore: Failed to initialize master key
Categories: Big Data, DevOps & SRE | Tags: Infrastructure, Hive, Bug
This article describes my little adventure around a startup error with the Hive Metastore. It shall be reproducible with any secure installation, meaning with Kerberos, with high availability enabled…
By David WORMS
Jul 21, 2016
EclairJS - Putting a Spark in Web Apps
Categories: Data Engineering, Front End | Tags: Jupyter, Spark, JavaScript
Presentation by David Fallside from IBM, images extracted from the presentation. Introduction Web Apps development has moved from Java to NodeJS and JavaScript. It provides a simple and rich…
By David WORMS
Jul 17, 2016
Apache Apex with Apache SAMOA
Categories: Data Science, Events, Tech Radar | Tags: Apex, Samoa, Storm, Tools, Flink, Hadoop, Machine Learning
Traditional Machine Learning Batch Oriented Supervised - most common Training and Scoring One time model building Data set Training: Model building Holdout: Parameter tuning Test: Accuracy Online…
Jul 17, 2016
Apache Apex: next gen Big Data analytics
Categories: Data Science, Events, Tech Radar | Tags: Apex, Storm, Tools, Flink, Hadoop, Kafka, Data Science, Machine Learning
Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallel…
Jul 17, 2016
Get in control of your workflows with Apache Airflow
Categories: Big Data, Tech Radar | Tags: DevOps, Airflow, Cloud, Python
Below is a compilation of my notes taken during the presentation of Apache Airflow by Christian Trebing from BlueYonder. Introduction Use case: how to handle data coming in regularly from customers…
Jul 17, 2016
Hive, Calcite and Druid
Categories: Big Data | Tags: Business intelligence, Database, Druid, Hadoop, Hive
BI/OLAP requires interactive visualization of complex data streams: Real time bidding events User activity streams Voice call logs Network traffic flows Firewall events Application KPIs Traditional…
By David WORMS
Jul 14, 2016
Network Namespace without Docker
Categories: Hack | Tags: DNS, Linux, Namespaces, VLAN, Docker, Network
Let's imagine the following use case: I am connected to several networks (wlan0, eth0, usb0). I want to choose which network I'm going to use when I launch apps. My app doesn't allow me to choose a…
Jul 6, 2016
Red Hat Storage Gluster and its integration with Hadoop
Categories: Big Data | Tags: GlusterFS, Red Hat, Hadoop, HDFS, Storage
I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have here recompiled my notes, at least partially. I will…
By David WORMS
Jul 3, 2015
A simple connect middleware to transpile CoffeeScript files
Categories: Hack, Node.js | Tags: Tools, CoffeeScript, Node.js
This new module called connect-coffee-script is a Connect middleware used to serve JavaScript files written in CoffeeScript. This middleware is to be used by connect or any Connect compatible…
By David WORMS
Jul 4, 2014
Tutorial for creating and publishing a new Node.js module
Categories: Front End | Tags: Learning and tutorial, License, Mocha, NPM, Travis CI, CoffeeScript, GitHub, JavaScript, Node.js, Unit tests
In this tutorial, I provide complete instructions for creating a new Node.js module, writing the code in CoffeeScript, publishing it on GitHub, sharing it with other Node.js fellows through NPM…
By David WORMS
Dec 3, 2013
Crawl your website including login form with PhantomJS
Categories: Front End | Tags: Mocha, CoffeeScript, JavaScript, Node.js, Unit tests
With PhantomJS, we start a headless WebKit and pilot it with our own scripts. Said differently, we write a script in JavaScript or CoffeeScript which controls an Internet browser and manipulates the…
By David WORMS
Nov 27, 2013
Catch 'uncaughtException' error in your mocha test
Categories: Node.js | Tags: DevOps, Mocha, JavaScript, Unit tests
This isn't the first time I faced this situation. Today, I finally found the time and energy to look for a solution. In your mocha test, let's say you need to test an expected "uncaughtException…
By David WORMS
Oct 27, 2013
Remote connection with SSH
Categories: Cyber Security | Tags: Automation, HTTP, SSH
While teaching Big Data and Hadoop, a student asked me about SSH and how to use it. I'll discuss the protocol and the tools to benefit from it. Lately, I automate the deployment of Hadoop clusters…
By David WORMS
Oct 2, 2013
Components for CDH and HDP
Categories: Big Data | Tags: Flume, Hortonworks, Hadoop, Hive, Oozie, Sqoop, Zookeeper, Cloudera, CDH, HDP
I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…
By David WORMS
Sep 22, 2013
Splitting HDFS files into multiple hive tables
Categories: Data Engineering | Tags: Flume, Pig, HDFS, Hive, Oozie, SQL
I am going to show how to split a CSV file stored inside HDFS as multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…
By David WORMS
Sep 15, 2013
About the new BSD license and its difference with other BSD licenses
Categories: Data Governance | Tags: License, Open source
As a non-restrictive Open Source license, the "new BSD license" is a commonly used license across the Node.js community. However, this is only one of the BSD licenses available along with the original "BSD…
By David WORMS
Aug 8, 2013
Kerberos and delegation tokens security with WebHDFS
Categories: Cyber Security | Tags: HTTP, HDFS, Big Data, Kerberos
WebHDFS is an HTTP REST server bundled with the latest version of Hadoop. What interests me in this article is to dig into security with the Kerberos and delegation tokens functionalities. I will cover…
By David WORMS
Jul 25, 2013
Testing the Oracle SQL Connector for Hadoop HDFS
Categories: Data Engineering | Tags: Database, File system, Oracle, HDFS, CDH, SQL
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other…
By David WORMS
Jul 15, 2013
Maven 3 behind a proxy
Categories: Hack | Tags: Maven, Java, Proxy
Maven 3 isn't so different from its previous version 2. You will migrate most of your projects quite easily between the two versions. That wasn't the case a few years ago between versions 1 and…
By David WORMS
Jul 11, 2013
Node CSV version 0.2.7
Categories: Hack | Tags: Pipeline, CoffeeScript, CSV, Node.js
While I'm releasing version 0.2.7 of the CSV parser for Node.js, I stop here to drop a few lines about what has made it into this release. We are now using the latest CoffeeScript, which is version 1.4.…
By David WORMS
Jul 9, 2013
State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science
Hadoop is already a large ecosystem and my guess is that 2013 will be the year where it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, HBase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013
Oracle and Hive, how data are published?
Categories: Big Data | Tags: Oracle, Hive, Sqoop, Data Lake
In the past few days, I've published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with…
By David WORMS
Jul 6, 2013
Oracle to Apache Hive with the Oracle SQL Connector
Categories: Business Intelligence | Tags: Oracle, HDFS, Hive, Network
In a previous article published last week, I introduced the choices available to connect Oracle and Hadoop. In a follow-up article, I covered the Oracle SQL Connector, its installation and integration…
By David WORMS
May 27, 2013
Options to connect and integrate Hadoop with Oracle
Categories: Data Engineering | Tags: Database, Java, Oracle, R, RDBMS, Avro, HDFS, Hive, MapReduce, Sqoop, NoSQL, SQL
I will list the different tools and libraries available to us developers in order to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow-up article…
By David WORMS
May 15, 2013
The state of Hadoop distributions
Categories: Big Data | Tags: Hortonworks, Intel, Oracle, Hadoop, Cloudera
Apache Hadoop is of course made available for download on its official webpage. However, downloading and installing the several components that make a Hadoop cluster is not an easy task and is a…
By David WORMS
May 11, 2013
Apache Hive Essentials How-to by Darren Lee
Categories: Business Intelligence, Learning | Tags: UDF, Hadoop, Hive, File Format, SQL
Recently, I've been asked to review a new book on Apache Hive called "Apache Hive Essentials How-to" (edit: the second edition is now available) written by Darren Lee and published by Packt Publishing…
By David WORMS
Apr 23, 2013
Hadoop development cluster of virtual machines with static IP using VirtualBox
Categories: Infrastructure | Tags: Ambari, Hortonworks, Red Hat, VirtualBox, VM, VMware, Cloudera, Network
A few days ago, I explained how to set up a cluster of virtual machines with static IPs and Internet access suitable to host your Hadoop cluster locally for development. At the time I made use of VMware…
By David WORMS
Mar 14, 2013
Definitions of machine learning algorithms present in Apache Mahout
Categories: Data Science | Tags: Algorithm, Classification, Hadoop, Mahout, Clustering, Machine Learning
Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…
By David WORMS
Mar 8, 2013
Virtual machines with static IP for your Hadoop development cluster
Categories: Infrastructure | Tags: Ambari, Hortonworks, Red Hat, VirtualBox, VM, VMware, Cloudera, Network
While I am about to install and test Ambari, this article is the occasion to illustrate how I set up my development environment with multiple virtual machines. Ambari, the deployment and monitoring…
By David WORMS
Feb 27, 2013
Merging multiple files in Hadoop
Categories: Hack | Tags: File system, Hadoop, HDFS
This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the "getmerge" utility of but contrary to "getmerge", the final…
By David WORMS
Jan 12, 2013
E-commerce electronic cigarettes: first impressions with Prestashop
Categories: Tech Radar | Tags: HTML, Java, Node.js
Last year, I had to select and integrate an e-commerce software for the website CigarHit selling electronic cigarettes. Considering that the last e-commerce integration I made dated from 2005, I took…
By David WORMS
Jul 25, 2012
Node CSV version 0.2.1
Categories: Node.js | Tags: CoffeeScript, CSV, Release and features, Streaming
After the announcement of version 0.2.0 of the Node.js CSV parser at the beginning of October, we are releasing today a new version 0.2.1. This is mostly a bug fix release with enhanced…
By David WORMS
Jul 24, 2012
Node CSV version 0.1 and future developments
Categories: Node.js | Tags: CoffeeScript, CSV, Markdown, Release and features, Streaming
The Node CSV parser has just reached version 0.1, which closes the 0.0.x releases. Started almost 2 years ago, the project has received a tremendous amount of participation in the form of bug reports…
By David WORMS
Jul 21, 2012
Convert .flac music files to .mp3 on osx
Categories: Hack | Tags: OS X, File Format
As an osx user for years now, one should know by now that iTunes doesn't support the flac format. We are now in 2012, I've been waiting for this to happen for years now. Losing patience, dark…
By David WORMS
Jul 20, 2012
Hadoop and R with RHadoop
Categories: Business Intelligence, Data Science | Tags: Thrift, Learning and tutorial, R, Hadoop, HBase, HDFS, MapReduce, Data Analytics
RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…
By David WORMS
Jul 19, 2012
Asynchronous array iteration in Node.js with Each
Categories: Node.js | Tags: Asynchronous, CoffeeScript, JavaScript, Release and features
Control flow in Node.js is the sort of library for which almost all developers have created and published their own libraries. They usually aim at reducing spaghetti code made of deep callbacks. I…
By David WORMS
Jul 18, 2012
Installing and using MADlib with PostgreSQL on OSX
Categories: Data Science | Tags: Database, Greenplum, Statistics, PostgreSQL, SQL
We cover basic installation and usage of PostgreSQL and MADlib on OSX and Ubuntu. Instructions for other environments should be similar. PostgreSQL is an Open Source database with enterprise…
By David WORMS
Jul 7, 2012
Node CSV version 0.2 with streaming API
Categories: Node.js | Tags: Data Engineering, CSV, Markdown, Node.js, Streaming
The Node CSV parser in its version 0.2 has just been released. This version is a major enhancement as it aligns the parser with the best Node.js practices with respect to streams. The CSV parser behave…
By David WORMS
Jul 2, 2012
HDFS and Hive storage - comparing file formats and compression methods
Categories: Big Data | Tags: Business intelligence, Hive, ORC, Parquet, File Format
A few days ago, we conducted a test in order to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…
By David WORMS
Mar 13, 2012
Two Hive UDAF to convert an aggregation to a map
Categories: Data Engineering | Tags: Java, HBase, Hive, File Format
I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes: "UDAFToMap" and "UDAFToOrderedMap" or you can download the jar file. The…
By David WORMS
Mar 6, 2012
Java versus JS fun, a quote from the Node.js mailing list
Categories: Node.js | Tags: Java, JavaScript, Node.js
I just read that one on the mailing list. I found it relevant enough to share it with those who did not subscribe to it: First Lothar Pfeiler: I still wonder, if it's cool to have such a big…
By David WORMS
Feb 23, 2012
A fresh look at testing Node.js projects: Mocha, Should and Travis
Categories: DevOps & SRE, Node.js | Tags: DevOps, Mocha, CI/CD, JavaScript, Node.js, Unit tests
Today, I finally decided to spend some time around Travis. It's been a few weeks since that little green image on top of many GitHub homepages has been buzzing me. Well, to be totally honest, this isn…
By David WORMS
Feb 19, 2012
Coffee script, how do I debug that damn js line?
Categories: Hack, Node.js | Tags: Debug, CoffeeScript, JavaScript, Node.js
Update April 12th, 2012: Pull request adding error reporting to CoffeeScript with line mapping Chances are that, if you code in CoffeeScript, you often find yourself facing a JavaScript exception…
By David WORMS
Feb 15, 2012
Announcing Mecano, a set of functions for system deployment
Categories: DevOps & SRE, Node.js | Tags: Automation, Infrastructure, CoffeeScript, JavaScript, Open source
Update July 2016, Mecano is now renamed Nikita. We are releasing Node Mecano on GitHub, which gathers common functions used while deploying systems. The idea was to group those functions into a…
By David WORMS
Feb 12, 2012
OS module on steroids with the SIGAR Node binding
Categories: Node.js | Tags: C++, CPU, File system, Metrics, Monitoring, Network
Today we are announcing the first release of the Node binding to the SIGAR library. Visit the project website or the source code repository on GitHub. SIGAR is a cross-platform interface for gathering…
By David WORMS
Jan 11, 2012
Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: CRM, timeseries, Tuning, Hadoop, HDFS, Hive, File Format
In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated to one time series table of about 7,000 billiard rows…
By David WORMS
Jan 10, 2012
How Node CSV parser may save your weekend
Categories: Hack | Tags: Bash, Hack, CSV, Node.js
Last Friday, an hour before the doors of my customer closed for the weekend, a co-worker came to me. He had just finished exporting 9 CSV files from an Oracle database which he wanted to import into…
By David WORMS
Dec 13, 2011
Node.js is now integrated to the Microsoft Azure platform
Categories: Cloud Computing, Tech Radar | Tags: Linux, Azure, Cloud, Node.js
Node is now a first class citizen in the Microsoft Azure cloud environment alongside .Net, Java and PHP. This integration is the logical consequence of Microsoft's involvement in the development of…
By David WORMS
Dec 11, 2011
Hadoop and HBase installation on OSX in pseudo-distributed mode
Categories: Big Data, Learning | Tags: Hue, Infrastructure, Hadoop, HBase, Big Data, Deployment
The operating system chosen is OSX but the procedure is not so different for any Unix environment because most of the software is downloaded from the Internet, uncompressed and set manually. Only a…
By David WORMS
Dec 1, 2010
Storage and massive processing with Hadoop
Categories: Big Data | Tags: Hadoop, HDFS, Storage
Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…
By David WORMS
Nov 26, 2010
Node HBase, a NodeJs client for Apache HBase
Categories: Big Data, Node.js | Tags: HBase, Big Data, Node.js, REST
HBase is a "column family" database from the Hadoop ecosystem built on the model of Google BigTable. HBase can accommodate very large volumes of data (tera or peta) while maintaining high…
By David WORMS
Nov 1, 2010
MapReduce introduction
Categories: Big Data | Tags: Java, MapReduce, Big Data, JavaScript
Information systems have more and more data to store and process. Companies like Google, Facebook, Twitter and many others store astronomical amounts of information from their customers and must be…
By David WORMS
Jun 26, 2010
Node.js, JavaScript on the server side
Categories: Front End, Node.js | Tags: HTTP, Server, JavaScript, Node.js
While waiting for the Next Big Language (NBL), it has now been 3 years or more since I started predicting to my customers a bright future for JavaScript as a programming language for server…
By David WORMS
Jun 12, 2010