Operation
Related articles
Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI Summit), has been accepted. The summit will take place online on the 17-19th of November…
Sep 30, 2020
Version your datasets with Data Version Control (DVC) and Git
Categories: Data Science, DevOps & SRE | Tags: DevOps, Infrastructure, Operation, Git, GitOps, SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn't we have a similar approach such…
Sep 3, 2020
Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
Sep 30, 2019
Auto-scaling Druid with Kubernetes
Categories: Big Data, Business Intelligence, Containers Orchestration | Tags: Helm, Metrics, OLAP, Operation, Container Orchestration, EC2, Druid, Cloud, CNCF, Data Analytics, Kubernetes, Prometheus, Python
Apache Druid is an open-source analytics data store that can leverage the auto-scaling abilities of Kubernetes thanks to its distributed nature and its reliance on memory. I was inspired by the talk…
Jul 16, 2019
Clusters and workloads migration from Hadoop 2 to Hadoop 3
Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Rolling Upgrade, HDFS, Spark, YARN, Docker
The migration from Hadoop 2 to Hadoop 3 is a hot subject. How to upgrade your clusters, which features of the new release may solve current problems and bring new opportunities, how are your current…
Jul 25, 2018
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion, both for batch and stream architectures, that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
Running Enterprise Workloads in the Cloud with Cloudbreak
Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, Operation, Hadoop, AWS, Azure, GCP, HDP, OpenStack
This article is based on Peter Darvasi and Richard Doktorics' talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks' automated deployment tool…
May 28, 2018
Ambari - How to blueprint
Categories: Big Data, DevOps & SRE | Tags: Ambari, Automation, DevOps, Operation, Ranger, REST
As infrastructure engineers at Adaltas, we deploy Hadoop clusters. A lot of them. Let's see how to automate this process with REST requests. While really handy for deploying one or two clusters, the…
Jan 17, 2018
Advanced multi-tenant Hadoop and Zookeeper protection
Categories: Big Data, Infrastructure | Tags: DoS, iptables, Operation, Scalability, Zookeeper, Clustering, Consensus
Zookeeper is a critical component of Hadoop's high availability operation. It protects itself by limiting the maximum number of connections (maxConns = 400). However, Zookeeper does not protect…
Jul 5, 2017