Machine Learning
Machine learning is a subfield of artificial intelligence. The aim is to build a mathematical description or model of the data we have in order to be able to gain new understanding about the system or to predict its future behavior. Approaches can be divided in three categories:
- Supervised learning ā observations are labeled, meaning that each observation in a dataset belongs to a known class. The aim is to predict this class of new observations, where it is unknown. Some algorithms: linear and logistic regression, decision trees, support vector machines, artificial neural networks.
- Unsupervised learning ā data is unlabeled. The goal is to discover new underlying patterns with minimum of human supervision. Examples of algorithms are clustering, principal component analysis and association rules.
- Reinforcement learning ā does not need labeled data. An agent exists in an environment in which it takes actions towards accomplishing a goal. For each action it can be positively or negatively rewarded. After repeating the same sequence of actions multiple times, it seeks to maximize the award and minimize the effort. Thus, it learns the optimal way to accomplish a task. Two categories of algorithms are model-free and model-based algorithms.
- Learn more
- Wikipedia
Related articles

H2O in practice: a protocol combining AutoML with traditional modeling approaches
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python, XGBoost
H20 comes with a lot of functionalities. The second part of the series H2O in practice proposes a protocol to combine AutoML modeling with traditional modeling and optimization approach. The objectiveā¦
Nov 12, 2021

H2O in practice: a Data Scientist feedback
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Automated machine learning (AutoML) platforms are gaining popularity and becoming a new important tool in the data scientistsā toolbox. A few months ago, I introduced H2O, an open-source platform forā¦
Sep 29, 2021

Self-Paced training from Databricks: a guide to self-enablement on Big Data & AI
Categories: Data Engineering, Learning | Tags: Cloud, Data Lake, Databricks, Delta Lake, MLflow
Self-paced trainings are proposed by Databricks inside their Academy program. The price is $ 2000 USD for unlimited access to the training courses for a period of 1 year, but also free for customersā¦
May 26, 2021

Apache Liminal: when MLOps meets GitOps
Categories: Big Data, Containers Orchestration, Data Engineering, Data Science, Tech Radar | Tags: Data Engineering, CI/CD, Data Science, Deep Learning, Deployment, Docker, GitOps, Kubernetes, Machine Learning, MLOps, Open source, Python, TensorFlow
Apache Liminal is an open-source software which proposes a solution to deploy end-to-end Machine Learning pipelines. Indeed it permits to centralize all the steps needed to construct Machine Learningā¦
Mar 31, 2021

TensorFlow Extended (TFX): the components and their functionalities
Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, Data Engineering, Pipeline, CI/CD, Data Science, Deep Learning, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow
Putting Machine Learning (ML) and Deep Learning (DL) models in production certainly is a difficult task. It has been recognized as more failure-prone and time consuming than the modeling itself, yetā¦
Mar 5, 2021

Faster model development with H2O AutoML and Flow
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate aā¦
Dec 10, 2020

Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning proposed to the Data + AI Summit (formerly known as Spark+AI) is accepted. The summit will take place online the 17-19th Novemberā¦
Sep 30, 2020

Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Introduction to Databricks Community Edition and MLflow Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and theā¦
Sep 10, 2020

Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python
During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a laterā¦
May 21, 2020

MLflow tutorial: an open source Machine Learning (ML) platform
Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn
Introduction and principles of MLflow With increasingly cheaper computing power and storage and at the same time increasing data collection in all walks of life, many companies integrated Data Scienceā¦
Mar 23, 2020

Introduction to Ludwig and how to deploy a Deep Learning model via Flask
Categories: Data Science, Tech Radar | Tags: Learning and tutorial, Deep Learning, Ludwig Deep Learning Toolbox, Machine Learning, Python
Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomousā¦
Mar 2, 2020

Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming
Spark MLlib is an Apacheās Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, Spark framework can serve as a platform forā¦
Jun 27, 2019

Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod
Categories: Data Science | Tags: GPU, Deep Learning, Horovod, Keras, TensorFlow
The Deep Learning training process can be greatly speed up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle whichā¦
Nov 15, 2019

Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
āEnterprise Machine Learning requires looking at the big picture [ā¦] from a data engineering and a data platform perspective,ā lectured Justin Norman during the talk on the deployment of Machineā¦
Sep 30, 2019

Introduction to Cloudera Data Science Workbench
Categories: Data Science | Tags: Azure, Cloudera, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook
Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their mainā¦
Feb 28, 2019

Applying Deep Reinforcement Learning to Poker
Categories: Data Science | Tags: Algorithm, Gaming, Q-learning, Deep Learning, Machine Learning, Neural Network, Python
We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then weāll apply a version of this algorithm to the game of Pokerā¦
Jan 9, 2019

CodaLab ā Data Science competitions
Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python
CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Letās see how itā¦
Dec 17, 2018

Apache Flink: past, present and future
Categories: Data Engineering | Tags: Pipeline, Flink, Kubernetes, Machine Learning, SQL, Streaming
Apache Flink is a little gem which deserves a lot more attention. Letās dive into Flinkās past, its current state and the future it is heading to by following the keynotes and presentations at Flinkā¦
Nov 5, 2018

YARN and GPU Distribution for Machine Learning
Categories: Data Science, DataWorks Summit 2018 | Tags: GPU, YARN, Machine Learning, Neural Network, Storage
This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can beā¦
May 30, 2018

TensorFlow on Spark 2.3: The Best of Both Worlds
Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow
The integration of TensorFlow With Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the newā¦
By Yliess HATI
May 29, 2018

Apache Apex with Apache SAMOA
Categories: Data Science, Events, Tech Radar | Tags: Apex, Samoa, Storm, Tools, Flink, Hadoop, Machine Learning
Traditional Machine Learning Batch Oriented Supervised - most common Training and Scoring One time model building Data set Training: Model building Holdout: Paremeter tuning Test: Accuracy Onlineā¦
Jul 17, 2016

Apache Apex: next gen Big Data analytics
Categories: Data Science, Events, Tech Radar | Tags: Apex, Storm, Tools, Flink, Hadoop, Kafka, Data Science, Machine Learning
Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallelā¦
Jul 17, 2016

Definitions of machine learning algorithms present in Apache Mahout
Categories: Data Science | Tags: Algorithm, Š”lassification, Hadoop, Mahout, Clustering, Machine Learning
Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classfication and batch based collaborative filtering are implemented on top of Apache Hadoopā¦
By David WORMS
Mar 8, 2013