Data Science
Data science, and more broadly Artificial Intelligence (AI), differs from traditional programming and analysis in its ability to extract knowledge from data and to modify its behavior (that is, to learn) without being explicitly programmed. Whereas traditional software predefines the logic that governs its processes, data science algorithms build and discover models and are able to improve them continuously.
Data science brings together a set of skills including Machine Learning, Natural Language Processing (NLP), and speech, image, and face recognition, among other applications. In some applications, the algorithms go as far as simulating human intelligence.
Key takeaways
- Data Scientists build, train, and validate models used to make critical decisions.
- Data Scientists manage data access, reproducibility, and collaboration in order to quickly create models that can be deployed at scale.
- Adaltas enables Data Scientists to easily create, scale, and deploy machine learning models in minutes, helping to drive innovation across the whole company.
Articles related to data science
Deploy your containerized AI applications with nvidia-docker
Categories: Containers Orchestration, Data Science | Tags: containerd, DevOps, Learning and tutorial, NVIDIA, Docker, Keras, TensorFlow
More and more products and services are taking advantage of the modeling and prediction capabilities of AI. This article presents the nvidia-docker tool for integrating AI (Artificial Intelligence…
March 24, 2022
Spring 2022 internship - building a Data Lab
Categories: Data Science, Learning | Tags: MongoDB, Spark, Argo CD, Elasticsearch, Internship, Keycloak, Kubernetes, OpenID Connect, PostgreSQL
Job description. Over the last few years, we developed the ability to use computers to process large amounts of data. The ecosystem evolved over a large offering of tools and libraries and the creation…
By David WORMS
November 24, 2021
H2O in practice: a protocol combining AutoML with traditional modeling approaches
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python, XGBoost
H2O comes with a lot of functionalities. The second part of the series H2O in practice proposes a protocol to combine AutoML modeling with traditional modeling and optimization approaches. The objective…
November 12, 2021
H2O in practice: a Data Scientist feedback
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Automated machine learning (AutoML) platforms are gaining popularity and becoming a new important tool in the data scientists’ toolbox. A few months ago, I introduced H2O, an open-source platform for…
September 29, 2021
Apache Liminal: when MLOps meets GitOps
Categories: Big Data, Containers Orchestration, Data Engineering, Data Science, Tech Radar | Tags: Data Engineering, CI/CD, Data Science, Deep Learning, Deployment, Docker, GitOps, Kubernetes, Machine Learning, MLOps, Open source, Python, TensorFlow
Apache Liminal is open-source software that proposes a solution to deploy end-to-end Machine Learning pipelines. It makes it possible to centralize all the steps needed to construct Machine Learning…
By Aargan COINTEPAS
March 31, 2021
Storage size and generation time in popular file formats
Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)
Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…
By Barthelemy NGOM
March 22, 2021
TensorFlow Extended (TFX): the components and their functionalities
Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, Data Engineering, Pipeline, CI/CD, Data Science, Deep Learning, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow
Putting Machine Learning (ML) and Deep Learning (DL) models in production certainly is a difficult task. It has been recognized as more failure-prone and time-consuming than the modeling itself, yet…
March 5, 2021
Faster model development with H2O AutoML and Flow
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate a…
December 10, 2020
Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online on 17-19 November…
September 30, 2020
Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Introduction to Databricks Community Edition and MLflow. Every day, the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…
September 10, 2020
Version your datasets with Data Version Control (DVC) and Git
Categories: Data Science, DevOps & SRE | Tags: DevOps, Infrastructure, Operation, Git, GitOps, SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…
By Grégor JOUET
September 3, 2020
Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python
During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…
May 21, 2020
MLflow tutorial: an open source Machine Learning (ML) platform
Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn
Introduction and principles of MLflow. With increasingly cheaper computing power and storage and at the same time increasing data collection in all walks of life, many companies integrated Data Science…
March 23, 2020
Introduction to Ludwig and how to deploy a Deep Learning model via Flask
Categories: Data Science, Tech Radar | Tags: Learning and tutorial, Deep Learning, Ludwig Deep Learning Toolbox, Machine Learning, Python
Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…
March 2, 2020
Internship Data Science & Data Engineer - ML in production and streaming data ingestion
Categories: Data Engineering, Data Science | Tags: DevOps, Flink, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python
Context. The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…
By David WORMS
November 26, 2019
Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod
Categories: Data Science | Tags: GPU, Deep Learning, Horovod, Keras, TensorFlow
The Deep Learning training process can be greatly sped up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…
By Grégor JOUET
November 15, 2019
Innovation, project vs product culture in Data Science
Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum
Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…
By David WORMS
October 8, 2019
Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
By Oskar RYNKIEWICZ
September 30, 2019
TensorFlow installation on Docker
Categories: Containers Orchestration, Data Science, Learning | Tags: CPU, Jupyter, Linux, AI, Deep Learning, Docker, TensorFlow
TensorFlow is open-source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations; edges represent N-dimensional data array…
By Pierre SAUVAGE
August 5, 2019
Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming
Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…
By Oskar RYNKIEWICZ
June 27, 2019
Introduction to Cloudera Data Science Workbench
Categories: Data Science | Tags: Azure, Cloudera, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook
Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…
By Mehdi ELALAMI
February 28, 2019
Applying Deep Reinforcement Learning to Poker
Categories: Data Science | Tags: Algorithm, Gaming, Q-learning, Deep Learning, Machine Learning, Neural Network, Python
We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…
January 9, 2019
CodaLab – Data Science competitions
Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python
CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…
December 17, 2018
Nvidia and AI on the edge
Categories: Data Science | Tags: Caffe, GPU, NVIDIA, AI, Deep Learning, Edge computing, Keras, PyTorch, TensorFlow
In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…
By Yliess HATI
October 10, 2018
Lando: Deep Learning used to summarize conversations
Categories: Data Science, Learning | Tags: Micro Services, Open API, Deep Learning, Internship, Kubernetes, Neural Network, Node.js
Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning techniques to summarize contents. It allows users to…
By Yliess HATI
September 18, 2018
Deep learning on YARN: running Tensorflow and friends on Hadoop cluster
Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost
With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…
By Louis BIANCHERIN
July 24, 2018
YARN and GPU Distribution for Machine Learning
Categories: Data Science, DataWorks Summit 2018 | Tags: GPU, YARN, Machine Learning, Neural Network, Storage
This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…
By Grégor JOUET
May 30, 2018
TensorFlow on Spark 2.3: The Best of Both Worlds
Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow
The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new…
By Yliess HATI
May 29, 2018
Apache Apex with Apache SAMOA
Categories: Data Science, Events, Tech Radar | Tags: Apex, Samoa, Storm, Tools, Flink, Hadoop, Machine Learning
Traditional Machine Learning: batch oriented; supervised (most common); training and scoring; one-time model building. Data set: training (model building), holdout (parameter tuning), test (accuracy). Online…
By Pierre SAUVAGE
July 17, 2016
Apache Apex: next gen Big Data analytics
Categories: Data Science, Events, Tech Radar | Tags: Apex, Storm, Tools, Flink, Hadoop, Kafka, Data Science, Machine Learning
Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction. Apache Apex is an in-memory distributed parallel…
By César BEREZOWSKI
July 17, 2016
Definitions of machine learning algorithms present in Apache Mahout
Categories: Data Science | Tags: Algorithm, Classification, Hadoop, Mahout, Clustering, Machine Learning
Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…
By David WORMS
March 8, 2013
Hadoop and R with RHadoop
Categories: Business Intelligence, Data Science | Tags: Thrift, Learning and tutorial, R, Hadoop, HBase, HDFS, MapReduce, Data Analytics
RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…
By David WORMS
July 19, 2012
Installing and using MADlib with PostgreSQL on OSX
Categories: Data Science | Tags: Database, Greenplum, Statistics, PostgreSQL, SQL
We cover basic installation and usage of PostgreSQL and MADlib on OSX and Ubuntu. Instructions for other environments should be similar. PostgreSQL is an Open Source database with enterprise…
By David WORMS
July 7, 2012