Apache Spark
Apache Spark is a unified in-memory analytics platform for Big Data processing, data streaming, SQL, Machine Learning and graph processing.
The open source project, an Apache Software Foundation top-level project since 2014, originated at UC Berkeley's AMPLab. It has since become a major player in the Big Data ecosystem, both as an alternative to MapReduce and as an evolution of it.
Thanks to its distributed architecture, Apache Spark executes across the nodes of a cluster to process large amounts of data in parallel and with high performance. Apache Spark processes the data in memory and is optimized to limit disk usage.
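The partition-and-combine idea behind this kind of parallel, in-memory processing can be sketched in plain Python. This is only a conceptual illustration and not Spark's API: the function names (`split_into_partitions`, `count_words`, `parallel_word_count`) are hypothetical, and local threads stand in for the worker processes a real cluster would provide.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin into in-memory partitions (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Per-partition work: count words locally, independently of other partitions."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, num_partitions=4):
    """Process each partition concurrently, then merge the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads are an assumption of this sketch; a cluster would use remote workers.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = [
    "spark processes data in memory",
    "spark runs in a cluster",
    "data in parallel",
]
result = parallel_word_count(lines, num_partitions=2)
```

Each partition is processed without reading from or writing to disk, and only the small partial counts are merged at the end, which is the general pattern the paragraph above describes.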
Many users work with Spark DataFrames, which have been integrated in Scala, Python and Java since Spark version 2. Spark DataFrames, comparable to R data frames or pandas DataFrames, let data be queried in a table structure. Their integration with Machine Learning enables analytical models to be applied to Big Data with Apache Spark. This versatility is why the system is often referred to as the Swiss Army knife of data processing.
Spark runs on various platforms, including standalone hosts and clusters, Hadoop clusters with YARN, and the Databricks platform.
Related articles
CDP part 6: end-to-end data lakehouse ingestion pipeline with CDP
Categories: Big Data, Data Engineering, Learning | Tags: NiFi, Business intelligence, Data Engineering, Iceberg, Spark, Big Data, Cloudera, CDP, Data Analytics, Data Lake, Data Warehouse
In this hands-on lab session we demonstrate how to build an end-to-end big data solution with Cloudera Data Platform (CDP) Public Cloud, using the infrastructure we have deployed and configured over…
Jul 24, 2023
Spark on Hadoop integration with Jupyter
Categories: Adaltas Summit 2021, Infrastructure, Tech Radar | Tags: Infrastructure, Jupyter, Spark, YARN, CDP, HDP, Notebook, TDP
For several years, Jupyter notebook has established itself as the notebook solution in the Python universe. Historically, Jupyter is the tool of choice for data scientists who mainly develop in Python…
Sep 1, 2022
Comparison of database architectures: data warehouse, data lake and data lakehouse
Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format
Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…
By Gonzalo ETSE
May 17, 2022
Introducing Trunk Data Platform: the Open-Source Big Data Distribution Curated by TOSIT
Categories: Big Data, DevOps & SRE, Infrastructure | Tags: DevOps, Hortonworks, Ansible, Hadoop, HBase, Knox, Ranger, Spark, Cloudera, CDP, CDH, Open source, TDP
Ever since Cloudera and Hortonworks merged, the choice of commercial Hadoop distributions for on-prem workloads essentially boils down to CDP Private Cloud. CDP can be seen as the “best of both worlds…
Apr 14, 2022
Databricks logs collection with Azure Monitor at a Workspace Scale
Categories: Cloud Computing, Data Engineering, Adaltas Summit 2021 | Tags: Metrics, Monitoring, Spark, Azure, Databricks, Log4j
Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is crucial to ensure data quality, job performance, and security issues by limiting access to…
By Claire PLAYE
May 10, 2022
Spring 2022 internship - building a Data Lab
Categories: Data Science, Learning | Tags: MongoDB, Spark, Argo CD, Elasticsearch, Internship, Keycloak, Kubernetes, OpenID Connect, PostgreSQL
Job Description Over the last few years, we developed the ability to use computers to process large amounts of data. The ecosystem evolved over a large offering of tools and libraries and the creation…
By David WORMS
Nov 24, 2021
Internship in Big Data infrastructure with TDP
Categories: Infrastructure, Learning | Tags: Cyber Security, DevOps, Java, Hadoop, IaC, Internship, TDP
Job Description Big Data and distributed computing is at Adaltas’ core. We support our partners in the deployment, maintenance and optimization of some of France’s largest clusters. Adaltas is also an…
By Daniel HARTY
Oct 25, 2021
Adaltas Summit 2021, 2nd edition in Corsica
Categories: Adaltas Summit 2021, Learning | Tags: Ansible, Hadoop, Spark, Azure, Blockchain, Deep Learning, Docker, Terraform, Kubernetes, Node.js
For its second edition, the whole Adaltas crew is gathering in Corsica for a whole week, with 2 days dedicated to technology, the 23rd and the 24th of September 2021. After a year and a half of sanitary…
By David WORMS
Sep 21, 2021
Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin
Categories: Big Data, Infrastructure | Tags: Maven, Hadoop, HBase, Hive, Spark, Git, Release and features, TDP, Unit tests
The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3 compatible object storages are growing in popularity, HDFS and YARN…
Dec 18, 2020
Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online the 17-19th November…
Sep 30, 2020
Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Introduction to Databricks Community Edition and MLflow Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…
Sep 10, 2020
Comparison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers
In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…
By Aida NGOM
Jul 23, 2020
Automate a Spark routine workflow from GitLab to GCP
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Learning and tutorial, Airflow, Spark, CI/CD, GitLab, GitOps, GCP, Terraform
A workflow consists of automating a succession of tasks to be carried out without human intervention. It is an important and widespread concept which particularly applies to operational environments…
Jun 16, 2020
Introducing Apache Airflow on AWS
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: PySpark, Learning and tutorial, Airflow, Oozie, Spark, AWS, Docker, Python
Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…
May 5, 2020
Optimization of Spark applications in Hadoop YARN
Categories: Data Engineering, Learning | Tags: Tuning, Hadoop, Spark, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
Mar 30, 2020
Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming
Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…
Jun 27, 2019
Spark Streaming part 3: DevOps, tools and tests for Spark applications
Categories: Big Data, Data Engineering, DevOps & SRE | Tags: DevOps, Learning and tutorial, Spark, Apache Spark Streaming
Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…
May 31, 2019
Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop
Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Python, Streaming
Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…
May 28, 2019
Spark Streaming part 1: build data pipelines with Spark Structured Streaming
Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming
Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…
Apr 18, 2019
Cloudera CDP and Cloud migration of your Data Warehouse
Categories: Big Data, Cloud Computing | Tags: Azure, Cloudera, Data Hub, Data Lake, Data Warehouse
While one of our customers is anticipating a move to the Cloud and with the recent announcement of Cloudera CDP availability mid-September during the Strata conference, it seems like the appropriate…
By David WORMS
Dec 16, 2019
Should you move your Big Data and Data Lake to the Cloud
Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Azure, Cloud, CDP, Databricks, GCP
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New York, a general focus was put on moving customer’s Big…
Dec 9, 2019
Hadoop Ozone part 1: an introduction of the new filesystem
Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes
Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…
Dec 3, 2019
InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS
Categories: Big Data, Containers Orchestration | Tags: DevOps, LXD, Hadoop, Kafka, Spark, Ceph, Internship, Kubernetes, NoSQL
Context The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PAAS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…
By David WORMS
Nov 26, 2019
Internship Data Science & Data Engineer - ML in production and streaming data ingestion
Categories: Data Engineering, Data Science | Tags: DevOps, Flink, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python
Context The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…
By David WORMS
Nov 26, 2019
Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
Sep 30, 2019
Publish Spark SQL DataFrame and RDD with Spark Thrift Server
Categories: Data Engineering | Tags: Thrift, JDBC, Hadoop, Hive, Spark, SQL
The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports…
Mar 25, 2019
Clusters and workloads migration from Hadoop 2 to Hadoop 3
Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Rolling Upgrade, HDFS, Spark, YARN, Docker
Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…
Jul 25, 2018
Deep learning on YARN: running Tensorflow and friends on Hadoop cluster
Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost
With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…
Jul 24, 2018
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
TensorFlow on Spark 2.3: The Best of Both Worlds
Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow
The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new…
By Yliess HATI
May 29, 2018
Apache Metron in the Real World
Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, NiFi, Solr, Storm, pcap, RDBMS, HDFS, Kafka, Metron, Spark, Data Science, Elasticsearch, SQL
Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…
May 29, 2018
Apache Beam: a unified programming model for data processing pipelines
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Pipeline, Flink, Spark
In this article, we will review the concepts, the history and the future of Apache Beam, which may well become the new standard for defining data processing pipelines. At DataWorks Summit 2018 in…
May 24, 2018
What's new in Apache Spark 2.3?
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, PySpark, Tuning, ORC, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Streaming
Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…
May 23, 2018
EclairJS - Putting a Spark in Web Apps
Categories: Data Engineering, Front End | Tags: Jupyter, Spark, JavaScript
Presentation by David Fallside from IBM, images extracted from the presentation. Introduction Web Apps development has moved from Java to NodeJS and Javascript. It provides a simple and rich…
By David WORMS
Jul 17, 2016