Apache Spark
Apache Spark is a distributed data processing engine (big data processing, analytics) for large-scale batch and streaming workloads.
- Unified engine for large-scale batch and streaming data processing (big data processing, data streaming)
- APIs in multiple languages, including Scala, Java, Python, R, and Structured Query Language (SQL) (developer framework)
- Libraries for SQL queries, structured data processing, Machine Learning (ML), graph processing, and stream processing (analytics, ML)
- In-memory computation support to reuse data across operations (big data processing)
- Deployment in standalone cluster mode or on resource managers such as Kubernetes, Hadoop YARN, and Apache Mesos (cluster computing)
More About Apache Spark
Apache Spark is an open-source unified analytics engine (big data processing, analytics) designed for large-scale data processing across distributed computing environments. It targets workloads over large datasets: batch analytics, near-real-time streaming, ML, and graph computation. Spark runs computations in parallel across clusters of machines and hides low-level concerns such as task scheduling, data distribution, and recovery from node failures from application developers.
The core of Apache Spark is a general-purpose distributed execution engine built around Resilient Distributed Datasets (RDDs) and the higher-level structured APIs, DataFrames and Datasets (big data processing). Its programming model has users express transformations and actions on large datasets. Transformations are lazy: they describe a new dataset without computing it. Actions trigger execution, at which point Spark schedules the recorded work across cluster nodes. Spark supports in-memory processing, so intermediate data can be cached and reused across multiple operations instead of being recomputed. It also reads and writes data from external storage systems such as distributed file systems and object stores (data integration).
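The transformation and action model can be illustrated with a small single-machine sketch. This is plain Python, not Spark's API: the `ToyDataset` class, its methods, and the thread pool standing in for cluster nodes are all hypothetical and exist only to show lazy transformations, an action that triggers execution, and reuse of a cached intermediate result.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyDataset:
    """Toy stand-in for a partitioned dataset (hypothetical, not Spark's API)."""

    def __init__(self, partitions=None, parent=None, op=None):
        self._partitions = partitions  # set only on source datasets
        self._parent = parent          # dataset this one is derived from
        self._op = op                  # per-partition function: list -> list
        self._keep = False             # True once cache() has been requested
        self._materialized = None      # computed partitions, if kept

    @classmethod
    def from_items(cls, items, num_partitions=2):
        items = list(items)
        # Deal the items round-robin into num_partitions lists.
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return cls(partitions=parts)

    # Transformations: only record what to do, compute nothing yet.
    def map(self, fn):
        return ToyDataset(parent=self, op=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyDataset(parent=self, op=lambda part: [x for x in part if pred(x)])

    def cache(self):
        self._keep = True
        return self

    def _compute(self):
        if self._materialized is not None:
            return self._materialized  # reuse the kept intermediate result
        if self._parent is None:
            result = self._partitions
        else:
            parent_parts = self._parent._compute()
            # One task per partition; threads stand in for cluster executors.
            with ThreadPoolExecutor() as pool:
                result = list(pool.map(self._op, parent_parts))
        if self._keep:
            self._materialized = result
        return result

    # Actions: trigger the recorded work and return a value.
    def collect(self):
        return [x for part in self._compute() for x in part]

    def count(self):
        return sum(len(part) for part in self._compute())


calls = []  # records every invocation of square()


def square(x):
    calls.append(x)
    return x * x


numbers = ToyDataset.from_items(range(1, 7), num_partitions=3)
squares = numbers.map(square).cache()
evens = squares.filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)         # nothing has run yet
even_squares = sorted(evens.collect())   # action: runs map, then filter
calls_after_collect = len(calls)
total = squares.count()                  # served from the cached partitions
calls_after_count = len(calls)
```

In this sketch `square` is not called until `collect()` runs, and the later `count()` does not call it again because the squared partitions were kept in memory. Real Spark adds lineage-based fault recovery, shuffles, and a cluster scheduler, none of which are modeled here.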
Apache Spark includes several libraries that extend the core engine into specific data domains. Spark SQL (data warehousing, analytics) is the module for structured data processing, with DataFrame and Dataset APIs and a SQL query interface. Structured Streaming (data streaming) supports incremental processing of streaming data using the same APIs as batch workloads. MLlib (machine learning) is a library of scalable ML algorithms and utilities. GraphX (graph processing) provides APIs and operators for graph-parallel computations.
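The idea of incremental processing with the same logic as batch can be sketched in a few lines. This is plain Python rather than the Structured Streaming API: the `count_words` function, the sample lines, and the way micro-batches are formed are assumptions chosen for illustration. The point is only that one aggregation, applied to data arriving in pieces with carried-over state, reaches the same result as a single batch run.

```python
from collections import Counter


def count_words(lines, state=None):
    """Fold lines into a word count, starting from prior state if given."""
    counts = Counter() if state is None else state.copy()
    for line in lines:
        counts.update(line.split())
    return counts


all_lines = ["spark runs jobs", "spark streams data", "jobs read data"]

# Batch: all input is available at once.
batch_counts = count_words(all_lines)

# Streaming: the same function is applied to each micro-batch as it arrives,
# and the running result is carried forward between micro-batches.
stream_counts = None
for micro_batch in ([all_lines[0]], all_lines[1:]):
    stream_counts = count_words(micro_batch, stream_counts)
```

Both paths end with identical counts. In Spark the engine manages this state and its fault tolerance itself, so application code does not pass state by hand as this sketch does.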
In enterprise environments, Apache Spark is used for Extract, Transform, Load (ETL) pipelines, data preparation, data warehousing workloads, interactive analytics, and training and scoring of ML models (analytics, data engineering, ML). Organizations deploy Spark on dedicated clusters or on shared resource managers such as Kubernetes, Hadoop YARN, and Apache Mesos (cluster orchestration); Mesos support has been deprecated since Spark 3.2. Spark connects to storage systems that expose data through file, table, or object interfaces, which allows integration with existing data lakes and data platforms.
Apache Spark supports multiple programming interfaces, including Scala, Java, Python, R, and SQL (developer framework, analytics). This multi-language support allows data engineers, data scientists, and application developers to build applications that run on the same execution engine while using their preferred language. Spark SQL also exposes a catalog of databases, tables, and functions, and can integrate with external metadata services such as the Apache Hive metastore when used with compatible data platforms.
The Spark ecosystem includes modules and APIs that can be extended with custom data sources, user-defined functions, and integrations with external ML frameworks (extensibility). Spark is released under the Apache License 2.0 and governed by The Apache Software Foundation, with development following an open, community-driven process. In a technical taxonomy, Apache Spark fits into categories such as big data processing engines, distributed analytics platforms, and ML execution frameworks used in on-premises (on-prem) and cloud environments.