Apache Tez

Apache Tez is a distributed execution framework for building data processing applications on Hadoop YARN. Applications express their work as directed acyclic graphs (DAGs) of tasks, for both batch and interactive workloads.

  • DAG-based application framework on Hadoop YARN (big data processing)
  • Configurable execution engine for batch and interactive data processing jobs (data processing engine)
  • APIs for expressing dataflow graphs with vertices and edges (developer framework)
  • Integration foundation for higher-level data processing tools and query engines on Hadoop (data platform integration)
  • Resource management integration through YARN for scheduling and scaling Tez applications (cluster resource management)

More About Apache Tez

Apache Tez is a framework for executing complex data processing pipelines on top of Apache Hadoop YARN. Applications model their workflows as directed acyclic graphs (DAGs) of tasks connected by data movement and processing relationships. Its configurable execution engine replaces rigid, fixed-function patterns, such as a fixed map-then-reduce structure, with a programmable model in which developers and higher-level systems describe their dataflows explicitly.

At its core, Tez defines a DAG application model in which vertices represent processing steps, each of which can run as multiple parallel tasks, and edges represent the data transfer and dependency relationships between those steps. The model supports parallelism, pipelining, and data partitioning, which covers a range of processing patterns from batch-style workloads to interactive query-style workloads. The framework manages task execution, data shuffles, and communication between vertices based on this DAG description.
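The vertex-and-edge model can be illustrated with a small, library-free sketch. It is not the Tez API: the function `topological_order` and the vertex names are hypothetical, and the code only shows how a DAG description determines a valid execution order, in which a vertex becomes runnable once all of its upstream vertices have finished.

```python
from collections import deque


def topological_order(vertices, edges):
    """Return vertices so that every edge's source precedes its target (Kahn's algorithm)."""
    indegree = {v: 0 for v in vertices}
    downstream = {v: [] for v in vertices}
    for src, dst in edges:
        downstream[src].append(dst)
        indegree[dst] += 1

    # Vertices with no incoming edges have no dependencies and can start first.
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for nxt in downstream[v]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


# Hypothetical dataflow: two scans feed a join, which feeds an aggregation.
vertices = ["scan_orders", "scan_customers", "join", "aggregate"]
edges = [
    ("scan_orders", "join"),
    ("scan_customers", "join"),
    ("join", "aggregate"),
]
order = topological_order(vertices, edges)
print(order)
```

In this sketch the two scan vertices have no dependencies on each other, so a real engine could run them in parallel. The cycle check reflects why the graph must be acyclic: a cycle would leave some vertex waiting on itself.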

Tez runs as a user application on YARN and integrates with YARN's resource management. It requests containers, manages their lifecycle, and coordinates distributed tasks across the cluster, relying on YARN for scheduling and resource isolation. This design lets Tez-based applications share a Hadoop cluster with other YARN workloads while using execution logic tailored to their own data processing needs.
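As a rough illustration of why the number of granted containers matters, the sketch below groups a vertex's tasks into waves when fewer containers are available than tasks. This is a deliberate simplification and an assumption for illustration only: real scheduling on YARN is dynamic and does not proceed in fixed waves, and the function `schedule_in_waves` is hypothetical rather than part of Tez or YARN.

```python
def schedule_in_waves(task_ids, containers):
    """Group tasks into consecutive waves of at most `containers` tasks each."""
    if containers < 1:
        raise ValueError("at least one container is required")
    return [
        task_ids[i:i + containers]
        for i in range(0, len(task_ids), containers)
    ]


# Hypothetical vertex with 10 tasks and 4 containers granted by the cluster.
tasks = [f"task_{n}" for n in range(10)]
waves = schedule_in_waves(tasks, containers=4)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

With these inputs the ten tasks need three waves. If the cluster grants more containers, the same work completes in fewer waves, which is the kind of sharing trade-off that arises when several workloads run on one cluster.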

For developers and system integrators, Tez exposes APIs for constructing DAGs programmatically, defining processors that implement vertex logic, and configuring data sources, sinks, and edge properties such as data movement type and partitioning. Query engines and higher-level tools use these APIs to compile user queries or scripts into Tez DAGs, delegating distributed execution to the Tez engine while keeping their own language or interface.
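The effect of an edge's data movement type and partitioning can be sketched without the Tez library. The mode names below mirror Tez's one-to-one, broadcast, and scatter-gather data movement types, but the function `route`, its parameters, and the CRC32-based partitioner are assumptions made for illustration, not how Tez implements them.

```python
import zlib


def route(records, movement, num_downstream, source_task=0):
    """Distribute (key, value) records from one upstream task to downstream tasks."""
    outputs = [[] for _ in range(num_downstream)]
    for key, value in records:
        if movement == "SCATTER_GATHER":
            # Hash-partition by key so equal keys reach the same downstream task.
            index = zlib.crc32(key.encode("utf-8")) % num_downstream
            outputs[index].append((key, value))
        elif movement == "BROADCAST":
            # Every downstream task receives a copy of every record.
            for out in outputs:
                out.append((key, value))
        elif movement == "ONE_TO_ONE":
            # Upstream task i feeds only downstream task i.
            outputs[source_task].append((key, value))
        else:
            raise ValueError(f"unknown movement type: {movement}")
    return outputs


records = [("apple", 1), ("pear", 2), ("apple", 3), ("plum", 4)]
scattered = route(records, "SCATTER_GATHER", num_downstream=3)
broadcast = route(records, "BROADCAST", num_downstream=3)
one_to_one = route(records, "ONE_TO_ONE", num_downstream=3, source_task=1)
print(scattered)
```

Scatter-gather suits aggregations and joins, because all values for a key end up at the same task. Broadcast suits small lookup inputs that every task needs in full, and one-to-one preserves an existing partitioning between two vertices.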

In enterprise environments, Tez is positioned as a general-purpose execution substrate for Hadoop-based analytical workloads. Organizations can implement custom or tool-generated data processing pipelines while reusing existing Hadoop storage and YARN-based compute clusters. With its focus on DAG-oriented processing, configurability, and YARN integration, Tez belongs to the category of distributed data processing frameworks that operate as engines underneath query systems and data processing platforms.