Apache Avro
Apache Avro is a data serialization system (data serialization) that defines compact binary formats, schemas, and Remote Procedure Call (RPC) mechanisms for structured data exchange across diverse applications and languages.
- Row-oriented data serialization framework with a compact binary format (data serialization)
- Schema-based data definition using JSON for data structures and types (data modeling)
- Support for schema evolution and resolution between writer and reader schemas (data governance)
- Remote Procedure Call facilities over Avro data, including protocol definitions (application integration)
- Interoperability with multiple programming languages and Hadoop ecosystem tools (big data integration)
More About Apache Avro
Apache Avro is a data serialization system (data serialization) under The Apache Software Foundation that provides a compact, fast, binary data format, a rich data structure definition model, and an associated Remote Procedure Call mechanism. It is designed for scenarios where applications need to serialize structured data, persist it, or exchange it between services, components, or systems that may be written in different programming languages.
Avro defines data schemas using JSON (data modeling), describing record types, primitive and complex types, and nested structures. These schemas govern how data is serialized into Avro's binary format and how it is deserialized by consumers. The system separates the writer schema (used when data is produced) from the reader schema (used when data is consumed), and includes rules for resolving differences between them, which supports schema evolution in environments where data formats change over time.
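To make the writer/reader distinction concrete, the following is a minimal sketch in Python using only the standard library, not the Avro library itself. The `User` record, its field names, and the `resolve_record` helper are hypothetical and exist only for illustration. The helper covers two record-level resolution rules: writer fields absent from the reader schema are ignored, and reader fields absent from the writer schema take their declared default. It operates on an already-decoded dictionary and leaves out aliases, type promotion, and nested types.

```python
import json

# Hypothetical writer schema: the shape the producer used when writing the data.
WRITER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "legacy_flag", "type": "boolean"}
  ]
}
""")

# Hypothetical reader schema: drops "legacy_flag" and adds "email" with a default.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")


def resolve_record(datum, writer_schema, reader_schema):
    """Project a decoded record onto the reader schema (simplified sketch)."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field known to both sides: carry the written value over.
            resolved[name] = datum[name]
        elif "default" in field:
            # Field only the reader knows: fall back to its default.
            resolved[name] = field["default"]
        else:
            # No written value and no default: the schemas cannot be resolved.
            raise ValueError(f"reader field {name!r} has no default")
    # Writer-only fields (here "legacy_flag") are simply not copied.
    return resolved


written = {"id": 1, "name": "Ada", "legacy_flag": True}
print(resolve_record(written, WRITER_SCHEMA, READER_SCHEMA))
```

In this sketch, giving the new `email` field a default is what lets a newer reader consume older data without failing.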
The Avro object container file format (data storage) embeds schema information with the data payload. This allows data files to be self-describing: a consumer can read the schema directly from the file and interpret the contained records without an external schema registry. Avro also defines a compact binary encoding (data transport) that omits field names and per-value type tags and relies on the schema to interpret the bytes. This keeps overhead low in batch and streaming processing frameworks that operate on large datasets.
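As an illustration of why the encoding is compact, here is a small Python sketch of how the Avro specification encodes `long` and `string` values, written by hand rather than with the Avro library. The helper names (`encode_long`, `decode_long`, `encode_string`, `encode_example_record`) are assumptions made for this example. A record is encoded as its field values concatenated in schema order, so the bytes contain no field names.

```python
def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag, variable-length value."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # final byte, continuation bit clear
    return bytes(out)


def decode_long(buf: bytes, pos: int = 0):
    """Decode a zig-zag varint starting at pos; return (value, next position)."""
    z = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_string(s: str) -> bytes:
    """Encode a string as a long byte length followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_example_record(a: int, b: str) -> bytes:
    """Encode a hypothetical record with fields a: long, b: string, in schema order."""
    return encode_long(a) + encode_string(b)


payload = encode_example_record(27, "foo")
print(payload.hex())  # five bytes in total: one for 27, one for the length, three for "foo"
```

Because the payload carries only values, a reader that lacks the writer schema cannot tell where one field ends and the next begins, which is why container files store the schema alongside the data.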
On the RPC side, Apache Avro provides facilities for defining protocols (application integration) in JSON, specifying messages, request and response schemas, and errors. These protocol definitions can be used to generate client and server code in supported languages, enabling services to communicate using Avro serialization over a chosen transport. This aligns with architectures where services exchange structured, strongly typed messages while keeping wire formats compact and language-neutral.
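A protocol definition of this kind can be sketched as follows. The `com.example` namespace, the `HelloWorld` protocol, and its `Greeting` and `Curse` types are hypothetical names chosen for illustration; the structure (`types`, `messages`, `request`, `response`, `errors`) follows the Avro protocol declaration format. The Python below only parses the JSON with the standard library and summarizes each message. It does not generate code or perform any RPC, and `describe_messages` is an assumed helper that handles only named types referenced as strings.

```python
import json

# Hypothetical protocol declaration in Avro's JSON protocol format.
PROTOCOL = json.loads("""
{
  "namespace": "com.example",
  "protocol": "HelloWorld",
  "types": [
    {"name": "Greeting", "type": "record",
     "fields": [{"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error",
     "fields": [{"name": "message", "type": "string"}]}
  ],
  "messages": {
    "hello": {
      "request": [{"name": "greeting", "type": "Greeting"}],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
""")


def describe_messages(protocol):
    """Return one readable signature line per message in the protocol."""
    lines = []
    for name, message in protocol["messages"].items():
        # Each request parameter is a named, typed field, like a record field.
        params = ", ".join(f"{p['name']}: {p['type']}" for p in message["request"])
        line = f"{name}({params}) -> {message['response']}"
        errors = ", ".join(message.get("errors", []))
        if errors:
            line += f", errors: {errors}"
        lines.append(line)
    return lines


for signature in describe_messages(PROTOCOL):
    print(signature)
```

Each message reads much like a typed function signature, which is the information a code generator would need to produce client and server stubs.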
Avro has language bindings and tool support for multiple programming environments (multi-language interoperability), which enables use in heterogeneous enterprise stacks. It is widely used with Apache Hadoop and related Apache projects (big data integration), including for storing datasets in Avro files and for exchanging data between processing jobs. Because schemas are explicit and machine-readable, Avro often fits into data governance and metadata management workflows, where enterprises track schema versions and data contracts between producers and consumers.
In an enterprise taxonomy, Apache Avro can be categorized primarily as a data serialization framework (data serialization) with associated schema definition and evolution capabilities (data modeling, data governance) and an RPC facility (application integration). It functions as a core building block for interoperable data pipelines, storage formats, and service interfaces that require compact binary encoding and JSON-based schemas.