Skip to main content

Hadoop Distributed File System (HDFS)

The Hadoop

Distributed File System (DFS) (HDFS) is a distributed, scalable file system (distributed storage) designed to store very large datasets across clusters of commodity servers and provide high-throughput access to application data.

  • Distributed, fault-tolerant storage for large files across clusters of commodity hardware (distributed storage)
  • Master/worker architecture using a NameNode for metadata and DataNodes for block storage (storage architecture)
  • Optimized for streaming data access and high aggregate throughput rather than low-latency random I/O (data access pattern)
  • Replication of data blocks across multiple DataNodes for reliability and fault tolerance (data protection)
  • Support for scaling to thousands of nodes and petabytes of data within a single file system namespace (scalable infrastructure)

More About HDFS

The Hadoop DFS (HDFS) is a DFS (distributed storage) designed to run on clusters of commodity hardware and to provide reliable, scalable storage for large datasets. It is a core component of the Apache Hadoop project under The Apache Software Foundation and is built to support data-intensive applications that process multi-gigabyte to terabyte-scale files.

HDFS uses a master/worker architecture (storage architecture) consisting of a central NameNode and multiple DataNodes. The NameNode manages the file system namespace, directory hierarchy, and metadata such as file-to-block mappings, permissions, and replication factors. DataNodes store the actual data blocks on local disks and serve read and write requests from clients. This separation of metadata and data storage enables centralized namespace management with distributed I/O throughput across the cluster.

Files in HDFS are split into fixed-size blocks (data layout), which are distributed across DataNodes. Each block is replicated across multiple nodes according to a configurable replication factor (data protection), which provides fault tolerance against DataNode or disk failures. The system is designed so that if a DataNode fails, the NameNode detects the loss of block replicas and coordinates re-replication to maintain the desired number of replicas.

HDFS is optimized for streaming access to large files (analytics storage) rather than low-latency random access. The design favors high aggregate throughput and batch processing workloads, where applications read and write data in large, sequential operations. The append-only write model and large block sizes reduce metadata overhead and improve efficiency for large-scale jobs.

From a deployment perspective, HDFS typically runs on clusters that may contain thousands of DataNodes (cluster infrastructure). The architecture supports a single namespace managed by the NameNode, while DataNodes contribute storage capacity and I/O bandwidth. The file system includes mechanisms for heartbeat and block reports from DataNodes to the NameNode to track node health and block locations.

In enterprise environments, HDFS is used as a base storage layer for data processing frameworks that operate on distributed datasets (big data infrastructure). Its design to run on commodity hardware allows organizations to build large storage clusters with standard servers and disks. The system’s replication model and rack-aware placement policies support reliability and help maintain data availability in the presence of node or rack failures.

Within a technical taxonomy, HDFS is positioned as a DFS and storage substrate for large-scale batch and analytical workloads (data infrastructure). It provides namespace management, block storage and replication, and cluster coordination primitives that are used by higher-level computation frameworks within the Apache Hadoop ecosystem.