Last time I briefly introduced HDFS, covering the definition of a Block and the overall architecture. In this post, I will continue with the processes of writing and reading data in HDFS, and I have made some pictures to simplify the textual description.

The process of writing data in HDFS

Step 1: The client requests the NameNode to upload a file through the DistributedFileSystem module, and the NameNode checks whether the target file already exists and whether its parent directory exists.

Step 2: The NameNode returns whether the file can be uploaded.

Step 3: The client requests to upload the first Block and asks the NameNode which DataNodes to write it to.

Step 4: The NameNode returns three DataNodes, namely DataNode 1, DataNode 2, and DataNode 3.

Step 5: Through the FSDataOutputStream module, the client asks DataNode 1 to receive the data. When DataNode 1 gets the request, it calls DataNode 2, and DataNode 2 in turn calls DataNode 3, establishing the communication pipeline.

Step 6: DataNode 1, DataNode 2, and DataNode 3 acknowledge the client step by step (the acknowledgement propagates back along the pipeline).

Step 7: The client starts uploading the first Block to DataNode 1 (it first reads the data from disk into a local memory cache and packages it into Packets). When DataNode 1 receives a Packet, it passes the Packet to DataNode 2, and DataNode 2 passes it on to DataNode 3. Each Packet that has been sent is placed in a reply queue, where it waits for acknowledgement.

Step 8: When the transmission of a Block is completed, the client requests the NameNode again for the DataNodes to store the second Block, and steps 3-7 are repeated for each remaining Block.
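
To make the write path concrete, here is a minimal client-side sketch using the Hadoop Java API. It assumes the Hadoop client libraries are on the classpath; the NameNode address (hdfs://localhost:9000), the file path, and the class name are placeholders of my own. The pipeline of steps 5-7 is driven entirely inside FSDataOutputStream, so the caller only sees an ordinary output stream.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point this at your own NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // create() corresponds to Steps 1-2: DistributedFileSystem asks the
        // NameNode to create the file entry after checking it does not exist.
        Path target = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(target)) {
            // write() buffers the bytes into Packets; the DataNode pipeline
            // of Steps 5-7 runs behind the scenes inside the stream.
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        } // close() flushes the remaining Packets and waits for the acks.

        fs.close();
    }
}
```

Note that closing the stream blocks until the outstanding Packets have been flushed and acknowledged, which is why a write only "succeeds" once the whole pipeline has confirmed it.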


The process of reading data in HDFS

Step 1: The client requests the NameNode to download a file through the DistributedFileSystem module, and the NameNode looks up its metadata to find the DataNode addresses where the file's Blocks are stored.

Step 2: The client chooses a DataNode (usually the nearest one) and requests to read the data.

Step 3: The DataNode starts transmitting the data to the client (it reads an input stream from disk and verifies the data with checksums as it goes).

Step 4: The client receives the data in Packets, caches them locally, and then writes them to the target file.
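
The read path is just as simple from the client's point of view. Below is a minimal sketch with the same assumptions as before (placeholder NameNode address and file path); the block-location lookup of step 1 happens inside open(), and the DataNode selection and checksum verification of steps 2-3 happen inside the returned stream.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // open() corresponds to Step 1: the NameNode returns the list of
        // DataNodes holding each Block of the file.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
            // The stream connects to a (preferably nearby) DataNode for each
            // Block and verifies checksums while copying the bytes out.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```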


Summary

In fact, the description above is not complete. In HDFS, the distance between nodes is also a very important factor: it determines which nodes are chosen to store the replicas.
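
HDFS models the cluster as a tree (data center / rack / node) and measures the distance between two nodes as the total number of hops to their closest common ancestor: 0 for the same node, 2 for two nodes on the same rack, 4 for nodes on different racks. The toy function below (a sketch of the idea, not Hadoop's actual NetworkTopology class) computes that distance from path-style node names:

```java
public class TopologyDistance {
    // Nodes are identified by tree paths such as "/dc1/rack1/node1".
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int i = 0;
        // Walk down the tree while the two paths share an ancestor.
        while (i < pa.length && i < pb.length && pa[i].equals(pb[i])) {
            i++;
        }
        // Distance = hops from each node up to the common ancestor.
        return (pa.length - i) + (pb.length - i);
    }

    public static void main(String[] args) {
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack1/node1")); // 0: same node
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack1/node2")); // 2: same rack
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack2/node3")); // 4: different racks
    }
}
```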

Similarly, the way a DataNode stores data is also very interesting, so I'd like to mention it here. After the communication pipeline is established, the DataNode writes the data to memory and to disk simultaneously, which makes storage more efficient. At the same time, the DataNode also opens a queue to collect the reply messages from each node, ensuring that the data has been written successfully.
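
As a rough, single-threaded illustration of that reply queue: sent Packets move from a data queue to an ack queue and are only discarded once the last node in the pipeline has acknowledged them (on failure they would be put back for resending). The class and method names below are hypothetical; the real client uses separate sender and response-handling threads.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PacketQueuesSketch {
    public static void main(String[] args) {
        Deque<String> dataQueue = new ArrayDeque<>(); // Packets waiting to be sent
        Deque<String> ackQueue = new ArrayDeque<>();  // sent Packets awaiting acks
        for (int i = 0; i < 3; i++) {
            dataQueue.add("packet-" + i);
        }

        while (!dataQueue.isEmpty()) {
            String pkt = dataQueue.poll();
            send(pkt);        // push the Packet down the pipeline (simulated)
            ackQueue.add(pkt); // remember it until the ack comes back
        }

        while (!ackQueue.isEmpty()) {
            String pkt = ackQueue.poll(); // ack received from the pipeline
            System.out.println("ack for " + pkt + " -> safe to discard");
        }
    }

    static void send(String pkt) {
        System.out.println("sending " + pkt + " to DataNode 1");
    }
}
```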

Having a basic understanding of the detailed processes of writing and reading data in HDFS helps us better understand how HDFS works, and also makes the concept of distribution in big data clearer.
