Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2007/10/13 00:23:50 UTC

[jira] Commented: (HADOOP-1707) Remove the DFS Client disk-based cache

    [ https://issues.apache.org/jira/browse/HADOOP-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534427 ] 

dhruba borthakur commented on HADOOP-1707:
------------------------------------------

I have the following proposal in mind:

1. The client uses a small pool of memory buffers per DFS output stream, say 10 buffers of 64K each.
2. A write to the output stream copies the user data into one of these buffers if one is available; otherwise the user write blocks.
3. A separate thread (one per output stream) sends buffers that are full. Each buffer carries metadata: a sequence number (generated locally on the client), the length of the buffer, and its offset within the block.
4. Another thread (one per output stream) processes incoming responses. Each response contains the sequence number of the buffer that the datanode has processed, and the client removes that buffer from its queue. (A rough client-side sketch of steps 1-4 follows this list.)
5. The client gets an exception if the primary datanode fails. If a secondary datanode fails, the primary informs the client of this event.
6. If any datanode fails, the client removes it from the pipeline and resends all pending buffers to all known good datanodes.
7. A target datanode remembers the last sequence number that it has processed. It forwards each buffer to the next datanode in the pipeline. If the datanode receives a buffer that it has not processed before, it writes it to local disk. When the response arrives from downstream, it forwards the response back toward the client.
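
To make the buffer/queue interaction concrete, here is a rough client-side sketch of steps 1 through 4. The class and method names (PacketPipeline, Packet, sendToPipeline, readAckedSeqno) are made up for illustration and are not existing DFSClient code; datanode failure handling (steps 5 and 6) and the datanode-side sequence-number tracking (step 7) are left out.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PacketPipeline {

    /** One 64K buffer plus the per-buffer metadata from step 3. */
    static class Packet {
        long seqno;                // sequence number, generated locally on the client
        long offsetInBlock;        // offset of this buffer within the block
        int length;                // number of valid bytes in data
        final byte[] data = new byte[64 * 1024];
    }

    // Step 1: a small, fixed pool of buffers per output stream (10 x 64K).
    private final BlockingQueue<Packet> freeBuffers = new ArrayBlockingQueue<>(10);
    // Buffers that are full and waiting for the sender thread.
    private final BlockingQueue<Packet> dataQueue = new ArrayBlockingQueue<>(10);
    // Buffers sent but not yet acknowledged; these are what would be resent in step 6.
    private final Deque<Packet> ackQueue = new ArrayDeque<>();

    private long nextSeqno = 0;
    private long bytesWritten = 0;

    PacketPipeline() {
        for (int i = 0; i < 10; i++) {
            freeBuffers.add(new Packet());
        }
    }

    // Step 2: the user write copies into a free buffer; if none is free, it blocks.
    void write(byte[] userData, int off, int len) throws InterruptedException {
        while (len > 0) {
            Packet p = freeBuffers.take();           // blocks the user write
            int n = Math.min(len, p.data.length);
            System.arraycopy(userData, off, p.data, 0, n);
            p.length = n;
            p.seqno = nextSeqno++;
            p.offsetInBlock = bytesWritten;
            bytesWritten += n;
            off += n;
            len -= n;
            dataQueue.put(p);                        // hand off to the sender thread
        }
    }

    // Step 3: one sender thread per stream ships full buffers down the pipeline.
    final Runnable sender = () -> {
        try {
            while (true) {
                Packet p = dataQueue.take();
                synchronized (ackQueue) {
                    ackQueue.addLast(p);             // keep until acknowledged
                }
                sendToPipeline(p);                   // hypothetical network write
            }
        } catch (InterruptedException ignored) { }
    };

    // Step 4: one responder thread per stream consumes acks and recycles buffers.
    final Runnable responder = () -> {
        try {
            while (true) {
                long ackedSeqno = readAckedSeqno();  // hypothetical network read
                Packet acked;
                synchronized (ackQueue) {
                    acked = ackQueue.removeFirst();  // acks arrive in seqno order
                }
                freeBuffers.put(acked);              // buffer is free for reuse
            }
        } catch (InterruptedException ignored) { }
    };

    // Placeholders for the actual datanode socket I/O (not part of the proposal text).
    private void sendToPipeline(Packet p) { /* write seqno, offset, length, data */ }
    private long readAckedSeqno() { return -1; /* read ack for the oldest packet */ }
}

Note that buffers are retained in ackQueue until their acknowledgement comes back, so on a datanode failure (step 6) the contents of ackQueue plus dataQueue are exactly the pending buffers that would be resent to the remaining good datanodes.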





> Remove the DFS Client disk-based cache
> --------------------------------------
>
>                 Key: HADOOP-1707
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1707
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>             Fix For: 0.16.0
>
>
> The DFS client currently uses a staging file on local disk to cache all user-writes to a file. When the staging file accumulates 1 block worth of data, its contents are flushed to a HDFS datanode. These operations occur sequentially.
> A simple optimization of allowing the user to write to another staging file while simultaneously uploading the contents of the first staging file to HDFS will improve file-upload performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.