Posted to issues@ozone.apache.org by "Marton Elek (Jira)" <ji...@apache.org> on 2020/03/11 12:12:00 UTC

[jira] [Commented] (HDDS-3155) Improved ozone flush implementation to make it faster.

    [ https://issues.apache.org/jira/browse/HDDS-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056943#comment-17056943 ] 

Marton Elek commented on HDDS-3155:
-----------------------------------

Thank you for opening the issue, [~micahzhao]. Very interesting problem:

> This approach is not very efficient at the moment

Just FYI: HDDS-2717 will be merged very soon. It creates one file per block instead of one file per chunk. I measured significantly better throughput (>4k instead of 1.2k chunk writes / second on a single pipeline with 1k chunks).
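
To illustrate only the layout difference (this is not the HDDS-2717 code; the class, method, and file names below are made up for the example), here is a minimal Java sketch that writes the same chunks either as one file per chunk or appended to a single file per block:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.Stream;

// Hypothetical illustration; none of these names come from Ozone.
public class ChunkLayoutSketch {

    // One file per chunk: every chunk write creates a new file.
    public static long writeFilePerChunk(Path dir, int chunks, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        for (int i = 0; i < chunks; i++) {
            Files.write(dir.resolve("block_1_chunk_" + i), chunk);
        }
        return countFiles(dir);
    }

    // One file per block: every chunk is appended to the same file.
    public static long writeFilePerBlock(Path dir, int chunks, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        try (OutputStream out = Files.newOutputStream(dir.resolve("block_1"),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < chunks; i++) {
                out.write(chunk);
            }
        }
        return countFiles(dir);
    }

    // Number of directory entries left behind by a write strategy.
    private static long countFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path perChunk = Files.createTempDirectory("per-chunk");
        Path perBlock = Files.createTempDirectory("per-block");
        System.out.println("files (per chunk): " + writeFilePerChunk(perChunk, 100, 1024));
        System.out.println("files (per block): " + writeFilePerBlock(perBlock, 100, 1024));
    }
}
```

The sketch only shows how many files each strategy leaves on disk; it says nothing about the throughput numbers above, which depend on the real datanode code path.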

> Instead of writing to disk each time, flush writes the contents of the buffer to the datanode's OS buffer. In the first place, we need to ensure that this content can be read by other datanodes.

Can you please share more details? I don't understand it very well, maybe because I am not familiar with the HDFS flush. AFAIK we don't make any fdatasync call on the datanode side, so even now we write everything to the OS cache.

I think our current flush just ensures that the Ratis transaction is committed.

But it's true that we always do a sync on the Ratis side after every log entry (which makes the Ratis log safe but slow). Is that the part you would like to change?
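
For reference, the difference between "written to the OS cache" and "synced to disk" can be sketched in plain Java as below. This is only an illustration with hypothetical names, not the datanode or Ratis code. FileChannel.force(false) is the JDK call that asks for the file content (not necessarily its metadata) to be forced to the storage device, which is roughly the role fdatasync plays.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; none of these names come from Ozone or Ratis.
public class FlushVsSync {

    // Appends data to the file and returns the resulting file size.
    // Without sync, the bytes are handed to the OS and may still sit in its cache.
    // With sync, force(false) additionally asks the OS to persist the content.
    public static long write(Path file, byte[] data, boolean sync) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            if (sync) {
                ch.force(false); // content only, metadata update not required
            }
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("flush-sketch", ".bin");
        byte[] entry = new byte[1024];

        long start = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            write(file, entry, false); // OS cache only
        }
        long cached = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            write(file, entry, true); // sync after every entry
        }
        long synced = System.nanoTime() - start;

        System.out.println("no sync: " + cached / 1_000_000 + " ms");
        System.out.println("sync:    " + synced / 1_000_000 + " ms");
    }
}
```

In both cases the data is readable by other processes on the same machine right after the write; the sync only changes whether it survives a crash or power loss, and it is usually the expensive part.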

> Improved ozone flush implementation to make it faster.
> ------------------------------------------------------
>
>                 Key: HDDS-3155
>                 URL: https://issues.apache.org/jira/browse/HDDS-3155
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: mingchao zhao
>            Priority: Major
>         Attachments: amlog, stdout
>
>
> Background:
>     When we run MapReduce on Ozone, we find that the job is stuck for a long time after Map and Reduce have completed. The log is as follows:
> {code:java}
> //Refer to the attachment: stdout
> 20/03/05 14:43:30 INFO mapreduce.Job: map 100% reduce 33% 
> 20/03/05 14:43:33 INFO mapreduce.Job: map 100% reduce 100% 
> 20/03/05 15:29:52 INFO mapreduce.Job: Job job_1583385253878_0002 completed successfully{code}
>     Looking at the AM's log (refer to the attached amlog for details), we found that the AM spent these 40+ minutes writing task logs into Ozone.
>     At present, after the MR job has run, the AM records the task information into a log on HDFS or Ozone. Moreover, the task information is flushed to HDFS or Ozone one entry at a time ([details|https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryEventHandler.java#L1640]). The problem occurs when the number of map tasks is large.
>     Currently, each flush operation in Ozone generates a new chunk file on the disk in real time. This approach is not very efficient at the moment. For this we can refer to the implementation of HDFS flush. Instead of writing to disk each time, flush writes the contents of the buffer to the datanode's OS buffer. In the first place, we need to ensure that this content can be read by other datanodes.
>  


