
Posted to issues@spark.apache.org by "Devaraj K (JIRA)" <ji...@apache.org> on 2018/10/17 00:00:00 UTC

[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

    [ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652687#comment-16652687 ] 

Devaraj K commented on SPARK-24787:
-----------------------------------

It seems the overhead here comes from the FileChannel.force call in the DataNode, which hsync uses to write the data to the storage device. The cost of hsync is about the same with and without the SyncFlag.UPDATE_LENGTH flag, probably because updating the length is only a simple call to the NameNode.

I think the hsync change can be reverted. Instead, the history server can get the latest file length from DFSInputStream.getFileLength(), which includes lastBlockBeingWrittenLength. If the cached length is the same as FileStatus.getLen(), the history server can make an additional call to DFSInputStream.getFileLength() to get the latest length and then decide whether to update the history log.
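The check proposed in the comment can be sketched roughly as follows. This is a minimal Java sketch that does not call HDFS. The hypothetical LogLengthSource interface stands in for the two length sources: statusLength() plays the role of FileStatus.getLen() and visibleLength() plays the role of DFSInputStream.getFileLength(). The names InProgressLogChecker, latestLength and shouldReplay are illustrative and are not Spark's actual history server code.

```java
/**
 * Hypothetical abstraction over the two length sources in the comment.
 * statusLength()  ~ FileStatus.getLen(): cheap, but may not include the open last block.
 * visibleLength() ~ DFSInputStream.getFileLength(): also counts the block being written.
 * (The real HDFS calls throw IOException; that is omitted here for brevity.)
 */
interface LogLengthSource {
    long statusLength();
    long visibleLength();
}

/** Hypothetical helper that decides whether an in-progress event log has grown. */
final class InProgressLogChecker {
    private InProgressLogChecker() {}

    /**
     * Returns the length to treat as current. The cheaper status length is used
     * first; only when it equals the cached value (and so may be stale because
     * the last block is still open) is the more expensive visible length queried.
     */
    static long latestLength(long cachedLength, LogLengthSource src) {
        long statusLen = src.statusLength();
        if (statusLen != cachedLength) {
            return statusLen;
        }
        return src.visibleLength();
    }

    /** True when new bytes are visible and the log should be re-read. */
    static boolean shouldReplay(long cachedLength, LogLengthSource src) {
        return latestLength(cachedLength, src) > cachedLength;
    }
}
```

With this ordering the extra round trip for the visible length is paid only when the NameNode's view has not changed, which is the case the reverted hsync(UPDATE_LENGTH) call was meant to cover.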

> Events being dropped at an alarming rate due to hsync being slow for eventLogging
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-24787
>                 URL: https://issues.apache.org/jira/browse/SPARK-24787
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Web UI
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Sanket Reddy
>            Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the in-progress files, allowing the history server to stay responsive.
> However, we have a production job with 60000 tasks per stage, and because hsync is slow it starts dropping events, so the history server shows wrong stats.
> A viable solution is to not sync so frequently, or to make the sync frequency configurable.
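The issue's suggestion to sync less often, or configurably, could look roughly like the Java sketch below. It assumes the event log writer can count events and track elapsed time. The Runnable stands in for the expensive hsync() call, and ThrottledSyncer, maxEventsBetweenSyncs and minIntervalMillis are hypothetical names, not existing Spark classes or configuration keys.

```java
/**
 * Hypothetical rate limiter for an expensive sync call (e.g. hsync()).
 * A sync happens only when enough events or enough time have accumulated.
 */
final class ThrottledSyncer {
    private final Runnable syncAction;       // stand-in for the real hsync()
    private final int maxEventsBetweenSyncs; // hypothetical tuning knob
    private final long minIntervalMillis;    // hypothetical tuning knob
    private int eventsSinceSync = 0;
    private long lastSyncMillis;

    ThrottledSyncer(Runnable syncAction, int maxEventsBetweenSyncs,
                    long minIntervalMillis, long startMillis) {
        if (maxEventsBetweenSyncs < 1 || minIntervalMillis < 0) {
            throw new IllegalArgumentException("invalid throttle settings");
        }
        this.syncAction = syncAction;
        this.maxEventsBetweenSyncs = maxEventsBetweenSyncs;
        this.minIntervalMillis = minIntervalMillis;
        this.lastSyncMillis = startMillis;
    }

    /** Records one written event; returns true if a sync was triggered. */
    boolean onEvent(long nowMillis) {
        eventsSinceSync++;
        boolean enoughEvents = eventsSinceSync >= maxEventsBetweenSyncs;
        boolean enoughTime = nowMillis - lastSyncMillis >= minIntervalMillis;
        if (enoughEvents || enoughTime) {
            syncAction.run();
            eventsSinceSync = 0;
            lastSyncMillis = nowMillis;
            return true;
        }
        return false;
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic; a real writer would pass System.currentTimeMillis(). Exposing the two thresholds as settings corresponds to the "make it configurable" option, at the cost of the history server seeing updates less promptly.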



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
