Posted to reviews@spark.apache.org by devaraj-kavali <gi...@git.apache.org> on 2018/10/16 23:59:44 UTC

[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

GitHub user devaraj-kavali opened a pull request:

    https://github.com/apache/spark/pull/22752

    [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs

    ## What changes were proposed in this pull request?
    
    `hsync` was added as part of SPARK-19531 so that the history server UI shows the latest data, but it causes performance overhead and leads to many history log events being dropped. `hsync` uses `FileChannel.force` to sync the data to disk, and this happens across the data pipeline; it is a costly operation, so the application incurs overhead and drops events.
    
    I think the history server can get the latest data in a different way, with no impact on the application while it writes events. The API `DFSInputStream.getFileLength()` returns the file length including `lastBlockBeingWrittenLength` (unlike `FileStatus.getLen()`). It can be used when the file status length and the previously cached length are equal, to verify whether any new data has been written; if the data length has changed, the history server can update the in-progress history log. I also made this change configurable, with a default value of false; it can be enabled for the history server if users want to see the updated data in the UI.
    
    ## How was this patch tested?
    
    Added a new test and verified manually: with the added conf `spark.history.fs.inProgressAbsoluteLengthCheck.enabled=true`, the history server reads the logs including the last block data that is being written and updates the Web UI with the latest data.
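
The check described above can be sketched without any Hadoop dependency. This is a minimal model, not the code in the patch: `CachedLog`, `InProgressCheck`, `needsReparse`, and the parameter names are hypothetical, `statusLen` stands in for `FileStatus.getLen()`, and `absoluteLen` stands in for `DFSInputStream.getFileLength()`.

```scala
// Hypothetical stand-in for the cached listing entry; not a Spark class.
final case class CachedLog(path: String, fileSize: Long)

object InProgressCheck {
  /**
   * statusLen:   length reported by the file listing (like FileStatus.getLen()).
   * absoluteLen: thunk for the length that includes the block still being
   *              written (like DFSInputStream.getFileLength()); it is only
   *              evaluated when the cheaper comparison is inconclusive.
   */
  def needsReparse(
      cached: CachedLog,
      statusLen: Long,
      absoluteLen: () => Long,
      absoluteCheckEnabled: Boolean): Boolean = {
    if (cached.fileSize < statusLen) {
      true // the listing already shows growth
    } else if (absoluteCheckEnabled && cached.fileSize == statusLen) {
      cached.fileSize < absoluteLen() // look past the last completed block
    } else {
      false
    }
  }
}
```

With the flag off, a log whose listed length has not moved is never reparsed; with it on, growth inside the last block is enough to trigger a reparse.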


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/devaraj-kavali/spark SPARK-24787

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22752.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22752
    
----
commit a3f53c41879e28d71d4dbd79d80a51e50d82ecee
Author: Devaraj K <de...@...>
Date:   2018-10-16T23:50:20Z

    [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make
    FsHistoryProvider to read lastBlockBeingWritten data for logs

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97554/
    Test PASSed.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    **[Test build #97554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97554/testReport)** for PR 22752 at commit [`c2f2705`](https://github.com/apache/spark/commit/c2f2705422c00d07753553d1baa433206d15ac75).


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by devaraj-kavali <gi...@git.apache.org>.
Github user devaraj-kavali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226014153
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                   listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
                 }
     
    -            if (info.fileSize < entry.getLen()) {
    +            if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff --
    
    Thanks @steveloughran for looking into this.
    > Have you looked @ this getFileLength() call to see how well it updates?
    
    I looked at the DFSInputStream.getFileLength() API; it returns locatedBlocks.getFileLength() + lastBlockBeingWrittenLength. Here locatedBlocks.getFileLength() is the value obtained from the NameNode for all the completed blocks, and lastBlockBeingWrittenLength is the length of the last block, obtained from the DataNode, which is not yet a completed block.
    
    > FwIW HADOOP-15606 proposes adding a method like this for all streams
    
    Thanks for the pointer, once this is available we can update to use it.
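
As a rough illustration of the arithmetic in the comment above, the following sketch models the two lengths. `BlockView` and its field names are hypothetical and only mirror the terms used in the comment; the sketch does not call HDFS.

```scala
// Hypothetical model of a file that is still being written.
final case class BlockView(
    completedBlockLengths: Seq[Long],
    lastBlockBeingWrittenLength: Long) {

  // Length covering completed blocks only (the NameNode's view, per the comment).
  def locatedLength: Long = completedBlockLengths.sum

  // Completed blocks plus the partially written last block (the DataNode's view).
  def absoluteLength: Long = locatedLength + lastBlockBeingWrittenLength
}
```

For example, `BlockView(Seq(128L, 128L), 40L)` has a `locatedLength` of 256 and an `absoluteLength` of 296, so a reader that only compares the first value would miss the 40 units written so far to the open block.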


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226243409
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                   listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
                 }
     
    -            if (info.fileSize < entry.getLen()) {
    +            if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff --
    
    ...there's no timetable for that getLength thing, but if HDFS already supports the API, I'm more motivated to implement it. It has benefits for cloud stores in general:
    1. It saves apps doing an up-front HEAD/getFileStatus() to know how long their data is; the GET should return it.
    2. For S3 Select, you get back the filtered data, so you don't know how much you will see until the GET is issued.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    **[Test build #97474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97474/testReport)** for PR 22752 at commit [`a3f53c4`](https://github.com/apache/spark/commit/a3f53c41879e28d71d4dbd79d80a51e50d82ecee).


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by devaraj-kavali <gi...@git.apache.org>.
Github user devaraj-kavali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226429600
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/config.scala ---
    @@ -64,4 +64,11 @@ private[spark] object config {
           .bytesConf(ByteUnit.BYTE)
           .createWithDefaultString("1m")
     
    +  val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
    +    ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
    --- End diff --
    
    This call makes two invocations: one to get the block info from the NameNode and another to get the last block info from the DataNode. I agree this is not a performance-critical path; I will remove the config in the update.


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226420000
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/config.scala ---
    @@ -64,4 +64,11 @@ private[spark] object config {
           .bytesConf(ByteUnit.BYTE)
           .createWithDefaultString("1m")
     
    +  val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
    +    ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
    --- End diff --
    
    How much overhead are we talking about?
    
    That thread is not really in any performance critical path, and in general people won't have so many running apps that this should become a problem... but that kinda depends on how bad this new call is.


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by devaraj-kavali <gi...@git.apache.org>.
Github user devaraj-kavali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226014282
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/config.scala ---
    @@ -64,4 +64,11 @@ private[spark] object config {
           .bytesConf(ByteUnit.BYTE)
           .createWithDefaultString("1m")
     
    +  val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
    +    ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
    +      .doc("Enable to check the absolute length of the in-progress" +
    --- End diff --
    
    Do you have anything in mind to make it better? Thanks.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    **[Test build #97474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97474/testReport)** for PR 22752 at commit [`a3f53c4`](https://github.com/apache/spark/commit/a3f53c41879e28d71d4dbd79d80a51e50d82ecee).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    add to whitelist


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by devaraj-kavali <gi...@git.apache.org>.
Github user devaraj-kavali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226416951
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/config.scala ---
    @@ -64,4 +64,11 @@ private[spark] object config {
           .bytesConf(ByteUnit.BYTE)
           .createWithDefaultString("1m")
     
    +  val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
    +    ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
    --- End diff --
    
    This new check adds overhead to the checkForLogs thread. I made it disabled by default since most users may not want to see the history UI for running applications; they can enable it explicitly if they want to see in-progress apps in the history UI. I can remove this config if you think it is not very useful.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Merging to master / 2.4 (will run a couple of tests on 2.4 before merging there).


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by devaraj-kavali <gi...@git.apache.org>.
Github user devaraj-kavali commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    @vanzin, can you check the updated changes? Thanks.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    ok to test


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226409012
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/config.scala ---
    @@ -64,4 +64,11 @@ private[spark] object config {
           .bytesConf(ByteUnit.BYTE)
           .createWithDefaultString("1m")
     
    +  val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
    +    ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
    --- End diff --
    
    Is there any disadvantage in just leaving this always on? Otherwise this doesn't need to be configurable.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r225845782
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                   listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
                 }
     
    -            if (info.fileSize < entry.getLen()) {
    +            if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff --
    
    I think we can create a function to get the length of given file:
    1. If the new conf is enabled and the input is a DFSInputStream, use `getFileLength` (or `max(getFileLength, entry.getLen())`)
    2. otherwise `entry.getLen()`
    
    The logic can be simpler.
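
The two-case function suggested above might look roughly like the sketch below. Everything here is hypothetical: `LogStream`, `DfsLikeStream`, `OtherStream`, and `currentLength` are illustrative names, and `DfsLikeStream` merely stands in for a stream that can report a length including the block being written.

```scala
object EffectiveLength {
  // Hypothetical stream abstraction: only some streams expose an absolute length.
  sealed trait LogStream
  final case class DfsLikeStream(absoluteLength: Long) extends LogStream
  case object OtherStream extends LogStream

  // Case 1: conf enabled and a DFS-like stream, take the larger of the two lengths.
  // Case 2: otherwise fall back to the length from the file listing.
  def currentLength(statusLen: Long, stream: LogStream, confEnabled: Boolean): Long =
    stream match {
      case DfsLikeStream(abs) if confEnabled => math.max(abs, statusLen)
      case _ => statusLen
    }
}
```

The caller would then compare the cached size against a single number. One consequence of this shape is that the absolute length has to be known before the comparison is made.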


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r225846631
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/config.scala ---
    @@ -64,4 +64,11 @@ private[spark] object config {
           .bytesConf(ByteUnit.BYTE)
           .createWithDefaultString("1m")
     
    +  val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
    +    ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
    +      .doc("Enable to check the absolute length of the in-progress" +
    --- End diff --
    
    Could you explain this in a little more detail, so that a general user can know what the benefit is?


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226407844
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -541,6 +542,23 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
         }
       }
     
    +  private[history] def checkAbsoluteLength(info: LogInfo, entry: FileStatus): Boolean = {
    --- End diff --
    
    The name of the method and the return value are a little cryptic. What does it mean to check?
    
    Might be better to call it something like `shouldReloadLog`. You could also move the existing check into this function and make the call site simpler.
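
A sketch of the rename and refactoring suggested above, using simplified stand-ins rather than the real `LogInfo` and `FileStatus` types. The `Entry` class, its optional `absoluteLen` field, and `logsToReload` are assumptions made for illustration only.

```scala
object ReloadDecision {
  // Hypothetical, simplified versions of the cached info and the listing entry.
  final case class LogInfo(logPath: String, fileSize: Long)
  final case class Entry(path: String, len: Long, absoluteLen: Option[Long])

  // The existing size comparison and the new absolute-length comparison live together,
  // so the name says what a `true` result means for the caller.
  def shouldReloadLog(info: LogInfo, entry: Entry): Boolean =
    info.fileSize < entry.len || entry.absoluteLen.exists(info.fileSize < _)

  // The call site is reduced to a single predicate per already-known log.
  def logsToReload(cached: Map[String, LogInfo], listing: Seq[Entry]): Seq[String] =
    listing.collect {
      case e if cached.get(e.path).exists(shouldReloadLog(_, e)) => e.path
    }
}
```

Entries that are not in the cache are skipped here on the assumption that newly discovered logs are handled by a separate code path.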


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22752


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r225908701
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                   listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
                 }
     
    -            if (info.fileSize < entry.getLen()) {
    +            if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff --
    
    Have you looked @ this getFileLength() call to see how well it updates?
     
    FwIW [HADOOP-15606](https://issues.apache.org/jira/browse/HADOOP-15606) proposes adding a method like this for all streams, though that proposal includes the need for specification and tests. Generally the HDFS team are a bit lax about that spec -> test workflow, which doesn't help downstream code or other implementations.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    **[Test build #97554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97554/testReport)** for PR 22752 at commit [`c2f2705`](https://github.com/apache/spark/commit/c2f2705422c00d07753553d1baa433206d15ac75).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingListener...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22752
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97474/
    Test PASSed.


---



[GitHub] spark pull request #22752: [SPARK-24787][CORE] Revert hsync in EventLoggingL...

Posted by devaraj-kavali <gi...@git.apache.org>.
Github user devaraj-kavali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226013841
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                   listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
                 }
     
    -            if (info.fileSize < entry.getLen()) {
    +            if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff --
    
    Thanks @gengliangwang for looking into this. Here there is no need to call checkAbsoluteLength if FileStatus.getLen() is greater than the cached fileSize; if we change this to `max(getFileLength, entry.getLen())`, the absolute length is always checked, which may not be necessary.
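
The difference between the two forms can be made concrete with a counter. This is an illustrative sketch only: `ShortCircuitDemo`, `remoteCalls`, `viaShortCircuit`, and `viaMax` are hypothetical names, and `absoluteLength` merely simulates a call that would otherwise go to the NameNode and DataNode.

```scala
object ShortCircuitDemo {
  // Counts how often the simulated remote length lookup is performed.
  var remoteCalls = 0

  def absoluteLength(value: Long): Long = { remoteCalls += 1; value }

  // `||` skips the right-hand side when the listing already shows growth.
  def viaShortCircuit(cached: Long, statusLen: Long, abs: Long): Boolean =
    cached < statusLen || cached < absoluteLength(abs)

  // `max` needs both operands, so the lookup happens on every call.
  def viaMax(cached: Long, statusLen: Long, abs: Long): Boolean =
    cached < math.max(absoluteLength(abs), statusLen)
}
```

Both functions return the same answer for the same inputs; they differ only in how many lookups they perform when the listed length has already grown.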


---
