Posted to reviews@spark.apache.org by ScrapCodes <gi...@git.apache.org> on 2018/09/05 09:05:03 UTC

[GitHub] spark pull request #22339: SPARK-17159 Significant speed up for running spar...

GitHub user ScrapCodes opened a pull request:

    https://github.com/apache/spark/pull/22339

    SPARK-17159 Significant speed up for running spark streaming against Object store.

    
    ## What changes were proposed in this pull request?
    
    
    Original work by Steve Loughran.
    Based on #17745. 
    
    This is a minimal patch of changes to FileInputDStream to reduce file status requests when querying files. Each file status call costs 3+ HTTP calls to an object store. This patch eliminates the need for those calls by using the FileStatus objects.
    
    This is a minor optimisation when working with filesystems, but significant when working with object stores.
    
    ## How was this patch tested?
    
    Tests included. Existing tests pass.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ScrapCodes/spark PR_17745

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22339.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22339
    
----
commit 2fba9af597349fc023e04a845d1cfacfc3ab7d9e
Author: Steve Loughran <st...@...>
Date:   2017-04-24T13:04:04Z

    SPARK-17159 Significant speed up for running spark streaming against Object store.
    
    Based on #17745. Original work by Steve Loughran.
    
    This is a minimal patch of changes to FileInputDStream to reduce File status requests when querying files.
    
    This is a minor optimisation when working with filesystems, but significant when working with object stores.
    
    Change-Id: I269d98902f615818941c88de93a124c65453756e

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Hi, @ScrapCodes. Could you do the following?
    - Update the title to `[SPARK-17159][SS]...`
    - Remove `Please review http://spark.apache.org/contributing.html ....` from PR description
    - Share the numbers because the PR title has `Significant speed up`


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    No, no cost penalties. Slightly lower namenode load too. If you had many, many spark streaming clients scanning directories, HDFS ops teams would eventually get upset. This will postpone the day.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96047/
    Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged to master


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22339: [SPARK-17159][STREAM] Significant speed up for ru...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22339#discussion_r221322224
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           logDebug(s"Getting new files for time $currentTime, " +
             s"ignoring files older than $modTimeIgnoreThreshold")
     
    -      val newFileFilter = new PathFilter {
    -        def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    -      }
    -      val directoryFilter = new PathFilter {
    -        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    -      }
    -      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    +      val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
    +          .filter(_.isDirectory)
    +          .map(_.getPath)
           val newFiles = directories.flatMap(dir =>
    -        fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
    +        fs.listStatus(dir)
    +            .filter(isNewFile(_, currentTime, modTimeIgnoreThreshold))
    +            .map(_.getPath.toString))
           val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    -      logInfo("Finding new files took " + timeTaken + " ms")
    -      logDebug("# cached file times = " + fileToModTime.size)
    +      logInfo(s"Finding new files took $timeTaken ms")
    --- End diff --
    
    depends on how big it grows over time


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96885/testReport)** for PR 22339 at commit [`542872c`](https://github.com/apache/spark/commit/542872cb5459fae1a66ee45aa193986e9a37fb96).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3647/
    Test PASSed.


---



[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #95706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95706/testReport)** for PR 22339 at commit [`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Why the speedups? They come from the glob filter calling getFileStatus() on every entry, which is 1-3 HTTP requests and a few hundred millis per call, when instead that can be handled later. As a result, the more files you have in a path, the more time the scan takes, until eventually the scan time > window interval, at which point your code is dead.
    
    The other stuff is simply associated optimisations.
    
    Now, I'm obviously happy with this, especially as I seem to be getting credit. And it will help speed up working with any store. But I need to warn people: it is not sufficient.
    
    The key problem here is: files uploaded by S3 multipart upload get a timestamp of when the upload began, not when it finished, yet they only become visible at the end of the upload. If a caller starts an upload in window t and doesn't complete it until window t+1, then it may get missed.
    
    There's not much which can be done here, except in documenting the risk.
    
    What is a good solution? It'd be to use the cloud infrastructure provider's own event notification mechanism and subscribe to changes in a store. AWS, Azure and GCS all offer something like this.
    
    There's a home for the S3 one of those in aws-kinesis, perhaps
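
The multipart upload risk described in this comment can be sketched with a toy model. This is not Spark's code: `StoredFile`, `scan`, and the selection rule (a file is picked when it is visible, its timestamp falls inside the remembered window, and it has not been picked before) are simplifying assumptions made only for illustration.

```scala
// Toy model of timestamp-based new-file detection. Not Spark's actual code.
object MultipartUploadMiss {
  // modTime: timestamp the store reports; visibleAt: when a listing first shows the file.
  final case class StoredFile(name: String, modTime: Long, visibleAt: Long)

  // Simplified selection rule (an assumption): at each batch time, pick files that are
  // visible, whose modTime lies in (batchTime - rememberWindow, batchTime], and that
  // were not picked by an earlier batch.
  def scan(files: Seq[StoredFile], batchTimes: Seq[Long], rememberWindow: Long): Set[String] =
    batchTimes.foldLeft(Set.empty[String]) { (selected, batchTime) =>
      val threshold = batchTime - rememberWindow
      val fresh = files.filter { f =>
        f.visibleAt <= batchTime &&
        f.modTime > threshold &&
        f.modTime <= batchTime &&
        !selected.contains(f.name)
      }
      selected ++ fresh.map(_.name)
    }

  val files: Seq[StoredFile] = Seq(
    // Written in one request: timestamp and visibility coincide.
    StoredFile("small-put", modTime = 25L, visibleAt = 25L),
    // Multipart upload: timestamp is the start (5), but it only appears at 25.
    StoredFile("multipart", modTime = 5L, visibleAt = 25L)
  )

  // Batches at 10, 20 and 30 with a remembered window of 10 time units.
  val picked: Set[String] = scan(files, batchTimes = Seq(10L, 20L, 30L), rememberWindow = 10L)
}
```

In this model the batch at time 30 is the first one that can see either file, and by then the multipart file's timestamp (5) is already older than the threshold (20), so only `small-put` is picked. A longer remembered window narrows the gap but does not remove it for uploads that outlast the window.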


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96886/testReport)** for PR 22339 at commit [`dab9bf3`](https://github.com/apache/spark/commit/dab9bf3771994989e5de2857f91d117dc8b74623).


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #95706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95706/testReport)** for PR 22339 at commit [`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22339: [SPARK-17159][STREAM] Significant speed up for ru...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22339#discussion_r221326673
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           logDebug(s"Getting new files for time $currentTime, " +
             s"ignoring files older than $modTimeIgnoreThreshold")
     
    -      val newFileFilter = new PathFilter {
    -        def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    -      }
    -      val directoryFilter = new PathFilter {
    -        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    -      }
    -      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    +      val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
    +          .filter(_.isDirectory)
    +          .map(_.getPath)
           val newFiles = directories.flatMap(dir =>
    -        fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
    +        fs.listStatus(dir)
    +            .filter(isNewFile(_, currentTime, modTimeIgnoreThreshold))
    +            .map(_.getPath.toString))
           val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    -      logInfo("Finding new files took " + timeTaken + " ms")
    -      logDebug("# cached file times = " + fileToModTime.size)
    +      logInfo(s"Finding new files took $timeTaken ms")
    --- End diff --
    
    It was originally @ info, so if it filled up logs *too much* there'd be complaints. What's important is that the time to scan is printed, either @ info or debug, so someone can see what's happening. Probably what does need logging @ warn is when the time to scan is greater than the window, or just getting close to it.
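
The warn-when-close-to-the-window idea in this comment could look roughly like the sketch below. Nothing here is part of the patch: `ScanTimeLogging`, `levelFor`, and the `warnFraction` default of 0.8 are hypothetical names and values chosen for illustration.

```scala
// Hypothetical helper: choose a log level from the scan time and the batch window.
object ScanTimeLogging {
  sealed trait Level
  case object Info extends Level
  case object Warn extends Level

  // warnFraction is an assumed tuning knob: warn once the scan uses at least this
  // share of the window, which also covers scans that exceed the window outright.
  def levelFor(timeTakenMs: Long, windowMs: Long, warnFraction: Double = 0.8): Level =
    if (timeTakenMs >= windowMs * warnFraction) Warn else Info

  // Message text modelled on the log line in the diff, with the window added for context.
  def message(timeTakenMs: Long, windowMs: Long): String =
    s"Finding new files took $timeTakenMs ms (window is $windowMs ms)"
}
```

A caller would compute `timeTaken` as in the diff, then emit `message(...)` at whichever level `levelFor(...)` returns, so routine scans stay at info while slow scans surface as warnings.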


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96887 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96887/testReport)** for PR 22339 at commit [`d91c815`](https://github.com/apache/spark/commit/d91c815774bff070bdb3cb149678ff080bc06b45).


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22339: [SPARK-17159][STREAM] Significant speed up for ru...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22339


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Hi @srowen, would you like to take a look? Is there anything I can do, if this patch is missing something? I have tested it thoroughly against an object store.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96885/testReport)** for PR 22339 at commit [`542872c`](https://github.com/apache/spark/commit/542872cb5459fae1a66ee45aa193986e9a37fb96).


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96886/
    Test FAILed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96843/
    Test FAILed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Thank you @srowen and @steveloughran.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96885/
    Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2867/
    Test PASSed.


---



[GitHub] spark pull request #22339: [SPARK-17159][STREAM] Significant speed up for ru...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22339#discussion_r221267633
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           logDebug(s"Getting new files for time $currentTime, " +
             s"ignoring files older than $modTimeIgnoreThreshold")
     
    -      val newFileFilter = new PathFilter {
    -        def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    -      }
    -      val directoryFilter = new PathFilter {
    -        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    -      }
    -      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    +      val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
    --- End diff --
    
    I guess the `.getOrElse` could come at the end, but it hardly matters.
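
The two orderings discussed here can be compared in a small self-contained sketch. `Status` is a stand-in for Hadoop's `FileStatus` with only the two fields needed, and both function names are made up for illustration; the point is only that wrapping a possibly-null array in `Option` gives the same result whether the default is applied first or last.

```scala
// Sketch of null-safe handling of a glob result that may be null.
object NullSafeGlob {
  // Stand-in for Hadoop's FileStatus; only the fields this sketch needs.
  final case class Status(path: String, isDirectory: Boolean)

  // Ordering used in the diff: substitute an empty array first, then filter and map.
  def dirsDefaultFirst(globResult: Array[Status]): Array[String] =
    Option(globResult).getOrElse(Array.empty[Status])
      .filter(_.isDirectory)
      .map(_.path)

  // Ordering suggested in review: transform inside the Option, default at the end.
  def dirsDefaultLast(globResult: Array[Status]): Array[String] =
    Option(globResult)
      .map(_.filter(_.isDirectory).map(_.path))
      .getOrElse(Array.empty[String])
}
```

Both versions return an empty array for a `null` input and the directory paths otherwise, which matches the remark that the choice hardly matters.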


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96887/
    Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Retest this please.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    For numbers: while testing against an object store with 50 files/dirs, it took 130 REST requests for 2 batches to complete without this patch and 56 REST requests with it. So the number of REST calls is reduced, and this translates to a speedup. How much speedup depends on the number of files, but for the particular test I ran, it was 2x.
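
Why fewer status lookups means fewer REST calls can be illustrated with a toy counting model. This is a sketch under stated assumptions, not a reproduction of the measurement above: `CountingStore` is a made-up in-memory store that charges one simulated request per listing and one per status lookup, whereas a real object store may need several HTTP requests for each.

```scala
// Toy model: count simulated remote calls for two ways of finding new files.
object RequestCountModel {
  final case class Entry(path: String, isDirectory: Boolean, modTime: Long)

  // Assumed cost model: every list() and every status() is one simulated request.
  final class CountingStore(entries: Seq[Entry]) {
    var requests: Int = 0
    def list(): Seq[Entry] = { requests += 1; entries }
    def status(path: String): Option[Entry] = { requests += 1; entries.find(_.path == path) }
  }

  // Old shape: list the paths, then ask the store again for the status of each one.
  def newFilesWithStatusCalls(store: CountingStore, threshold: Long): Seq[String] =
    store.list().map(_.path).filter { p =>
      store.status(p).exists(e => !e.isDirectory && e.modTime > threshold)
    }

  // New shape: filter on the status objects the listing already returned.
  def newFilesFromListing(store: CountingStore, threshold: Long): Seq[String] =
    store.list().filter(e => !e.isDirectory && e.modTime > threshold).map(_.path)
}
```

With 50 entries, the first function issues 51 simulated requests in this model (one listing plus one lookup per entry) and the second issues 1, while both return the same files. The absolute figures are artefacts of the toy cost model and should not be read as the measured numbers.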


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3648/
    Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96886/testReport)** for PR 22339 at commit [`dab9bf3`](https://github.com/apache/spark/commit/dab9bf3771994989e5de2857f91d117dc8b74623).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22339: [SPARK-17159][STREAM] Significant speed up for ru...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22339#discussion_r221267480
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           logDebug(s"Getting new files for time $currentTime, " +
             s"ignoring files older than $modTimeIgnoreThreshold")
     
    -      val newFileFilter = new PathFilter {
    -        def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    -      }
    -      val directoryFilter = new PathFilter {
    -        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    -      }
    -      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    +      val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
    +          .filter(_.isDirectory)
    +          .map(_.getPath)
           val newFiles = directories.flatMap(dir =>
    -        fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
    +        fs.listStatus(dir)
    +            .filter(isNewFile(_, currentTime, modTimeIgnoreThreshold))
    --- End diff --
    
    Nit: I think the indent is too deep here? 


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96843/testReport)** for PR 22339 at commit [`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96047 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96047/testReport)** for PR 22339 at commit [`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3093/
    Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96887 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96887/testReport)** for PR 22339 at commit [`d91c815`](https://github.com/apache/spark/commit/d91c815774bff070bdb3cb149678ff080bc06b45).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Yeah I agree, I was saying I do think it will speed things up. If it's a non-trivial win it's worthwhile even if it isn't the last optimization here. Is there any downside to this?


---



[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3616/


---



[GitHub] spark pull request #22339: [SPARK-17159][STREAM] Significant speed up for ru...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22339#discussion_r221267570
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           logDebug(s"Getting new files for time $currentTime, " +
             s"ignoring files older than $modTimeIgnoreThreshold")
     
    -      val newFileFilter = new PathFilter {
    -        def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    -      }
    -      val directoryFilter = new PathFilter {
    -        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    -      }
    -      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    +      val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
    +          .filter(_.isDirectory)
    +          .map(_.getPath)
           val newFiles = directories.flatMap(dir =>
    -        fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
    +        fs.listStatus(dir)
    +            .filter(isNewFile(_, currentTime, modTimeIgnoreThreshold))
    +            .map(_.getPath.toString))
           val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    -      logInfo("Finding new files took " + timeTaken + " ms")
    -      logDebug("# cached file times = " + fileToModTime.size)
    +      logInfo(s"Finding new files took $timeTaken ms")
    --- End diff --
    
    I wonder if this should be a debug statement. I don't feel strongly about it.
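
The listing pattern in the diff above can be tried outside Spark with a minimal sketch like the following. It assumes Hadoop's client library is on the classpath. `ListNewFilesSketch`, `findNewFiles` and `isRecent` are hypothetical names used for illustration only, and `isRecent` is a simplified stand-in rather than the PR's actual `isNewFile` logic. The point it shows is that `globStatus` and `listStatus` already return `FileStatus` objects, so the directory check and the modification-time check need no further `getFileStatus` call per path.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object ListNewFilesSketch {

  // Hypothetical stand-in for the PR's isNewFile: accept plain files whose
  // modification time falls in the window (ignoreBefore, now].
  def isRecent(status: FileStatus, ignoreBefore: Long, now: Long): Boolean = {
    val modTime = status.getModificationTime
    status.isFile && modTime > ignoreBefore && modTime <= now
  }

  // Expand the glob, keep only directories, list each one, and filter on the
  // FileStatus values returned by the listing itself.
  def findNewFiles(fs: FileSystem, glob: Path, ignoreBefore: Long, now: Long): Array[String] = {
    // globStatus can return null when nothing matches, hence the Option wrapper.
    val directories = Option(fs.globStatus(glob)).getOrElse(Array.empty[FileStatus])
      .filter(_.isDirectory)
      .map(_.getPath)
    directories.flatMap { dir =>
      fs.listStatus(dir)
        .filter(isRecent(_, ignoreBefore, now))
        .map(_.getPath.toString)
    }
  }

  def main(args: Array[String]): Unit = {
    // Uses the local filesystem so the sketch runs without an object store.
    val fs = FileSystem.getLocal(new Configuration())
    val now = System.currentTimeMillis()
    // args(0) is a glob such as /tmp/streaming/*; the window is the last 60 seconds.
    findNewFiles(fs, new Path(args(0)), now - 60000L, now).foreach(println)
  }
}
```

On a local filesystem the saving is small; against an object store each avoided status lookup is a set of remote HTTP requests, which is the motivation stated in the pull request description.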


---



[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95706/


---



[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Retest this please.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3649/


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96843/testReport)** for PR 22339 at commit [`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22339
  
    **[Test build #96047 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96047/testReport)** for PR 22339 at commit [`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).


---
