You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2016/04/04 18:06:23 UTC

[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0][test-h...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/12153

    [SPARK-14369][SQL][WIP][test-hadoop2.0][test-hadoop2.2][test-hadoop2.3] Locality support for FileScanRDD

    ## What changes were proposed in this pull request?
    
    This PR adds preliminary locality support for `FileScanRDD` by querying block locations of all file blocks contained in each `FilePartition`.
    
    Note:
    
    - Rack information is not considered yet.
    - S3 is special cased since S3 doesn't provide valid locality information and its file metadata operation latency can be super high.
    
    ## How was this patch tested?
    
    Not finished yet. Trying mocked `FileSystem` to return given list of block locations. Opening the PR first to ensure it doesn't break other parts and works for all supported Hadoop versions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14369-locality

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12153.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12153
    
----
commit 179ff5898cf1aa2acc52514e1703b0534b5b6d7c
Author: Cheng Lian <li...@databricks.com>
Date:   2016-04-04T15:33:18Z

    Locality support for FileScanRDD

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205439836
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0][test-h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205370186
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206417809
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210092607
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0][test-h...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205370190
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54859/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59057227
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
               val splitFiles = selectedPartitions.flatMap { partition =>
                 partition.files.flatMap { file =>
    +              val blockLocations = getBlockLocations(file)
                   (0L to file.getLen by maxSplitBytes).map { offset =>
                     val remaining = file.getLen - offset
                     val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    -                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size)
    +                // Finds out a list of location hosts where lives the first block that contains the
    +                // starting point the file block starts at `offset` with length `size`.
    +                val hosts = getBlockHosts(blockLocations, offset)
    +                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts)
    --- End diff --
    
    This is a really good pointer, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206446592
  
    **[Test build #55120 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55120/consoleFull)** for PR 12153 at commit [`0c3882f`](https://github.com/apache/spark/commit/0c3882fddb00aa2edb01e0db6f7b16cfb7952a33).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: e[SPARK-14369][SQL][WIP][test-hadoop2.2] Local...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205774157
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210192944
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55834/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.2] Locali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205834944
  
    **[Test build #54988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54988/consoleFull)** for PR 12153 at commit [`a1f527a`](https://github.com/apache/spark/commit/a1f527aa2c91776085818c360f9ea3d0a8d5d616).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0][test-h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205369768
  
    **[Test build #54859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54859/consoleFull)** for PR 12153 at commit [`179ff58`](https://github.com/apache/spark/commit/179ff5898cf1aa2acc52514e1703b0534b5b6d7c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205428968
  
    @JoshRosen Oh, thanks for reminding. Then I'll test different Hadoop versions manually by changing PR titles.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: e[SPARK-14369][SQL][WIP][test-hadoop2.3] Local...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205574553
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205442396
  
    **[Test build #54873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54873/consoleFull)** for PR 12153 at commit [`e846fd1`](https://github.com/apache/spark/commit/e846fd132f057e283a51cfcc3632f936474b46f1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212462174
  
    Build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58587493
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -620,17 +621,28 @@ class HDFSFileCatalog(
         if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
           HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
         } else {
    -      val statuses = paths.flatMap { path =>
    +      val statuses: Seq[FileStatus] = paths.flatMap { path =>
             val fs = path.getFileSystem(hadoopConf)
             logInfo(s"Listing $path on driver")
             // Dummy jobconf to get to the pathFilter defined in configuration
    -        val jobConf = new JobConf(hadoopConf, this.getClass())
    +        val jobConf = new JobConf(hadoopConf, this.getClass)
             val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
    -        if (pathFilter != null) {
    +        val statuses = if (pathFilter != null) {
               Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty)
             } else {
               Try(fs.listStatus(path)).getOrElse(Array.empty)
             }
    +
    +        fs match {
    +          case _: S3FileSystem =>
    +            // S3 doesn't provide locality information.
    +            statuses.toSeq
    +
    +          case _ =>
    +            statuses.map { f =>
    +              new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
    --- End diff --
    
    At here, let's check if this file status is already a LocatedFileStatus.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-209578767
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210044469
  
    **[Test build #55813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55813/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0][test-h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205403589
  
    **[Test build #54866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54866/consoleFull)** for PR 12153 at commit [`9a68989`](https://github.com/apache/spark/commit/9a689890b32d1c11c401a5aa7b0ff77c0b377c21).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205476802
  
    **[Test build #54873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54873/consoleFull)** for PR 12153 at commit [`e846fd1`](https://github.com/apache/spark/commit/e846fd132f057e283a51cfcc3632f936474b46f1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59595284
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
         case _ => Nil
       }
    +
    +  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match {
    +    case f: LocatedFileStatus => f.getBlockLocations
    +    case f => Array.empty[BlockLocation]
    +  }
    +
    +  // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of
    +  // the same file, find out the index of the block that contains this `offset`. If no such block
    +  // can be found, returns -1.
    +  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = {
    --- End diff --
    
    This one is a little bit tricky, would like to keep it here for readability.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205439453
  
    **[Test build #54866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54866/consoleFull)** for PR 12153 at commit [`9a68989`](https://github.com/apache/spark/commit/9a689890b32d1c11c401a5aa7b0ff77c0b377c21).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59730560
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -624,20 +624,40 @@ class HDFSFileCatalog(
     
       def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path)
     
    +  private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus])
    --- End diff --
    
    Why do we need this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: e[SPARK-14369][SQL][WIP][test-hadoop2.3] Local...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205546965
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211721313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56158/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-208456411
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55529/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59733621
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---
    @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
           df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t")
         }
       }
    +
    +  test("Locality support for FileScanRDD") {
    --- End diff --
    
    How about we add the following test cases?
    
    * Every partition of the FileScanRDD has a single file.
    * There are some large files. So, there are some partitions and every such a partition has a part of a file.
    * If we are going to pick the top K locations that host the largest amount of data, let's also add tests.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211721217
  
    **[Test build #56158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56158/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.2] Locali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205816083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54981/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210093794
  
    **[Test build #55834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55834/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206415437
  
    **[Test build #55118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55118/consoleFull)** for PR 12153 at commit [`4004501`](https://github.com/apache/spark/commit/40045015bbfb9f4e0699534271565efb458b73c9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212424785
  
    All the timeout in the Jenkins builds were due to a deadlock in `DAGScheduler`, and can be steadily reproduced locally by running the following test case
    
    > BucketedReadSuite.only shuffle one side when 2 bucketed tables have different bucket keys.
    
    This test case creates two bucketed tables both with 8 buckets and then joins them. Reducing 8 to 5 eliminates the deadlock. But I haven't figured out the real reason behind the deadlock. The deadlock also disappears if I remove FileScanRDD.preferredLocations(). Maybe that too many tasks are scheduled to the same place and exhausted some thread-pool?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58611815
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -620,17 +621,28 @@ class HDFSFileCatalog(
         if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
           HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
         } else {
    -      val statuses = paths.flatMap { path =>
    +      val statuses: Seq[FileStatus] = paths.flatMap { path =>
             val fs = path.getFileSystem(hadoopConf)
             logInfo(s"Listing $path on driver")
             // Dummy jobconf to get to the pathFilter defined in configuration
    -        val jobConf = new JobConf(hadoopConf, this.getClass())
    +        val jobConf = new JobConf(hadoopConf, this.getClass)
             val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
    -        if (pathFilter != null) {
    +        val statuses = if (pathFilter != null) {
               Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty)
             } else {
               Try(fs.listStatus(path)).getOrElse(Array.empty)
             }
    +
    +        fs match {
    +          case _: S3FileSystem =>
    --- End diff --
    
    Actually, do we need to have this pattern matching? Looks like S3 filesystems just use the default implementation of getFileBlockLocations, which returns `localhost:50010`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59603298
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
         case _ => Nil
       }
    +
    +  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match {
    +    case f: LocatedFileStatus => f.getBlockLocations
    +    case f => Array.empty[BlockLocation]
    +  }
    +
    +  // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of
    +  // the same file, find out the index of the block that contains this `offset`. If no such block
    +  // can be found, returns -1.
    +  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = {
    +    blockLocations.indexWhere { b =>
    +      b.getOffset <= offset && offset < b.getOffset + b.getLength
    +    }
    +  }
    +
    +  // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of
    +  // the same file, find out the block that contains this `offset`, and returns location hosts of
    +  // that block. If no such block can be found, returns an empty array.
    +  private def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = {
    --- End diff --
    
    Should we pick the hosts based on range that the host hold most of data in that range?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59603636
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
               val splitFiles = selectedPartitions.flatMap { partition =>
                 partition.files.flatMap { file =>
    +              val blockLocations = getBlockLocations(file)
                   (0L to file.getLen by maxSplitBytes).map { offset =>
                     val remaining = file.getLen - offset
                     val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    -                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size)
    +                // Finds out a list of location hosts where lives the first block that contains the
    +                // starting point the file block starts at `offset` with length `size`.
    +                val hosts = getBlockHosts(blockLocations, offset)
    +                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts)
    --- End diff --
    
    We could do that in a separate PR (improve the locality with bounded complicity).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-208456408
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205439841
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54866/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210042553
  
    **[Test build #55812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55812/consoleFull)** for PR 12153 at commit [`c372a90`](https://github.com/apache/spark/commit/c372a90f2fc03e116f095c7cd1dce534b47104cc).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210192942
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212101095
  
    **[Test build #56234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56234/consoleFull)** for PR 12153 at commit [`86f1195`](https://github.com/apache/spark/commit/86f1195c3f114aaf3a01f9043302379c09f0d755).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class FakeBlockLocation(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211594306
  
    **[Test build #56090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56090/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206417192
  
    **[Test build #55120 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55120/consoleFull)** for PR 12153 at commit [`0c3882f`](https://github.com/apache/spark/commit/0c3882fddb00aa2edb01e0db6f7b16cfb7952a33).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206974141
  
    **[Test build #55225 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55225/consoleFull)** for PR 12153 at commit [`17aad3e`](https://github.com/apache/spark/commit/17aad3e7debdf6c163ab5cdd76df25794a41bbca).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206446835
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211446073
  
    Seems that Jenkins is down...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206417814
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55118/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210044527
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-209916891
  
    **[Test build #55813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55813/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205477210
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206417771
  
    **[Test build #55118 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55118/consoleFull)** for PR 12153 at commit [`4004501`](https://github.com/apache/spark/commit/40045015bbfb9f4e0699534271565efb458b73c9).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58720692
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,10 +140,12 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
               val splitFiles = selectedPartitions.flatMap { partition =>
                 partition.files.flatMap { file =>
    +              val blockLocations = getBlockLocations(file)
                   (0L to file.getLen by maxSplitBytes).map { offset =>
                     val remaining = file.getLen - offset
                     val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    -                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size)
    +                val hosts = getBlockHosts(blockLocations, offset)
    +                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts)
    --- End diff --
    
    Added.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-207030472
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-207030280
  
    **[Test build #55228 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55228/consoleFull)** for PR 12153 at commit [`3a0e2f7`](https://github.com/apache/spark/commit/3a0e2f71e9ea147083d4d26167fc58ce0b3323cb).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-209578718
  
    **[Test build #55713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55713/consoleFull)** for PR 12153 at commit [`cef40bf`](https://github.com/apache/spark/commit/cef40bfb940a24e43ac01e4c521be6de51f5662d).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.2] Locali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205776334
  
    **[Test build #54981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54981/consoleFull)** for PR 12153 at commit [`e846fd1`](https://github.com/apache/spark/commit/e846fd132f057e283a51cfcc3632f936474b46f1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-207030473
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55228/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211487877
  
    **[Test build #56090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56090/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206974448
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-209915502
  
    **[Test build #55812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55812/consoleFull)** for PR 12153 at commit [`c372a90`](https://github.com/apache/spark/commit/c372a90f2fc03e116f095c7cd1dce534b47104cc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206974456
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55225/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210042693
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58613895
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,10 +140,12 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
               val splitFiles = selectedPartitions.flatMap { partition =>
                 partition.files.flatMap { file =>
    +              val blockLocations = getBlockLocations(file)
                   (0L to file.getLen by maxSplitBytes).map { offset =>
                     val remaining = file.getLen - offset
                     val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    -                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size)
    +                val hosts = getBlockHosts(blockLocations, offset)
    +                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts)
    --- End diff --
    
    How about we add document to this FileSourceStrategy to explain our strategy? So, readers can quickly understand what we are doing at here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211637249
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58909434
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
               val splitFiles = selectedPartitions.flatMap { partition =>
                 partition.files.flatMap { file =>
    +              val blockLocations = getBlockLocations(file)
                   (0L to file.getLen by maxSplitBytes).map { offset =>
                     val remaining = file.getLen - offset
                     val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    -                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size)
    +                // Finds out a list of location hosts where lives the first block that contains the
    +                // starting point the file block starts at `offset` with length `size`.
    +                val hosts = getBlockHosts(blockLocations, offset)
    +                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts)
    --- End diff --
    
    Each `FilePartition` derives their own preferred location in `FileScanRDD.preferredLocations()` from location hosts of all its `PartitionedFile`s.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211594408
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-208456399
  
    **[Test build #55529 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55529/consoleFull)** for PR 12153 at commit [`739420b`](https://github.com/apache/spark/commit/739420b20b4cd398b4b43f3f5a3db36f4023cdb3).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211988625
  
    **[Test build #56234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56234/consoleFull)** for PR 12153 at commit [`86f1195`](https://github.com/apache/spark/commit/86f1195c3f114aaf3a01f9043302379c09f0d755).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210042698
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55812/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212043186
  
    Seems it is hanging in BucketedReadSuite again... Let's look at it together and figure out what's wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210044529
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55813/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-209464750
  
    **[Test build #55713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55713/consoleFull)** for PR 12153 at commit [`cef40bf`](https://github.com/apache/spark/commit/cef40bfb940a24e43ac01e4c521be6de51f5662d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210521600
  
    @liancheng Add more tests?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211721311
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59595123
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---
    @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
           df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t")
         }
       }
    +
    +  test("Locality support for FileScanRDD") {
    +    withHadoopConf(
    +      "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
    +      "fs.file.impl.disable.cache" -> "true"
    +    ) {
    +      withTempPath { dir =>
    +        val path = "file://" + dir.getCanonicalPath
    +        val df1 = sqlContext.range(4)
    +        df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
    +        df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
    +
    +        val df2 = sqlContext.read
    +          .format(dataSourceName)
    +          .option("dataSchema", df1.schema.json)
    +          .load(path)
    +
    +        val Some(fileScanRDD) = {
    --- End diff --
    
    Yea, this is better, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58720664
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -620,17 +621,28 @@ class HDFSFileCatalog(
         if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
           HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
         } else {
    -      val statuses = paths.flatMap { path =>
    +      val statuses: Seq[FileStatus] = paths.flatMap { path =>
             val fs = path.getFileSystem(hadoopConf)
             logInfo(s"Listing $path on driver")
             // Dummy jobconf to get to the pathFilter defined in configuration
    -        val jobConf = new JobConf(hadoopConf, this.getClass())
    +        val jobConf = new JobConf(hadoopConf, this.getClass)
             val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
    -        if (pathFilter != null) {
    +        val statuses = if (pathFilter != null) {
               Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty)
             } else {
               Try(fs.listStatus(path)).getOrElse(Array.empty)
             }
    +
    +        fs match {
    +          case _: S3FileSystem =>
    +            // S3 doesn't provide locality information.
    +            statuses.toSeq
    +
    +          case _ =>
    +            statuses.map { f =>
    +              new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
    --- End diff --
    
    Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205896642
  
    **[Test build #54998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54998/consoleFull)** for PR 12153 at commit [`5fef611`](https://github.com/apache/spark/commit/5fef61170f5b63845f8b5ee72fb3413e1ce4477d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58720668
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -620,17 +621,28 @@ class HDFSFileCatalog(
         if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
           HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
         } else {
    -      val statuses = paths.flatMap { path =>
    +      val statuses: Seq[FileStatus] = paths.flatMap { path =>
             val fs = path.getFileSystem(hadoopConf)
             logInfo(s"Listing $path on driver")
             // Dummy jobconf to get to the pathFilter defined in configuration
    -        val jobConf = new JobConf(hadoopConf, this.getClass())
    +        val jobConf = new JobConf(hadoopConf, this.getClass)
             val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
    -        if (pathFilter != null) {
    +        val statuses = if (pathFilter != null) {
               Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty)
             } else {
               Try(fs.listStatus(path)).getOrElse(Array.empty)
             }
    +
    +        fs match {
    +          case _: S3FileSystem =>
    --- End diff --
    
    Thanks, made corresponding changes and added comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58917033
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
               val splitFiles = selectedPartitions.flatMap { partition =>
                 partition.files.flatMap { file =>
    +              val blockLocations = getBlockLocations(file)
                   (0L to file.getLen by maxSplitBytes).map { offset =>
                     val remaining = file.getLen - offset
                     val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    -                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size)
    +                // Finds out a list of location hosts where lives the first block that contains the
    +                // starting point the file block starts at `offset` with length `size`.
    +                val hosts = getBlockHosts(blockLocations, offset)
    +                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts)
    --- End diff --
    
    This is not likely to generate good locality right?
    
    Is it reasonable to repurpose this?
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L124


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205929293
  
    **[Test build #54998 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54998/consoleFull)** for PR 12153 at commit [`5fef611`](https://github.com/apache/spark/commit/5fef61170f5b63845f8b5ee72fb3413e1ce4477d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class LocalityTestFileSystem extends RawLocalFileSystem `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.2] Locali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205816076
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58720699
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -205,4 +210,20 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
         case _ => Nil
       }
    +
    +  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match {
    +    case f: LocatedFileStatus => f.getBlockLocations
    +    case f => Array.empty[BlockLocation]
    +  }
    +
    +  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = {
    --- End diff --
    
    Added comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59766576
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -624,20 +624,40 @@ class HDFSFileCatalog(
     
       def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path)
     
    +  private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus])
    --- End diff --
    
    It's used [here][1]. `RemoteIterator` doesn't extend from `Iterator`. This implicit conversion helps to convert contents behind a `RemoteIterator` into an array.
    
    [1]: https://github.com/apache/spark/pull/12153/files#diff-40c347747af9101e7e9fee52fc4120b8R656


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-209578771
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55713/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59577383
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---
    @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
           df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t")
         }
       }
    +
    +  test("Locality support for FileScanRDD") {
    +    withHadoopConf(
    +      "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
    +      "fs.file.impl.disable.cache" -> "true"
    +    ) {
    +      withTempPath { dir =>
    +        val path = "file://" + dir.getCanonicalPath
    +        val df1 = sqlContext.range(4)
    +        df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
    +        df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
    +
    +        val df2 = sqlContext.read
    +          .format(dataSourceName)
    +          .option("dataSchema", df1.schema.json)
    +          .load(path)
    +
    +        val Some(fileScanRDD) = {
    +          def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = {
    +            rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG)
    +          }
    +
    +          // We have to search for the `FileScanRDD` along the RDD DAG since
    +          // RDD.preferredLocations doesn't propagate along the DAG to the root RDD.
    +          allRDDsInDAG(df2.rdd).collectFirst {
    +            case f: FileScanRDD => f
    +          }
    +        }
    +
    +        val partitions = fileScanRDD.partitions
    --- End diff --
    
    do we always return 2 partitions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-210192829
  
    **[Test build #55834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55834/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0][test-h...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205427208
  
    FYI, I don't think you can necessarily put multiple `test-hadoop` commands in the same PR title and have it run multiple tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206446838
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55120/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59303595
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -86,4 +87,20 @@ class FileScanRDD(
       }
     
       override protected def getPartitions: Array[Partition] = filePartitions.toArray
    +
    +  override protected def getPreferredLocations(split: Partition): Seq[String] = {
    +    val files = split.asInstanceOf[FilePartition].files
    +    val partitionSize = files.map(_.length).sum
    --- End diff --
    
    how about:
    ```
    val hostToLength = Map[String, Long]()
    for (file <- files) {
      for (host <- file.hosts) {
        if (hostToLength.contains(host)) {
          hostToLength(host) += file.length
        } else {
          hostToLength(host) = file.length
        }
      }
      hostToLength.toSeq.sortBy(_._2).take(3).map(_._1)
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205878760
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58906143
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
               val splitFiles = selectedPartitions.flatMap { partition =>
                 partition.files.flatMap { file =>
    +              val blockLocations = getBlockLocations(file)
                   (0L to file.getLen by maxSplitBytes).map { offset =>
                     val remaining = file.getLen - offset
                     val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    -                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size)
    +                // Finds out a list of location hosts where lives the first block that contains the
    +                // starting point the file block starts at `offset` with length `size`.
    +                val hosts = getBlockHosts(blockLocations, offset)
    +                PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts)
    --- End diff --
    
    How is this locality information maintained when we coalesce in the logic below?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r60174480
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -621,20 +621,40 @@ class HDFSFileCatalog(
     
       def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path)
     
    +  private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus])
    +    extends Iterator[LocatedFileStatus] {
    +
    +    override def hasNext: Boolean = iterator.hasNext
    +
    +    override def next(): LocatedFileStatus = iterator.next()
    +  }
    +
       private def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
         if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
           HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
    --- End diff --
    
    Let's also have a test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205878364
  
    **[Test build #54988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54988/consoleFull)** for PR 12153 at commit [`a1f527a`](https://github.com/apache/spark/commit/a1f527aa2c91776085818c360f9ea3d0a8d5d616).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.2] Locali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205815335
  
    **[Test build #54981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54981/consoleFull)** for PR 12153 at commit [`e846fd1`](https://github.com/apache/spark/commit/e846fd132f057e283a51cfcc3632f936474b46f1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59707891
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -86,4 +87,9 @@ class FileScanRDD(
       }
     
       override protected def getPartitions: Array[Partition] = filePartitions.toArray
    +
    +  override protected def getPreferredLocations(split: Partition): Seq[String] = {
    +    val files = split.asInstanceOf[FilePartition].files
    +    if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations
    --- End diff --
    
    Updated. @cloud-fan also suggested the same strategy. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212101258
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59594996
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---
    @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
           df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t")
         }
       }
    +
    +  test("Locality support for FileScanRDD") {
    +    withHadoopConf(
    +      "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
    +      "fs.file.impl.disable.cache" -> "true"
    +    ) {
    +      withTempPath { dir =>
    +        val path = "file://" + dir.getCanonicalPath
    +        val df1 = sqlContext.range(4)
    +        df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
    +        df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
    +
    +        val df2 = sqlContext.read
    +          .format(dataSourceName)
    +          .option("dataSchema", df1.schema.json)
    +          .load(path)
    +
    +        val Some(fileScanRDD) = {
    +          def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = {
    +            rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG)
    +          }
    +
    +          // We have to search for the `FileScanRDD` along the RDD DAG since
    +          // RDD.preferredLocations doesn't propagate along the DAG to the root RDD.
    +          allRDDsInDAG(df2.rdd).collectFirst {
    +            case f: FileScanRDD => f
    +          }
    +        }
    +
    +        val partitions = fileScanRDD.partitions
    --- End diff --
    
    Not necessarily, and probably only 1, since both files written are pretty small and will be merged into a single partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0][test-h...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205370177
  
    **[Test build #54859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54859/consoleFull)** for PR 12153 at commit [`179ff58`](https://github.com/apache/spark/commit/179ff5898cf1aa2acc52514e1703b0534b5b6d7c).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205929503
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205929511
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54998/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng closed the pull request at:

    https://github.com/apache/spark/pull/12153


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205878765
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54988/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212101265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56234/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212424940
  
    I'm closing this one for #12527, which is a rebased version of this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.0] Locali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205477212
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54873/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58587457
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -620,17 +621,28 @@ class HDFSFileCatalog(
         if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
           HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
         } else {
    -      val statuses = paths.flatMap { path =>
    +      val statuses: Seq[FileStatus] = paths.flatMap { path =>
             val fs = path.getFileSystem(hadoopConf)
             logInfo(s"Listing $path on driver")
             // Dummy jobconf to get to the pathFilter defined in configuration
    -        val jobConf = new JobConf(hadoopConf, this.getClass())
    +        val jobConf = new JobConf(hadoopConf, this.getClass)
             val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
    -        if (pathFilter != null) {
    +        val statuses = if (pathFilter != null) {
               Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty)
             } else {
               Try(fs.listStatus(path)).getOrElse(Array.empty)
             }
    +
    +        fs match {
    +          case _: S3FileSystem =>
    --- End diff --
    
    S3AFileSystem / NativeS3FileSystem. S3FileSystem is for the block store.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212461507
  
    **[Test build #56354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56354/consoleFull)** for PR 12153 at commit [`79beca6`](https://github.com/apache/spark/commit/79beca6ad09df76a39cc1a80eebed378e3b2fffc).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206941522
  
    **[Test build #55225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55225/consoleFull)** for PR 12153 at commit [`17aad3e`](https://github.com/apache/spark/commit/17aad3e7debdf6c163ab5cdd76df25794a41bbca).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58612163
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -86,4 +87,8 @@ class FileScanRDD(
       }
     
       override protected def getPartitions: Array[Partition] = filePartitions.toArray
    +
    +  override protected def getPreferredLocations(split: Partition): Seq[String] = {
    +    split.asInstanceOf[FilePartition].files.flatMap(_.locations).distinct
    --- End diff --
    
    Looks we want to pick the location based on the size of a block? In MapOutputTracker we have `getPreferredLocationsForShuffle`, which can be related to us.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r60174434
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -621,20 +621,40 @@ class HDFSFileCatalog(
     
       def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path)
     
    +  private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus])
    +    extends Iterator[LocatedFileStatus] {
    +
    +    override def hasNext: Boolean = iterator.hasNext
    +
    +    override def next(): LocatedFileStatus = iterator.next()
    +  }
    +
       private def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
         if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) {
           HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext)
    --- End diff --
    
    Seems we also need to update the `listLeafFiles` that is called by `listLeafFilesInParallel`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211594412
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56090/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: e[SPARK-14369][SQL][WIP][test-hadoop2.3] Local...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205573868
  
    **[Test build #54910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54910/consoleFull)** for PR 12153 at commit [`e846fd1`](https://github.com/apache/spark/commit/e846fd132f057e283a51cfcc3632f936474b46f1).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59577970
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
         case _ => Nil
       }
    +
    +  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match {
    +    case f: LocatedFileStatus => f.getBlockLocations
    +    case f => Array.empty[BlockLocation]
    +  }
    +
    +  // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of
    +  // the same file, find out the index of the block that contains this `offset`. If no such block
    +  // can be found, returns -1.
    +  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = {
    --- End diff --
    
    This method is only called once, should we inline it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58720687
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -86,4 +87,8 @@ class FileScanRDD(
       }
     
       override protected def getPartitions: Array[Partition] = filePartitions.toArray
    +
    +  override protected def getPreferredLocations(split: Partition): Seq[String] = {
    +    split.asInstanceOf[FilePartition].files.flatMap(_.locations).distinct
    --- End diff --
    
    Added a constraint so that we only pick locations of blocks whose sizes are larger than 10% of the whole file partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: e[SPARK-14369][SQL][WIP][test-hadoop2.3] Local...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205547359
  
    **[Test build #54910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54910/consoleFull)** for PR 12153 at commit [`e846fd1`](https://github.com/apache/spark/commit/e846fd132f057e283a51cfcc3632f936474b46f1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-208456006
  
    **[Test build #55529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55529/consoleFull)** for PR 12153 at commit [`739420b`](https://github.com/apache/spark/commit/739420b20b4cd398b4b43f3f5a3db36f4023cdb3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211638453
  
    **[Test build #56158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56158/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r58613973
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -205,4 +210,20 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
         case _ => Nil
       }
    +
    +  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match {
    +    case f: LocatedFileStatus => f.getBlockLocations
    +    case f => Array.empty[BlockLocation]
    +  }
    +
    +  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = {
    --- End diff --
    
    What is the semantic of `offset`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212408238
  
    **[Test build #56354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56354/consoleFull)** for PR 12153 at commit [`79beca6`](https://github.com/apache/spark/commit/79beca6ad09df76a39cc1a80eebed378e3b2fffc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: e[SPARK-14369][SQL][WIP][test-hadoop2.3] Local...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-205574559
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54910/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-211423901
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59733711
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---
    @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
           df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t")
         }
       }
    +
    +  test("Locality support for FileScanRDD") {
    +    withHadoopConf(
    +      "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
    +      "fs.file.impl.disable.cache" -> "true"
    +    ) {
    +      withTempPath { dir =>
    +        val path = "file://" + dir.getCanonicalPath
    +        val df1 = sqlContext.range(4)
    +        df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
    +        df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
    +
    +        val df2 = sqlContext.read
    +          .format(dataSourceName)
    +          .option("dataSchema", df1.schema.json)
    +          .load(path)
    +
    +        val Some(fileScanRDD) = {
    +          def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = {
    +            rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG)
    +          }
    +
    +          // We have to search for the `FileScanRDD` along the RDD DAG since
    +          // RDD.preferredLocations doesn't propagate along the DAG to the root RDD.
    +          allRDDsInDAG(df2.rdd).collectFirst {
    +            case f: FileScanRDD => f
    +          }
    +        }
    +
    +        val partitions = fileScanRDD.partitions
    --- End diff --
    
    If we expect them be packed in the same partition, let's add an `assert`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-206988612
  
    **[Test build #55228 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55228/consoleFull)** for PR 12153 at commit [`3a0e2f7`](https://github.com/apache/spark/commit/3a0e2f71e9ea147083d4d26167fc58ce0b3323cb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59707908
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
     
         case _ => Nil
       }
    +
    +  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match {
    +    case f: LocatedFileStatus => f.getBlockLocations
    +    case f => Array.empty[BlockLocation]
    +  }
    +
    +  // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of
    +  // the same file, find out the index of the block that contains this `offset`. If no such block
    +  // can be found, returns -1.
    +  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = {
    +    blockLocations.indexWhere { b =>
    +      b.getOffset <= offset && offset < b.getOffset + b.getLength
    +    }
    +  }
    +
    +  // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of
    +  // the same file, find out the block that contains this `offset`, and returns location hosts of
    +  // that block. If no such block can be found, returns an empty array.
    +  private def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = {
    --- End diff --
    
    Makes sense, updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59598886
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -86,4 +87,9 @@ class FileScanRDD(
       }
     
       override protected def getPartitions: Array[Partition] = filePartitions.toArray
    +
    +  override protected def getPreferredLocations(split: Partition): Seq[String] = {
    +    val files = split.asInstanceOf[FilePartition].files
    +    if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations
    --- End diff --
    
    Should we pick the top 3 host that hold most of the data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-209528671
  
    Overall LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12153#issuecomment-212462180
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56354/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12153#discussion_r59576310
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---
    @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
           df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t")
         }
       }
    +
    +  test("Locality support for FileScanRDD") {
    +    withHadoopConf(
    +      "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
    +      "fs.file.impl.disable.cache" -> "true"
    +    ) {
    +      withTempPath { dir =>
    +        val path = "file://" + dir.getCanonicalPath
    +        val df1 = sqlContext.range(4)
    +        df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
    +        df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
    +
    +        val df2 = sqlContext.read
    +          .format(dataSourceName)
    +          .option("dataSchema", df1.schema.json)
    +          .load(path)
    +
    +        val Some(fileScanRDD) = {
    --- End diff --
    
    how about:
    ```
    val fileScanRDD = df2.queryExecution.executedPlan.collectFirst {
      case scan: DataSourceScan if scan.rdd.isInstanceOf[FileScanRDD] => scan.rdd.asInstanceOf[FileScanRDD]
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org