You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by rxin <gi...@git.apache.org> on 2017/03/30 04:48:07 UTC

[GitHub] spark pull request #17476: [SPARK-20151][SQL] Account for partition pruning ...

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/17476

    [SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics

    ## What changes were proposed in this pull request?
    After SPARK-20136, we report metadata timing metrics in scan operator. However, that timing metric doesn't include one of the most important part of metadata, which is partition pruning. This patch adds that time measurement to the scan metrics.
    
    ## How was this patch tested?
    N/A - I tried adding a test in SQLMetricsSuite but it was extremely convoluted to the point that I'm not sure if this is worth it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-20151

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17476
    
----
commit 8789cf04ea4f7addbcd8da9d83615ee96d9bd192
Author: Reynold Xin <rx...@databricks.com>
Date:   2017-03-30T04:46:45Z

    [SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17476: [SPARK-20151][SQL] Account for partition pruning ...

Posted by adrian-ionescu <gi...@git.apache.org>.

Github user adrian-ionescu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17476#discussion_r108906608
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala ---
    @@ -111,7 +113,8 @@ private class PrunedInMemoryFileIndex(
         sparkSession: SparkSession,
         tableBasePath: Path,
         fileStatusCache: FileStatusCache,
    -    override val partitionSpec: PartitionSpec)
    +    override val partitionSpec: PartitionSpec,
    +    override val metadataOpsTimeNs: Option[Long])
    --- End diff --
    
    Add param doc, as it's not immediately obvious what a user is supposed to supply here.
    I'd say something like "time it took to obtain the partitionSpec from the Hive metastore", but maybe that's too specific..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17476: [SPARK-20151][SQL] Account for partition pruning in scan...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17476
  
    **[Test build #75379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75379/testReport)** for PR 17476 at commit [`8789cf0`](https://github.com/apache/spark/commit/8789cf04ea4f7addbcd8da9d83615ee96d9bd192).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17476: [SPARK-20151][SQL] Account for partition pruning in scan...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17476
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75379/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17476: [SPARK-20151][SQL] Account for partition pruning in scan...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/17476
  
    Merging in master.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17476: [SPARK-20151][SQL] Account for partition pruning ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17476


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17476: [SPARK-20151][SQL] Account for partition pruning in scan...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17476
  
    **[Test build #75379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75379/testReport)** for PR 17476 at commit [`8789cf0`](https://github.com/apache/spark/commit/8789cf04ea4f7addbcd8da9d83615ee96d9bd192).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17476: [SPARK-20151][SQL] Account for partition pruning ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17476#discussion_r109092246
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala ---
    @@ -111,7 +113,8 @@ private class PrunedInMemoryFileIndex(
         sparkSession: SparkSession,
         tableBasePath: Path,
         fileStatusCache: FileStatusCache,
    -    override val partitionSpec: PartitionSpec)
    +    override val partitionSpec: PartitionSpec,
    +    override val metadataOpsTimeNs: Option[Long])
    --- End diff --
    
    It actually includes more than that. We do file listing as part of that ...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17476: [SPARK-20151][SQL] Account for partition pruning ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17476#discussion_r108941649
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala ---
    @@ -72,4 +72,14 @@ trait FileIndex {
     
       /** Schema of the partitioning columns, or the empty schema if the table is not partitioned. */
       def partitionSchema: StructType
    +
    +  /**
    +   * Returns an optional metadata operation time, in nanoseconds, for listing files.
    +   *
    +   * We do file listing in query optimization (in order to get the proper statistics) and we want
    +   * to account for file listing time in physical execution (as metrics). To do that, we save the
    +   * file listing time in some implementations and physical execution calls it in this method
    +   * to update the metrics.
    +   */
    +  def metadataOpsTimeNs: Option[Long] = None
    --- End diff --
    
    I think it's hard to define the semantic of this method for general `FileIndex`, as everytime we call `listFiles`, the value of this method should be updated.
    
    how about we only put this method in `PrunedInMemoryFileIndex`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17476: [SPARK-20151][SQL] Account for partition pruning in scan...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/17476
  
    cc @ericl, @bogdanrdc, @adrian-ionescu, @cloud-fan


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17476: [SPARK-20151][SQL] Account for partition pruning in scan...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17476
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17476: [SPARK-20151][SQL] Account for partition pruning ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17476#discussion_r109092194
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala ---
    @@ -72,4 +72,14 @@ trait FileIndex {
     
       /** Schema of the partitioning columns, or the empty schema if the table is not partitioned. */
       def partitionSchema: StructType
    +
    +  /**
    +   * Returns an optional metadata operation time, in nanoseconds, for listing files.
    +   *
    +   * We do file listing in query optimization (in order to get the proper statistics) and we want
    +   * to account for file listing time in physical execution (as metrics). To do that, we save the
    +   * file listing time in some implementations and physical execution calls it in this method
    +   * to update the metrics.
    +   */
    +  def metadataOpsTimeNs: Option[Long] = None
    --- End diff --
    
    I thought about that but there is no API level guarantee that we'd get PrunedInMemoryFileIndex after partition pruning. It is more just a current implementation detail. I'd rather have something that's more specified in the API.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org