You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by mallman <gi...@git.apache.org> on 2016/06/21 21:17:01 UTC

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

GitHub user mallman opened a pull request:

    https://github.com/apache/spark/pull/13818

    [SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached

    (Please note this is a revision of PR #13686, which has been closed in favor of this PR.)
    
    ## What changes were proposed in this pull request?
    
    The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss.
    
    Rather than get `pathsInMetastore` from
    
        metastoreRelation.catalogTable.storage.locationUri.toSeq
    
    I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5:
    
    https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.
    
    I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog.
    
    ## How was this patch tested?
    
    I've added a new unit test to `parquetSuites.scala`:
    
        SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation
    
    Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch.
    
    Note I also modified a different but related unit test in `parquetSuites.scala`:
    
        SPARK-15248: explicitly added partitions should be readable
    
    This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added. I changed the test so that, instead of adding the data as a parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test.
    
    In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating.
    
    Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL in this cases.
    
    Thanks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13818.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13818
    
----
commit 8a058c65c6c20e311bde5c0ade87c14c6b6b5f37
Author: Michael Allman <mi...@videoamp.com>
Date:   2016-06-15T16:52:17Z

    [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
    partitioned metastore relation when searching the internal table cache
    
    The `getCached` method of `HiveMetastoreCatalog` computes
    `pathsInMetastore` from the metastore relation's catalog table. This
    only returns the table base path, which is not correct for nonempty
    partitioned tables. As a result, cached lookups on nonempty partitioned
    tables always miss.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69107723
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -191,6 +191,7 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     
       private def getCached(
           tableIdentifier: QualifiedTableName,
    +      pathsInMetastore: Seq[String],
    --- End diff --
    
    While the following commit does not remove that argument, it appears to be the one that changes the behavior for how partitioned tables are looked up in the cache:
    
    https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf#diff-ee66e11b56c21364760a5ed2b783f863R508
    
    I'm not sure how to find a PR in GitHub associated with a given commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r95064132
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    We only need to invalidate the cache for partitioned tables. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13818


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    cc @cloud-fan / @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    LGTM, cc @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    **[Test build #61730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61730/consoleFull)** for PR 13818 at commit [`91ef950`](https://github.com/apache/spark/commit/91ef9508a2cb992772547bc2cc5c01c63b41abd4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61594/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    > I have a few questions.
    > 
    > Is it a regression from 1.6? Looks like not?
    
    I don't know about 1.6. I know it's a regression from 1.5.
    
    > Is it a correctness issue or a performance issue? Seems it is a performance issue?
    
    It is a performance issue.
    
    > If it is a performance issue. What is the impact? For a hive parquet/orc table, after we convert them to Spark's native code path, there is no partitioning discovery. So, I guess the performance is mainly coming from querying metastore? If so, what will be the perf difference after spark.sql.hive.metastorePartitionPruning (only querying needed partition info from Hive metastore) is enabled?
    
    The problem this PR addresses occurs in the analysis phase of query planning. The property `spark.sql.hive.metastorePartitionPruning` only comes into play in `HiveTableScanExec`, which is part of physical planning. (And I don't believe it's used to read Parquet tables.) Therefore, that property has no bearing on this problem.
    
    Regarding the impact, I'll quote from the last paragraph of the PR description:
    
    > Building a large HadoopFsRelation requires stat-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.
    
    ----
    
    > My feeling is that if it is a perf issue and it is not a regression from 1.6, merging to master should be good enough.
    
    For some (like us), I'd say this extends beyond a performance issue into a usability issue. We can't use Spark 2.0 as-is if it takes us several minutes to build a query plan.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    **[Test build #61594 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61594/consoleFull)** for PR 13818 at commit [`c2ba4af`](https://github.com/apache/spark/commit/c2ba4af116c9e999f2fa2f68868b72648b4234c7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69230833
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
             PartitionDirectory(values, location)
           }
           val partitionSpec = PartitionSpec(partitionSchema, partitions)
    +      val partitionPaths = partitions.map(_.path.toString)
    +      val paths = partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)
    --- End diff --
    
    It is weird but correct.
    
    Essentially, the expression
    
        partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)
    
    will return `partitionPaths` if `partitionPaths` is nonempty and `metastoreRelation.hiveQlTable.getDataLocation.toString` otherwise. This fits the logic that seems to be present wherever partitioned tables are handled in the Spark codebase: use the table base location for empty partitioned tables, and use the partition data locations for nonempty partitioned tables. More specifically, the base table path is _not_ included in the latter case.
    
    The expression you suggest will always return the table base location as one of the table data paths, whether `partitionPaths` is empty or not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69232046
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
             PartitionDirectory(values, location)
           }
           val partitionSpec = PartitionSpec(partitionSchema, partitions)
    +      val partitionPaths = partitions.map(_.path.toString)
    +      val paths = partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)
    --- End diff --
    
    also add some comments to explain it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by erfangc <gi...@git.apache.org>.

Github user erfangc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r73814190
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    Ok, will do - thought so too as this relates to `InsertIntoHive` was a hail mary


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    LGTM except for minor styling issues. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69087307
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -191,6 +191,7 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     
       private def getCached(
           tableIdentifier: QualifiedTableName,
    +      pathsInMetastore: Seq[String],
    --- End diff --
    
    so this argument was in 1.5, can you find out the PR that removed it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69142557
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
         cachedDataSourceTables.getIfPresent(tableIdentifier) match {
           case null => None // Cache miss
           case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
    -        val pathsInMetastore = metastoreRelation.catalogTable.storage.locationUri.toSeq
    --- End diff --
    
    > `relation.location.paths` returns the locations of the partitions
    
    How does this happen?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r95064489
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    We are facing the same issue in LOAD.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69233431
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
             PartitionDirectory(values, location)
           }
           val partitionSpec = PartitionSpec(partitionSchema, partitions)
    +      val partitionPaths = partitions.map(_.path.toString)
    +      val paths = partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)
    --- End diff --
    
    Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r68064203
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
    @@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
         }
       }
     
    +  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
    --- End diff --
    
    I looked in `CachedTableSuite`. I'm not sure that's a good place for this kind of test. That test suite seems focused on testing tables cached by the [CacheManager](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala). This patch is focused on table caching in [HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala).
    
    It's difficult to find the best place for these kinds of caching tests. I chose this file because it already had some of these tests. Perhaps [HiveMetastoreCatalogSuite](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala) would be a good candidate for an alternative?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    **[Test build #3124 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3124/consoleFull)** for PR 13818 at commit [`8a058c6`](https://github.com/apache/spark/commit/8a058c65c6c20e311bde5c0ade87c14c6b6b5f37).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    **[Test build #61730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61730/consoleFull)** for PR 13818 at commit [`91ef950`](https://github.com/apache/spark/commit/91ef9508a2cb992772547bc2cc5c01c63b41abd4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69232304
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    ah so essentially we have 2 caches to invalidate: the data of table, the metadata of table


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    **[Test build #61594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61594/consoleFull)** for PR 13818 at commit [`c2ba4af`](https://github.com/apache/spark/commit/c2ba4af116c9e999f2fa2f68868b72648b4234c7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by erfangc <gi...@git.apache.org>.

Github user erfangc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r73807925
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    Hi @mallman does this change cause new sessions (ex: external App to the ThriftServer via JDBC) to not see the cached tables? I noticed this in the released version 2.0.0 whereby `CACHE TABLE` in one session has no effect on new sessions. Future SQL statements are still reading the underlying Parquet files from Disk (as evidenced by tasks being `NODE_LOCAL` and `RACK_LOCAL` instead of `PROCESS_LOCAL`). Sorry if this question is unrelated to your patch, but this became a major issue in 2.0.0 for us, where as in 1.6.2 we do not have an issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    @hvanhovell I'm mentioning you here because you commented on my previous PR for this Jira issue. In response to your original question, yes, I have added a unit test for this patch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69159655
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
         cachedDataSourceTables.getIfPresent(tableIdentifier) match {
           case null => None // Cache miss
           case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
    -        val pathsInMetastore = metastoreRelation.catalogTable.storage.locationUri.toSeq
    --- End diff --
    
    This is where the relation's paths are computed and the logic for empty versus non-empty partitioned tables diverges: https://github.com/VideoAmp/spark-public/blob/8a058c65c6c20e311bde5c0ade87c14c6b6b5f37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L489-L493.
    
    I believe this is the PR where this behavior was introduced: #13022.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69232024
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
             PartitionDirectory(values, location)
           }
           val partitionSpec = PartitionSpec(partitionSchema, partitions)
    +      val partitionPaths = partitions.map(_.path.toString)
    +      val paths = partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)
    --- End diff --
    
    ah i see, can you use a `if-else` here? It's better to reduce the time for people to understand it, rather than the line of code :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    Shall we also have this in branch-2.0? This seems to be a pretty serious bug. cc @rxin.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69448268
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
    @@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
         }
       }
     
    +  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
    +      "relation") {
    +    withTable("partitioned") {
    +      sql(
    +        s"""CREATE TABLE partitioned (
    +           | key INT,
    +           | value STRING
    +           |)
    +           |PARTITIONED BY (part INT)
    +           |STORED AS PARQUET
    +       """.stripMargin)
    +      sql("INSERT INTO TABLE partitioned PARTITION(part=0) SELECT 1 as key, 'one' as value")
    +
    +      // First lookup fills the cache
    +      val r1 = collectHadoopFsRelation (table("partitioned"))
    +      // Second lookup should reuse the cache
    +      val r2 = collectHadoopFsRelation (table("partitioned"))
    --- End diff --
    
    Nit: Remove space before `(`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69106546
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
         cachedDataSourceTables.getIfPresent(tableIdentifier) match {
           case null => None // Cache miss
           case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
    -        val pathsInMetastore = metastoreRelation.catalogTable.storage.locationUri.toSeq
    --- End diff --
    
        metastoreRelation.catalogTable.storage.locationUri.toSeq
    
    returns the base path of the relation. This is then compared to `relation.location.paths` to validate the cached entry. For non-empty partitioned tables (by that I mean partitioned tables with one or more metastore partitions), `relation.location.paths` returns the locations of the partitions. Hence, these values will never be equal and `useCached` will always be `false`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r73892580
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    @erfangc I concur with @cloud-fan.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    @mallman never mind. https://github.com/apache/spark/commit/5b7a1770ac9cf36a1e92b31d10fe6eeee92fef17 fixed the issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69231754
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    Essentially this is a "callback" to the session catalog to invalidate this table in the catalog's table cache, because we just appended to the underlying table data. In the context of a Hive query, the session catalog's table cache is `HiveMetastoreCatalog.cachedDataSourceTables`. The results of the `INSERT` will be invisible in the current session until the table is invalidated.
    
    Another way to think about this code is that it's precisely what makes the following test snippet work: https://github.com/VideoAmp/spark-public/blob/spark-15968/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L582-L585
    
    Does that answer your question?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    @zsxwing I was able to do following without error:
    
        git clone git@github.com:apache/spark.git spark-master
        cd spark-master
        ./dev/change-scala-version.sh 2.10
        ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69087188
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
         cachedDataSourceTables.getIfPresent(tableIdentifier) match {
           case null => None // Cache miss
           case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
    -        val pathsInMetastore = metastoreRelation.catalogTable.storage.locationUri.toSeq
    --- End diff --
    
    Sorry I may not have enough background knowledge to understand this, can you explain a bit more about why this doesn't work?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69228447
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    do you mind explain a bit more about this change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r95064603
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    Let me submit a PR to fix these issues.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    **[Test build #3124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3124/consoleFull)** for PR 13818 at commit [`8a058c6`](https://github.com/apache/spark/commit/8a058c65c6c20e311bde5c0ade87c14c6b6b5f37).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    FYI this breaks Scala 2.10:
    ```
    [info] Compiling 254 Scala sources and 5 Java sources to /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/mllib/target/scala-2.10/classes...
    [error] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:301: value invalidateTable is not a member of org.apache.spark.sql.catalyst.catalog.SessionCatalog
    [error]     sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    [error]                                     ^
    [error] one error found
    [warn] Multiple main classes detected.  Run 'show discoveredMainClasses' to see the list
    [info] Packaging /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/mllib/target/scala-2.10/spark-mllib_2.10-2.0.0-SNAPSHOT.jar ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    I believe I've addressed @liancheng's style issues in my new unit test, along with the same in the two tests from which it was copy-pasta'd (boy scout rule). Hopefully I didn't cock it up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69448203
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
    @@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
         }
       }
     
    +  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
    +      "relation") {
    +    withTable("partitioned") {
    +      sql(
    +        s"""CREATE TABLE partitioned (
    +           | key INT,
    +           | value STRING
    +           |)
    +           |PARTITIONED BY (part INT)
    +           |STORED AS PARQUET
    +       """.stripMargin)
    --- End diff --
    
    Nit: Indentation is off here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r73813847
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
     
         // Invalidate the cache.
         sqlContext.sharedState.cacheManager.invalidateCache(table)
    +    sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
    --- End diff --
    
    Can you create a JIRA for this? Seems it's not related to this PR


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69228137
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
             PartitionDirectory(values, location)
           }
           val partitionSpec = PartitionSpec(partitionSchema, partitions)
    +      val partitionPaths = partitions.map(_.path.toString)
    +      val paths = partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)
    --- End diff --
    
    this looks weird, do you mean `partitionPath :+ metastoreRelation.hiveQlTable.getDataLocation.toString`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61730/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by mallman <gi...@git.apache.org>.

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    You are very welcome. Thank you for taking time to review it! 😃 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    I have a few questions. 
    
    1. Is it a regression from 1.6? Looks like not?
    2. Is it a correctness issue or a performance issue? Seems it is a performance issue?
    3. If it is a performance issue. What is the impact? For a hive parquet/orc table, after we convert them to Spark's native code path, there is no partitioning discovery. So, I guess the performance is mainly coming from querying metastore? If so, what will be the perf difference after `spark.sql.hive.metastorePartitionPruning` (only querying needed partition info from Hive metastore) is enabled?
    
    My feeling is that if it is a perf issue and it is not a regression from 1.6, merging to master should be good enough. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r67994573
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
    @@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
         }
       }
     
    +  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
    --- End diff --
    
    Could you take a look a [CachedTableSuite](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala) and add the test there (and also use a similar approach).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    seems a reasonable change to me, thanks for working on it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13818#discussion_r69448259
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
    @@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
         }
       }
     
    +  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
    +      "relation") {
    +    withTable("partitioned") {
    +      sql(
    +        s"""CREATE TABLE partitioned (
    +           | key INT,
    +           | value STRING
    +           |)
    +           |PARTITIONED BY (part INT)
    +           |STORED AS PARQUET
    +       """.stripMargin)
    +      sql("INSERT INTO TABLE partitioned PARTITION(part=0) SELECT 1 as key, 'one' as value")
    +
    +      // First lookup fills the cache
    +      val r1 = collectHadoopFsRelation (table("partitioned"))
    --- End diff --
    
    Nit: Remove space before `(`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13818
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org