Posted to reviews@spark.apache.org by dilipbiswal <gi...@git.apache.org> on 2017/08/04 18:36:33 UTC

[GitHub] spark pull request #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRAN...

GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/18847

    [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException

    ## What changes were proposed in this pull request?
    Backports the following JIRAs into 2.2.
    ```
    [SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
    [SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats
    [SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException
    ```
    ## How was this patch tested?
    Test cases were added as part of the original fixes.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark datasource_stat_2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18847
    
----
commit 707529428a23ffd65c8212a273d12a4df58b39e6
Author: Xiao Li <ga...@gmail.com>
Date:   2017-05-23T00:28:30Z

    [SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
    
    ### What changes were proposed in this pull request?
    
    After adding the new `stats` field to `CatalogTable`, we should not expose Hive-specific stats metadata to `MetastoreRelation`. Doing so complicates all the related code and also introduces a bug in `SHOW CREATE TABLE`: the statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since they could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792
    
    This PR also fixes the issue of filling Hive-generated row counts into our stats.
    
    This PR handles Hive-specific stats metadata in `HiveClientImpl`.
    ### How was this patch tested?
    
    Added a few test cases.
    
    Author: Xiao Li <ga...@gmail.com>
    
    Closes #14971 from gatorsmile/showCreateTableNew.

commit a933350805eda961e41a429317cd3397d159a6fb
Author: Zhenhua Wang <wz...@163.com>
Date:   2017-06-12T00:23:04Z

    [SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats
    
    ## What changes were proposed in this pull request?
    
    Currently, Hive's stats are read into `CatalogStatistics`, while Spark's stats are also persisted through `CatalogStatistics`. As a result, Hive's stats can be unexpectedly propagated into Spark's stats.
    
    For example, for a catalog table, we read stats from Hive (e.g. "totalSize") and put them into `CatalogStatistics`. Then, when an "ALTER TABLE" command runs, we store the stats in `CatalogStatistics` into the metastore as Spark's stats, because we cannot tell whether they came from Spark or not. But Spark's stats should only be generated by the "ANALYZE" command, so this is an unexpected side effect of "ALTER TABLE".
    
    Secondly, once Spark's stats are in the metastore, inserting new data leaves `sizeInBytes` in `CatalogStatistics` wrong even though Hive has updated "totalSize" in the metastore, because we respect Spark's stats (which should not exist) over Hive's stats.
    
    A running example is shown in [JIRA](https://issues.apache.org/jira/browse/SPARK-21031).
    
    To fix this, we add a new method `alterTableStats` to store spark's stats, and let `alterTable` keep existing stats.
    
    ## How was this patch tested?
    
    Added new tests.
    
    Author: Zhenhua Wang <wz...@163.com>
    
    Closes #18248 from wzhfy/separateHiveStats.
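
The split between `alterTable` and `alterTableStats` described in the SPARK-21031 commit message above can be illustrated with a small, self-contained Python sketch. It is not Spark's implementation: `ToyMetastore`, its methods, and the `spark.sql.statistics.` property prefix used here are assumptions made for illustration, and only the Hive-maintained "totalSize" property comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed prefix for Spark-owned statistics properties (illustrative only).
SPARK_STATS_PREFIX = "spark.sql.statistics."
SPARK_SIZE_KEY = SPARK_STATS_PREFIX + "totalSize"
HIVE_SIZE_KEY = "totalSize"  # Hive-maintained property named in the text


@dataclass
class ToyMetastore:
    """Hypothetical in-memory stand-in for one table's metastore properties."""
    props: Dict[str, str] = field(default_factory=dict)

    def alter_table(self, new_props: Dict[str, str]) -> None:
        """Apply an ALTER TABLE-style update without touching Spark's stats."""
        for key, value in new_props.items():
            if not key.startswith(SPARK_STATS_PREFIX):
                self.props[key] = value

    def alter_table_stats(self, size_in_bytes: Optional[int]) -> None:
        """Store (or clear, when None) Spark's stats; the ANALYZE-style path."""
        if size_in_bytes is None:
            self.props.pop(SPARK_SIZE_KEY, None)
        else:
            self.props[SPARK_SIZE_KEY] = str(size_in_bytes)

    def size_in_bytes(self) -> Optional[int]:
        """Prefer Spark's stats; fall back to Hive's totalSize if absent."""
        raw = self.props.get(SPARK_SIZE_KEY, self.props.get(HIVE_SIZE_KEY))
        return int(raw) if raw is not None else None
```

In this toy model only `alter_table_stats` (the stand-in for the ANALYZE path) ever writes Spark-side stats, so a later Hive update to "totalSize" stays visible through `size_in_bytes` until ANALYZE runs again.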

commit a03e188818b1505383f3487904d62d90519e72c9
Author: Dilip Biswal <db...@us.ibm.com>
Date:   2017-08-03T16:25:48Z

    [SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException
    
    For datasource tables stored in a non-Hive-compatible way, the schema information is recorded as table properties in the Hive metastore. The `alterTableStats` method needs to get the schema information from the table properties for such tables before recording column-level statistics. Currently, we don't get the correct schema information and fail with java.util.NoSuchElementException.
    
    A new test case is added in StatisticsSuite.
    
    Author: Dilip Biswal <db...@us.ibm.com>
    
    Closes #18804 from dilipbiswal/datasource_stats.
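
The failure mode described in the SPARK-21599 commit message above can be sketched in Python under stated assumptions. The property keys `spark.sql.sources.schema.numParts` and `spark.sql.sources.schema.part.N`, the JSON layout, the placeholder raw schema, and both helper functions are illustrative assumptions rather than Spark's actual code; Python's `KeyError` stands in for `java.util.NoSuchElementException`.

```python
import json
from typing import Dict, List

# Assumed property keys under which the real schema is stored (illustrative).
NUM_PARTS_KEY = "spark.sql.sources.schema.numParts"
PART_KEY_PREFIX = "spark.sql.sources.schema.part."


def schema_from_properties(props: Dict[str, str]) -> Dict[str, str]:
    """Reassemble the JSON schema split across properties into name -> type."""
    num_parts = int(props[NUM_PARTS_KEY])
    schema_json = "".join(props[f"{PART_KEY_PREFIX}{i}"] for i in range(num_parts))
    return {f["name"]: f["type"] for f in json.loads(schema_json)["fields"]}


def column_types_for_stats(
    raw_schema: Dict[str, str], props: Dict[str, str], stat_columns: List[str]
) -> Dict[str, str]:
    """Resolve column types for stats, restoring the schema from properties
    when the table carries one; otherwise use the raw metastore schema."""
    schema = schema_from_properties(props) if NUM_PARTS_KEY in props else raw_schema
    # A lookup against the wrong schema raises KeyError for a missing column.
    return {name: schema[name] for name in stat_columns}


# Hypothetical table: the raw metastore schema is only a placeholder, and the
# real schema lives in the table properties, split into two parts.
full_json = json.dumps({"type": "struct", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
]})
half = len(full_json) // 2
table_props = {
    NUM_PARTS_KEY: "2",
    PART_KEY_PREFIX + "0": full_json[:half],
    PART_KEY_PREFIX + "1": full_json[half:],
}
placeholder_schema = {"col": "array<string>"}
```

Calling `column_types_for_stats(placeholder_schema, table_props, ["id"])` resolves the column through the restored schema, whereas looking up `"id"` directly in `placeholder_schema` raises `KeyError`, the analogue of the exception named in the commit message.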

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2]...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18847
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2]...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18847
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80260/




[GitHub] spark issue #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2]...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18847
  
    I see. Thanks!




[GitHub] spark issue #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18847
  
    **[Test build #80260 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80260/testReport)** for PR 18847 at commit [`a03e188`](https://github.com/apache/spark/commit/a03e188818b1505383f3487904d62d90519e72c9).




[GitHub] spark issue #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18847
  
    **[Test build #80260 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80260/testReport)** for PR 18847 at commit [`a03e188`](https://github.com/apache/spark/commit/a03e188818b1505383f3487904d62d90519e72c9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2]...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/18847
  
    @gatorsmile Thanks! To the best of my knowledge, 2.2 does not have the problem of the ANALYZE TABLE command failing with java.util.NoSuchElementException. In 2.2, column stats are added using the alterTable method. Currently we use the alterTableStats method, where this problem exists. alterTableStats was introduced as part of SPARK-21031.
    
    I will close this PR for now. 




[GitHub] spark pull request #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRAN...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal closed the pull request at:

    https://github.com/apache/spark/pull/18847




[GitHub] spark issue #18847: [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2]...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/18847
  
    @gatorsmile I just created this PR for you to take a look and decide whether we need to backport the above 3 PRs. The problem in SPARK-21599 does not exist on 2.2, as it was introduced as part of SPARK-21031. I took a look at the problem mentioned in SPARK-21031, and it seems like we may want that fix in 2.2?
    
    I will close this PR if you decide not to backport these. Thanks.

