Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/05/25 09:33:24 UTC

[GitHub] spark pull request #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/18105

    [SPARK-20881] [SQL] Use Hive's stats in metastore when cbo is disabled

    ## What changes were proposed in this pull request?
    
    Spark's statistics are generated by the "analyze" command.
    
    However, when a user updates the table and collects stats in Hive, "totalSize"/"numRows" are updated in the metastore.
    
    On the Spark side, the table stats are then stale even if CBO is turned off, because in the current implementation Spark's stats always override Hive's stats, whether CBO is enabled or disabled.
    
    The right thing to do is to use Hive's stats in this case.
    
    ## How was this patch tested?
    
    Added a test case.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark cboSwitchStats

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18105
    
----
commit cdb9b17266151123802324ae788762f58884c2b5
Author: wangzhenhua <wa...@huawei.com>
Date:   2017-05-25T07:03:43Z

    respect spark/hive stats based on cbo conf

commit 4655292a0d2136adb4530766311f0ab4f8821e39
Author: Zhenhua Wang <wz...@163.com>
Date:   2017-05-25T08:14:08Z

    fix tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    I don't think the analyze table command is bound to CBO either. I just want to change how we read stats from the metastore, that is, which side's stats (Spark's or Hive's) we respect based on the CBO switch.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    @cloud-fan > What was the behavior before?
    
    Previously, the analyze table command only updated the size of the table; it used the same Hive stats name, "totalSize", and stored it in the metastore as a table parameter.
    
    Now that we use a different name for Spark's totalSize, the two may differ. In fact, the "totalSize" parameter changes even in Spark when you put new data into the table, which seems to be the Hive metastore's behavior.
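
A minimal sketch of how the two values can drift apart, modelling the table parameters as a plain dictionary. Only "totalSize" is named in the thread; the Spark-side key name and both helper functions are assumptions made for illustration, not the actual Spark or Hive code.

```python
# "totalSize" is the Hive parameter quoted in the thread.
HIVE_TOTAL_SIZE = "totalSize"
# Assumed name for Spark's separate size parameter (not stated in the thread).
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"


def spark_analyze(params, actual_size):
    """Hypothetical model of Spark's analyze command: store the size under Spark's own key."""
    updated = dict(params)
    updated[SPARK_TOTAL_SIZE] = str(actual_size)
    return updated


def metastore_insert(params, added_bytes):
    """Hypothetical model of the metastore refreshing "totalSize" when data is added."""
    updated = dict(params)
    current = int(params.get(HIVE_TOTAL_SIZE, "0"))
    updated[HIVE_TOTAL_SIZE] = str(current + added_bytes)
    return updated


params = {HIVE_TOTAL_SIZE: "1000"}
params = spark_analyze(params, 1000)    # Spark records 1000 under its own key
params = metastore_insert(params, 500)  # new data arrives; only "totalSize" moves

print(params[HIVE_TOTAL_SIZE])   # 1500
print(params[SPARK_TOTAL_SIZE])  # 1000 (stale until the table is re-analyzed)
```

In this model the Spark-side value stays at its last analyzed size, which is the staleness the pull request description refers to.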



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    We now have two sources of statistics, so we need a mechanism to decide which one is chosen. At the very least, we should update the code comments to document the behavior we choose.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    @cloud-fan I mean that the behavior when CBO is disabled should be the same as the earlier behavior without CBO.
    Previously, the size was read from "totalSize", and it changed after an update.
    Now, when CBO is enabled, I agree that the user is responsible for re-analyzing. But when CBO is disabled, I think the user expects the behavior without CBO, i.e. reading "totalSize" from the metastore.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    **[Test build #77365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77365/testReport)** for PR 18105 at commit [`59619ee`](https://github.com/apache/spark/commit/59619ee7274fbf37547c9a59fec4bb7832d2693e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77359/



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Clearly document the mechanism to ch...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    **[Test build #77490 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77490/testReport)** for PR 18105 at commit [`556a56f`](https://github.com/apache/spark/commit/556a56fcea889e1a99d5310932933213c97e232b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    **[Test build #77365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77365/testReport)** for PR 18105 at commit [`59619ee`](https://github.com/apache/spark/commit/59619ee7274fbf37547c9a59fec4bb7832d2693e).



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    @cloud-fan > What was the behavior before?



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Clearly document the mechanism to ch...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    **[Test build #77490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77490/testReport)** for PR 18105 at commit [`556a56f`](https://github.com/apache/spark/commit/556a56fcea889e1a99d5310932933213c97e232b).



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    **[Test build #77359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77359/testReport)** for PR 18105 at commit [`4655292`](https://github.com/apache/spark/commit/4655292a0d2136adb4530766311f0ab4f8821e39).



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Clearly document the mechanism to ch...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    LGTM pending Jenkins



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    I think we'd better respect the "totalSize" stats when CBO is disabled; otherwise the user has no way to get the default behavior unless they re-run the analyze command. I personally think that's unfriendly to users.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    I think we should always trust Spark's table stats over Hive's, whether CBO is on or not. If users update the stats on the Hive side, it's their own responsibility to update them on the Spark side.
    
    IIUC `AnalyzeTableCommand` appeared before CBO, right? What was the behavior before?



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    cc @cloud-fan @gatorsmile 



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    > I also think we should respect Spark-generated statistics over Hive's when it is available.
    
    @gatorsmile OK. Then it's consistent with the current implementation. I'll change the description of this PR and update the code comments based on this mechanism.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    **[Test build #77359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77359/testReport)** for PR 18105 at commit [`4655292`](https://github.com/apache/spark/commit/4655292a0d2136adb4530766311f0ab4f8821e39).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77365/



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Clearly document the mechanism to ch...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Use Hive's stats in metastore when c...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    If users have not analyzed the table in Spark yet, we should respect the stats from the Hive metastore. But if users have already run the analyze table command in Spark, I think it's fair to ask them to re-analyze if the data changed. BTW, I don't think the analyze table command is bound to CBO. If you think the behavior is reasonable when CBO is on, I think it's also reasonable when CBO is off.
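
The precedence rule described in this comment can be sketched as follows: prefer Spark's own size if the table has been analyzed in Spark, otherwise fall back to Hive's "totalSize". This is an illustration only. The function name and the Spark-side key are assumptions, and the real logic lives in Spark's catalog code, which is not shown in this thread.

```python
from typing import Mapping, Optional

# "totalSize" is the Hive parameter quoted in the thread.
HIVE_TOTAL_SIZE = "totalSize"
# Assumed name for Spark's separate size parameter (not stated in the thread).
SPARK_TOTAL_SIZE = "spark.sql.statistics.totalSize"


def choose_table_size(params: Mapping[str, str]) -> Optional[int]:
    """Hypothetical helper: pick the table size in bytes, independent of the CBO switch."""
    if SPARK_TOTAL_SIZE in params:
        # The table was analyzed in Spark: trust Spark's value, even if it is stale.
        return int(params[SPARK_TOTAL_SIZE])
    if HIVE_TOTAL_SIZE in params:
        # Never analyzed in Spark: respect the stats from the Hive metastore.
        return int(params[HIVE_TOTAL_SIZE])
    # No statistics recorded on either side.
    return None


print(choose_table_size({"totalSize": "1500"}))  # 1500
print(choose_table_size({"totalSize": "1500",
                         "spark.sql.statistics.totalSize": "1000"}))  # 1000
print(choose_table_size({}))  # None
```

Note that the function takes no CBO flag at all, which mirrors the position that the choice of source should not depend on whether CBO is on.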



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Clearly document the mechanism to ch...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77490/



[GitHub] spark issue #18105: [SPARK-20881] [SQL] Clearly document the mechanism to ch...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18105
  
    Thanks! Merging to master.



[GitHub] spark pull request #18105: [SPARK-20881] [SQL] Clearly document the mechanis...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18105

