You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2017/11/28 11:02:05 UTC

[GitHub] spark pull request #19831: [SPARK-22489][SQL] Wrong Hive table statistics ma...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/19831

    [SPARK-22489][SQL] Wrong Hive table statistics may trigger OOM if enables join reorder in CBO

    ## What changes were proposed in this pull request?
    
    How to reproduce:
    ```basg
    bin/spark-shell --conf spark.sql.cbo.enabled=true --conf spark.sql.cbo.joinReorder.enabled=true
    ```
    ```scala
    import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
    
    spark.sql("CREATE TABLE small (c1 bigint) TBLPROPERTIES ('numRows'='3', 'rawDataSize'='600','totalSize'='800')")
    // Big table with wrong statistics, numRows=0
    spark.sql("CREATE TABLE big (c1 bigint) TBLPROPERTIES ('numRows'='0', 'rawDataSize'='60000000000', 'totalSize'='8000000000000')")
    
    val plan = spark.sql("select * from small t1 join big t2 on (t1.c1 = t2.c1)").queryExecution.executedPlan
    val buildSide = plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
    
    println(buildSide)
    ```
    
    The result is `BuildRight`, but the right side is the big table.
    
    
    For `big` table, `totalSize` or `rawDataSize` > 0, rowCount = 0. At least one other is wrong here.
    https://github.com/apache/spark/blob/ed7352f2191308965a1b2abb6cd075a90b7f7bb7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L432-L434
    
    This pr to ensure that the `totalSize` or `rawDataSize` > 0, rowCount also must be > 0.
    
    ## How was this patch tested?
    
    unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-22626

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19831.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19831
    
----
commit ed7352f2191308965a1b2abb6cd075a90b7f7bb7
Author: Yuming Wang <wg...@gmail.com>
Date:   2017-11-28T08:56:16Z

    if dataSize > 0, rowCount should bigger than 0.

commit b16f88ef971040e682fafe28f0ff06877814e3df
Author: Yuming Wang <wg...@gmail.com>
Date:   2017-11-28T10:33:46Z

    add test

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19831#discussion_r154250160
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
           // Note that this statistics could be overridden by Spark's statistics if that's available.
           val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
           val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
    -      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
    +      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
    --- End diff --
    
    Thanks for the investigation. Seems hive can't protect its stats properties.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    cc @wzhfy


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19831#discussion_r154069687
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
           // Note that this statistics could be overridden by Spark's statistics if that's available.
           val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
           val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
    -      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
    +      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
    --- End diff --
    
    `StatsSetupConst.COLUMN_STATS_ACCURATE` to ensure that statistics have been updated, but can not be guaranteed to be correct:
    ```bash
    cat <<EOF > data
    1,1
    2,2
    3,3
    4,4
    5,5
    EOF
    
    hive -e "CREATE TABLE spark_22626(c1 int, c2 int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
    
    hive -e "LOAD DATA local inpath 'data' into table spark_22626;"
    
    hive -e "INSERT INTO table spark_22626 values(6, 6);"
    
    hive -e "desc extended spark_22626;"
    ```
    
    The result is:
    ```
    parameters:{totalSize=24, numRows=1, rawDataSize=3, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}
    ```
    
    `numRows` should be 6, but got 1.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    cc @gatorsmile @cloud-fan


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    **[Test build #84259 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84259/testReport)** for PR 19831 at commit [`5c43b2a`](https://github.com/apache/spark/commit/5c43b2a4e43daa5e6936d2bb8a899b31e9a97d7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Yes, I saw some of these tables in my cluster, but the user did not manually modify this parameter:
    ```
    # Detailed Table Information		
    Database	dw	
    Table	prod	
    Owner	bi	
    Created Time	Tue Nov 03 16:33:52 CST 2015	
    Last Access	Thu Jan 01 08:00:00 CST 1970	
    Created By	Spark 2.2 or prior	
    Type	EXTERNAL	
    Provider	hive	
    Comment	Product list	
    Table Properties	[transient_lastDdlTime=1508260780, last_modified_time=1473154014, last_modified_by=bi]	
    Statistics	26596461123 bytes, 0 rows	
    Location	viewfs://cluster9/user/hive/warehouse/dw.db/prod	
    Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
    InputFormat	org.apache.hadoop.mapred.TextInputFormat	
    OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
    Storage Properties	[serialization.format=1]	
    Partition Provider	Catalog	
    Time taken: 1.241 seconds, Fetched 70 row(s)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22489][SQL] Wrong Hive table statistics may trigg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    **[Test build #84255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84255/testReport)** for PR 19831 at commit [`b16f88e`](https://github.com/apache/spark/commit/b16f88ef971040e682fafe28f0ff06877814e3df).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19831: [SPARK-22626][SQL] It deals with wrong Hive's sta...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19831


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] It deals with wrong Hive's statistics...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84394/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    > Besides, if the size stats totalSize or rawDataSize is wrong, the problem exists whether CBO is enabled or not.
    
    > If CBO enabled, the outputRowCount == 0, the getOutputSize is 1, sizeInBytes is 1 and this side can broadcast:
    If CBO disabled, the sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize and this side cann't broadcast:
    
    @wangyum `totalSize or rawDataSize` can also be wrong, right?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    **[Test build #84255 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84255/testReport)** for PR 19831 at commit [`b16f88e`](https://github.com/apache/spark/commit/b16f88ef971040e682fafe28f0ff06877814e3df).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Instead of manually setting up table statistics, I'm just trying to simulate the statistics for these tables by this way. 
    If `totalSize (or rawDataSize) > 0` and `rowCount = 0`, at least one parameter is incorrect, and  should not be optimized based on these incorrect statistics.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19831#discussion_r153677300
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
           // Note that this statistics could be overridden by Spark's statistics if that's available.
           val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
           val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
    -      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
    +      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
    --- End diff --
    
    Hive has a flag called `StatsSetupConst.COLUMN_STATS_ACCURATE`. If I remember correctly, this flag will become **false** if user changes table properties or table data. Can you check if the flag exists in your case? If so, we can use the flag to decide whether to read statistics from Hive.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Besides, if the size stats `totalSize` or `rawDataSize` is wrong, the problem also exists whether CBO is enabled or not. We need to change that in the title too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Since Hive can't protect user to set a wrong stats properties, I think this solution can alleviate the problem. Besides, it's consistent with what we do for `totalSize and rawDataSize` (only use the stats when > 0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] It deals with wrong Hive's statistics...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] It deals with wrong Hive's statistics...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    **[Test build #84394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84394/testReport)** for PR 19831 at commit [`5b744e3`](https://github.com/apache/spark/commit/5b744e392a3c6cd2b575f85f10db34826303a2e5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84259/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    **[Test build #84259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84259/testReport)** for PR 19831 at commit [`5c43b2a`](https://github.com/apache/spark/commit/5c43b2a4e43daa5e6936d2bb8a899b31e9a97d7f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    BTW, the case here is not about join reorder, it's actually about broadcast decision. Could you update the title of this PR?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] It deals with wrong Hive's statistics...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    **[Test build #84394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84394/testReport)** for PR 19831 at commit [`5b744e3`](https://github.com/apache/spark/commit/5b744e392a3c6cd2b575f85f10db34826303a2e5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19831#discussion_r154499581
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala ---
    @@ -1187,6 +1187,22 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
           }
         }
       }
    +
    +  test("Wrong Hive table statistics may trigger OOM if enables join reorder in CBO") {
    --- End diff --
    
    IMHO you can just test the read logic for Hive's stats properties in `StatisticsSuite`, instead of a end-to-end test case, developers may not know what's going on by this test case.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84255/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Is it really an issue? If you manually set a wrong statistics, how would you expect the system to do? I think data source tables don't allow you set the statistics manually, so this problem is inherited from Hive. cc @wzhfy to confirm.
    
    This PR treats 0 row count as invalid, which is arguable, i.e. if we analyze an empty table, and then the 0 row count is valid.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19831#discussion_r153677676
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
           // Note that this statistics could be overridden by Spark's statistics if that's available.
           val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
           val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
    -      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
    +      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
    --- End diff --
    
    The root problem is that user can set "wrong" table properties. So if we want to prevent using wrong stats, we need to detect changes in properties. Otherwise your case can't be avoided.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    @cloud-fan Yes, Spark doesn't allow user to set (Spark's) statistics manually.
    
    This PR treats 0 row count of **Hive's stats**, it doesn't affect the logic for Spark's stats. Besides,  Spark currently only use Hive's `totalSize` and `rawDataSize` when they are > 0. This PR changes the behavior for `rowCount` to be consistent with that, so I think it's fine. But the title of the PR should be more specific, i.e. it deals with wrong Hive's statistics (zero rowCount).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] It deals with wrong Hive's statistics...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/19831
  
    If CBO enabled,  the [`outputRowCount == 0`](https://github.com/apache/spark/pull/19831#L67), the [`getOutputSize`](https://github.com/apache/spark/pull/19831#L60) is 1, `sizeInBytes` is 1 and this side can broadcast:
    
    https://github.com/apache/spark/blob/b803b66a8133f705463039325ee71ee6827ce1a7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala#L65-L88
    
    https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L45-L64
    ---
    If CBO disabled,  the ` sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize` and this side cann't broadcast:
    https://github.com/apache/spark/blob/ae253e5a878a0adc2785ae050c49022687ac1d06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L30-L49


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19831#discussion_r154245570
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
           // Note that this statistics could be overridden by Spark's statistics if that's available.
           val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
           val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
    -      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
    +      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
    --- End diff --
    
    Maybe this could be more clear:
    ```scala
    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
    val stats =
      if (totalSize.isDefined && totalSize.get > 0L) {
        Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
      } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
        Some(CatalogStatistics(sizeInBytes = rawDataSize.get, rowCount = rowCount.filter(_ > 0)))
      } else {
        None
      }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org