You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/01/25 02:57:55 UTC

[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/16696

    [SPARK-19350] [SQL] Cardinality estimation of Limit and Sample

    ## What changes were proposed in this pull request?
    
    Before this pr, LocalLimit/GlobalLimit/Sample propagates the same row count and column stats from its child, which is incorrect.
    We can get the correct rowCount in Statistics for Limit/Sample whether cbo is enabled or not. Column stats should not be propagated because we don't know the distribution of columns after Limit or Sample.
    
    ## How was this patch tested?
    
    Added test cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark limitEstimation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16696.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16696
    
----
commit 62013f5c5bc1a6d43e1b111aa3784b4524c8fda4
Author: wangzhenhua <wa...@huawei.com>
Date:   2017-01-25T01:00:08Z

    limit and sample estimation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #73824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73824/testReport)** for PR 16696 at commit [`5692939`](https://github.com/apache/spark/commit/56929391719053e72791abe127b10a3316b51141).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    @cloud-fan Does this look good to you now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104281417
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNo
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    +    val childStats = child.stats(conf)
    +    if (limit == 0) {
           // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
           // (product of children).
    -      1
    +      Statistics(
    +        sizeInBytes = 1,
    +        rowCount = Some(0),
    +        isBroadcastable = childStats.isBroadcastable)
         } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    +      // The output row count of LocalLimit should be the sum of row count from each partition, but
    +      // since the partition number is not available here, we just use statistics of the child
    +      // except column stats, because we don't know the distribution after a limit operation
    --- End diff --
    
    A loose bound can lead to significant under-estimation. E.g. a > 50, after limit the actual range is [40, 60], while max and min in stats are still [0, 60], then the filter factor will be 1/6 instead of 1/2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104101031
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.statsEstimation
    +
    +import org.apache.spark.sql.catalyst.CatalystConf
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
    +import org.apache.spark.sql.types.IntegerType
    +
    +
    +class StatsEstimationSuite extends StatsEstimationTestBase {
    --- End diff --
    
    `BasicStatsEstimationSuite`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104101253
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -116,22 +116,22 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
         withTempView("test") {
    --- End diff --
    
    is this test duplicated with the newly added limit test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #73372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73372/testReport)** for PR 16696 at commit [`5692939`](https://github.com/apache/spark/commit/56929391719053e72791abe127b10a3316b51141).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97932867
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -18,12 +18,41 @@
     package org.apache.spark.sql.catalyst.statsEstimation
     
     import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
     import org.apache.spark.sql.types.IntegerType
     
     
    -class StatsConfSuite extends StatsEstimationTestBase {
    +class StatsEstimationSuite extends StatsEstimationTestBase {
    +  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
    +    nullCount = 0, avgLen = 4, maxLen = 4))
    +
    +  val plan = StatsTestPlan(
    +    outputList = Seq(ar),
    +    attributeStats = AttributeMap(Seq(ar -> colStat)),
    +    rowCount = 10,
    +    size = Some(10 * (8 + 4)))
    --- End diff --
    
    I still prefer to adding a comment above this line:
    ```
          // rowCount * (overhead + column size)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #71964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71964/testReport)** for PR 16696 at commit [`b88fac5`](https://github.com/apache/spark/commit/b88fac58331b5fbbae83ac5cb8ba37d1bbb76b4c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class LimitNode extends UnaryNode `
      * `case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends LimitNode`
      * `case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends LimitNode`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104550347
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNo
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    +    val childStats = child.stats(conf)
    +    if (limit == 0) {
           // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
           // (product of children).
    -      1
    +      Statistics(
    +        sizeInBytes = 1,
    +        rowCount = Some(0),
    +        isBroadcastable = childStats.isBroadcastable)
         } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    +      // The output row count of LocalLimit should be the sum of row count from each partition, but
    +      // since the partition number is not available here, we just use statistics of the child
    +      // except column stats, because we don't know the distribution after a limit operation
    +      child.stats(conf).copy(attributeStats = AttributeMap(Nil))
    --- End diff --
    
    Nit: `childStats.copy(attributeStats = AttributeMap(Nil))`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16696


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #71988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71988/testReport)** for PR 16696 at commit [`05fcd81`](https://github.com/apache/spark/commit/05fcd8176bdf43677892dc35247ee14a9a155ba8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97932053
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    -      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    -      // (product of children).
    -      1
    -    } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    -    }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val childStats = child.stats(conf)
    +    // Don't propagate column stats, because we don't know the distribution after a limit operation
    +    Statistics(
    +      sizeInBytes = EstimationUtils.getOutputSize(output, limit, childStats.attributeStats),
    --- End diff --
    
    I think @wzhfy is just keeping the existing code logics. Sure, we can improve it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71988/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104281282
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNo
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    +    val childStats = child.stats(conf)
    +    if (limit == 0) {
           // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
           // (product of children).
    -      1
    +      Statistics(
    +        sizeInBytes = 1,
    +        rowCount = Some(0),
    +        isBroadcastable = childStats.isBroadcastable)
         } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    +      // The output row count of LocalLimit should be the sum of row count from each partition, but
    +      // since the partition number is not available here, we just use statistics of the child
    +      // except column stats, because we don't know the distribution after a limit operation
    --- End diff --
    
    hmm, what's the strategy here? is a loose bound better than nothing?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104116975
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala ---
    @@ -1,64 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.catalyst.statsEstimation
    -
    -import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    -import org.apache.spark.sql.types.IntegerType
    -
    -
    -class StatsConfSuite extends StatsEstimationTestBase {
    --- End diff --
    
    I didn't remove it, just renamed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104116520
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.statsEstimation
    +
    +import org.apache.spark.sql.catalyst.CatalystConf
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
    +import org.apache.spark.sql.types.IntegerType
    +
    +
    +class StatsEstimationSuite extends StatsEstimationTestBase {
    --- End diff --
    
    Good name:)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104550722
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala ---
    @@ -1,64 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.catalyst.statsEstimation
    -
    -import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    -import org.apache.spark.sql.types.IntegerType
    -
    -
    -class StatsConfSuite extends StatsEstimationTestBase {
    --- End diff --
    
    yes. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    LGTM except minor comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104280684
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala ---
    @@ -1,64 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.catalyst.statsEstimation
    -
    -import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    -import org.apache.spark.sql.types.IntegerType
    -
    -
    -class StatsConfSuite extends StatsEstimationTestBase {
    --- End diff --
    
    How to use `git mv` now? Do I need to revert to the unchanged version, and `git mv`, and then do all the changes all over again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71964/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97933132
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -48,6 +77,14 @@ class StatsConfSuite extends StatsEstimationTestBase {
         // Return the simple statistics
         assert(plan.stats(conf.copy(cboEnabled = false)) == expectedDefaultStats)
       }
    +
    +  /** Check estimated stats which is the same when cbo is turned on/off. */
    +  private def checkStats(plan: LogicalPlan, expected: Statistics): Unit = {
    --- End diff --
    
    You know, this is a utility function. We can make it more general by having two expected stats values


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r98146358
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    -      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    -      // (product of children).
    -      1
    -    } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    -    }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val childStats = child.stats(conf)
    +    // Don't propagate column stats, because we don't know the distribution after a limit operation
    +    Statistics(
    +      sizeInBytes = EstimationUtils.getOutputSize(output, limit, childStats.attributeStats),
    --- End diff --
    
    We should. Otherwise the `rowCount` is not correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by ron8hu <gi...@git.apache.org>.
Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r99474494
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -18,12 +18,41 @@
     package org.apache.spark.sql.catalyst.statsEstimation
     
     import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
     import org.apache.spark.sql.types.IntegerType
     
     
    -class StatsConfSuite extends StatsEstimationTestBase {
    +class StatsEstimationSuite extends StatsEstimationTestBase {
    +  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
    +    nullCount = 0, avgLen = 4, maxLen = 4))
    +
    +  val plan = StatsTestPlan(
    +    outputList = Seq(ar),
    +    attributeStats = AttributeMap(Seq(ar -> colStat)),
    +    rowCount = 10,
    +    size = Some(10 * (8 + 4)))
    +
    +  test("limit estimation") {
    +    val localLimit = LocalLimit(Literal(2), plan)
    +    val globalLimit = GlobalLimit(Literal(2), plan)
    +    // LocalLimit and GlobalLimit share the same stats estimation logic.
    +    val expected = Statistics(sizeInBytes = 24, rowCount = Some(2))
    +    checkStats(localLimit, expected)
    +    checkStats(globalLimit, expected)
    +  }
    +
    +  test("sample estimation") {
    +    val sample = Sample(0.0, 0.5, withReplacement = false, (math.random * 1000).toLong, plan)()
    +    checkStats(sample, expected = Statistics(sizeInBytes = 60, rowCount = Some(5)))
    +
    +    // Test if Sample's child doesn't have rowCount in stats
    +    val stats2 = Statistics(sizeInBytes = 120)
    --- End diff --
    
    For limit estimation test cases, we may add a test with limit number greater than a child node's row count.  This test can show if we properly select the smaller value between limit number child node's row count. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104526617
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.statsEstimation
    +
    +import org.apache.spark.sql.catalyst.CatalystConf
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
    +import org.apache.spark.sql.types.IntegerType
    +
    +
    +class BasicStatsEstimationSuite extends StatsEstimationTestBase {
    +  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
    --- End diff --
    
    nit:
    ```
    val attr = ...
    vak colStat = ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97932007
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    --- End diff --
    
    To make the stats more accurate, yes, we can use a smaller number between `childStats.rowCounts` and `limit` as `outputRowCount` of `getOutputSize`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73824/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by ron8hu <gi...@git.apache.org>.
Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r98776491
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    -      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    -      // (product of children).
    -      1
    -    } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    -    }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val childStats = child.stats(conf)
    +    // Don't propagate column stats, because we don't know the distribution after a limit operation
    +    Statistics(
    +      sizeInBytes = EstimationUtils.getOutputSize(output, limit, childStats.attributeStats),
    --- End diff --
    
    Agreed.  We can pick the smaller value between the child node's row count and the limit number. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73391/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #74060 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74060/testReport)** for PR 16696 at commit [`0c42ea2`](https://github.com/apache/spark/commit/0c42ea21b4a0756236789853092dbf0fbfa72d8a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #71964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71964/testReport)** for PR 16696 at commit [`b88fac5`](https://github.com/apache/spark/commit/b88fac58331b5fbbae83ac5cb8ba37d1bbb76b4c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Overall looks good to me. : ) Could you add a few more test cases? 
    
    - One is the child has less row counts than the limit. 
    - Another is having zero row counts but `sizeInBytes` is not zero.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    @cloud-fan @gatorsmile I've updated this pr and also added test cases, please review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104100931
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala ---
    @@ -1,64 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.catalyst.statsEstimation
    -
    -import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    -import org.apache.spark.sql.types.IntegerType
    -
    -
    -class StatsConfSuite extends StatsEstimationTestBase {
    --- End diff --
    
    why remove this test suite?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73372/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #73391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73391/testReport)** for PR 16696 at commit [`5692939`](https://github.com/apache/spark/commit/56929391719053e72791abe127b10a3316b51141).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97933241
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -18,12 +18,41 @@
     package org.apache.spark.sql.catalyst.statsEstimation
     
     import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
     import org.apache.spark.sql.types.IntegerType
     
     
    -class StatsConfSuite extends StatsEstimationTestBase {
    +class StatsEstimationSuite extends StatsEstimationTestBase {
    +  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
    +    nullCount = 0, avgLen = 4, maxLen = 4))
    +
    +  val plan = StatsTestPlan(
    +    outputList = Seq(ar),
    +    attributeStats = AttributeMap(Seq(ar -> colStat)),
    +    rowCount = 10,
    +    size = Some(10 * (8 + 4)))
    +
    +  test("limit estimation") {
    +    val localLimit = LocalLimit(Literal(2), plan)
    +    val globalLimit = GlobalLimit(Literal(2), plan)
    +    // LocalLimit and GlobalLimit share the same stats estimation logic.
    +    val expected = Statistics(sizeInBytes = 24, rowCount = Some(2))
    +    checkStats(localLimit, expected)
    +    checkStats(globalLimit, expected)
    +  }
    +
    +  test("sample estimation") {
    +    val sample = Sample(0.0, 0.5, withReplacement = false, (math.random * 1000).toLong, plan)()
    +    checkStats(sample, expected = Statistics(sizeInBytes = 60, rowCount = Some(5)))
    +
    +    // Test if Sample's child doesn't have rowCount in stats
    +    val stats2 = Statistics(sizeInBytes = 120)
    --- End diff --
    
    The same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104280554
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNo
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    +    val childStats = child.stats(conf)
    +    if (limit == 0) {
           // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
           // (product of children).
    -      1
    +      Statistics(
    +        sizeInBytes = 1,
    +        rowCount = Some(0),
    +        isBroadcastable = childStats.isBroadcastable)
         } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    +      // The output row count of LocalLimit should be the sum of row count from each partition, but
    +      // since the partition number is not available here, we just use statistics of the child
    +      // except column stats, because we don't know the distribution after a limit operation
    --- End diff --
    
    How can we make sure max/min values are still there after limit? Otherwise it will be a very loose bound of max/min.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97933222
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -18,12 +18,41 @@
     package org.apache.spark.sql.catalyst.statsEstimation
     
     import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Literal}
    +import org.apache.spark.sql.catalyst.plans.logical._
     import org.apache.spark.sql.types.IntegerType
     
     
    -class StatsConfSuite extends StatsEstimationTestBase {
    +class StatsEstimationSuite extends StatsEstimationTestBase {
    +  val (ar, colStat) = (attr("key"), ColumnStat(distinctCount = 10, min = Some(1), max = Some(10),
    +    nullCount = 0, avgLen = 4, maxLen = 4))
    +
    +  val plan = StatsTestPlan(
    +    outputList = Seq(ar),
    +    attributeStats = AttributeMap(Seq(ar -> colStat)),
    +    rowCount = 10,
    +    size = Some(10 * (8 + 4)))
    +
    +  test("limit estimation") {
    +    val localLimit = LocalLimit(Literal(2), plan)
    +    val globalLimit = GlobalLimit(Literal(2), plan)
    +    // LocalLimit and GlobalLimit share the same stats estimation logic.
    +    val expected = Statistics(sizeInBytes = 24, rowCount = Some(2))
    +    checkStats(localLimit, expected)
    +    checkStats(globalLimit, expected)
    +  }
    +
    +  test("sample estimation") {
    +    val sample = Sample(0.0, 0.5, withReplacement = false, (math.random * 1000).toLong, plan)()
    +    checkStats(sample, expected = Statistics(sizeInBytes = 60, rowCount = Some(5)))
    +
    +    // Test if Sample's child doesn't have rowCount in stats
    +    val stats2 = Statistics(sizeInBytes = 120)
    +    val plan2 = DummyLogicalPlan(stats2, stats2)
    --- End diff --
    
    rename `plan2` to `childPlan`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #73845 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73845/testReport)** for PR 16696 at commit [`516b114`](https://github.com/apache/spark/commit/516b11468c0c7bb63ce37edf69cab01519f97299).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class BasicStatsEstimationSuite extends StatsEstimationTestBase `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r99838684
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    -      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    -      // (product of children).
    -      1
    -    } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    -    }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val childStats = child.stats(conf)
    +    // Don't propagate column stats, because we don't know the distribution after a limit operation
    +    Statistics(
    +      sizeInBytes = EstimationUtils.getOutputSize(output, limit, childStats.attributeStats),
    +      rowCount = Some(limit),
    --- End diff --
    
    This is a good point, maybe we should still separate the stats calculation of `GlobalLimit` and `LocalLimit`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #71963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71963/testReport)** for PR 16696 at commit [`62013f5`](https://github.com/apache/spark/commit/62013f5c5bc1a6d43e1b111aa3784b4524c8fda4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71963/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #71963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71963/testReport)** for PR 16696 at commit [`62013f5`](https://github.com/apache/spark/commit/62013f5c5bc1a6d43e1b111aa3784b4524c8fda4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104223997
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala ---
    @@ -1,64 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.catalyst.statsEstimation
    -
    -import org.apache.spark.sql.catalyst.CatalystConf
    -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
    -import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, LogicalPlan, Statistics}
    -import org.apache.spark.sql.types.IntegerType
    -
    -
    -class StatsConfSuite extends StatsEstimationTestBase {
    --- End diff --
    
    Can you use `git mv`? Then, it will keep the change history. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97931646
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -791,12 +773,14 @@ case class Sample(
     
       override def computeStats(conf: CatalystConf): Statistics = {
         val ratio = upperBound - lowerBound
    -    // BigInt can't multiply with Double
    -    var sizeInBytes = child.stats(conf).sizeInBytes * (ratio * 100).toInt / 100
    +    val childStats = child.stats(conf)
    +    var sizeInBytes = EstimationUtils.ceil(BigDecimal(childStats.sizeInBytes) * ratio)
         if (sizeInBytes == 0) {
           sizeInBytes = 1
         }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val sampledNumber = childStats.rowCount.map(c => EstimationUtils.ceil(BigDecimal(c) * ratio))
    --- End diff --
    
    `sampledNumber` -> `sampledRowCount`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #74060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74060/testReport)** for PR 16696 at commit [`0c42ea2`](https://github.com/apache/spark/commit/0c42ea21b4a0756236789853092dbf0fbfa72d8a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74060/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #73845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73845/testReport)** for PR 16696 at commit [`516b114`](https://github.com/apache/spark/commit/516b11468c0c7bb63ce37edf69cab01519f97299).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104550119
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -752,14 +752,14 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    -      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    -      // (product of children).
    -      1
    -    } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    -    }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val childStats = child.stats(conf)
    +    val rowCount: BigInt =
    +      if (childStats.rowCount.isDefined) childStats.rowCount.get.min(limit) else limit
    --- End diff --
    
    Nit: `val rowCount: BigInt = childStats.rowCount.map(_.min(limit)).getOrElse(limit)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97933053
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala ---
    @@ -48,6 +77,14 @@ class StatsConfSuite extends StatsEstimationTestBase {
         // Return the simple statistics
         assert(plan.stats(conf.copy(cboEnabled = false)) == expectedDefaultStats)
    --- End diff --
    
    Could you replace the above three lines by `checkStats`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #73391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73391/testReport)** for PR 16696 at commit [`5692939`](https://github.com/apache/spark/commit/56929391719053e72791abe127b10a3316b51141).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r98002931
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    -      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    -      // (product of children).
    -      1
    -    } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    -    }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val childStats = child.stats(conf)
    +    // Don't propagate column stats, because we don't know the distribution after a limit operation
    +    Statistics(
    +      sizeInBytes = EstimationUtils.getOutputSize(output, limit, childStats.attributeStats),
    +      rowCount = Some(limit),
    --- End diff --
    
    Actually the `rowCount` for `LocalLimit` and `GlobalLimit` should be different. For `LocalLimit`, `limit` is just the row count for one partition. But we can't get the number of partitions here, I think. As the actual row number might be quite bigger than the `limit`, maybe we should set it as None.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #71988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71988/testReport)** for PR 16696 at commit [`05fcd81`](https://github.com/apache/spark/commit/05fcd8176bdf43677892dc35247ee14a9a155ba8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97928605
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -727,37 +728,18 @@ case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryN
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    -      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    -      // (product of children).
    -      1
    -    } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    -    }
    -    child.stats(conf).copy(sizeInBytes = sizeInBytes)
    +    val childStats = child.stats(conf)
    +    // Don't propagate column stats, because we don't know the distribution after a limit operation
    +    Statistics(
    +      sizeInBytes = EstimationUtils.getOutputSize(output, limit, childStats.attributeStats),
    --- End diff --
    
    Why don't we use `childStats.rowCount`? If `childStats.rowCount` is less than limit number, I think we should use it instead of limit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    **[Test build #73372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73372/testReport)** for PR 16696 at commit [`5692939`](https://github.com/apache/spark/commit/56929391719053e72791abe127b10a3316b51141).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r97931570
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -29,6 +31,8 @@ object EstimationUtils {
       def rowCountsExist(conf: CatalystConf, plans: LogicalPlan*): Boolean =
         plans.forall(_.stats(conf).rowCount.isDefined)
     
    +  def ceil(bigDecimal: BigDecimal): BigInt = bigDecimal.setScale(0, RoundingMode.CEILING).toBigInt()
    --- End diff --
    
    `ceil` -> `ceiling`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Sure, will review it tonight.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    LGTM except one minor comment


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73845/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104116400
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
    @@ -116,22 +116,22 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
         withTempView("test") {
    --- End diff --
    
    yea I think so


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    cc @cloud-fan @gatorsmile please review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r99837596
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -717,7 +717,8 @@ object Limit {
       }
     }
     
    -case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNode {
    +abstract class LimitNode extends UnaryNode {
    +  def limitExpr: Expression
    --- End diff --
    
    do we assume `limitExpr` is foldable? but seems there is no type checking logic for it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104546713
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNo
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    +    val childStats = child.stats(conf)
    +    if (limit == 0) {
           // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
           // (product of children).
    -      1
    +      Statistics(
    +        sizeInBytes = 1,
    +        rowCount = Some(0),
    +        isBroadcastable = childStats.isBroadcastable)
         } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    +      // The output row count of LocalLimit should be the sum of row count from each partition, but
    +      // since the partition number is not available here, we just use statistics of the child
    +      // except column stats, because we don't know the distribution after a limit operation
    --- End diff --
    
    Nit: the whole sentence does not have a period. How about rewriting it like?
    > The output row count of LocalLimit should be the sum of row counts from each partition. However, since the number of partitions is not available here, we just use statistics of the child. Because the distirubtion after a limit operation is unknown, we do not propapage the column stats.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16696: [SPARK-19350] [SQL] Cardinality estimation of Lim...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16696#discussion_r104232961
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -773,14 +773,20 @@ case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends UnaryNo
       }
       override def computeStats(conf: CatalystConf): Statistics = {
         val limit = limitExpr.eval().asInstanceOf[Int]
    -    val sizeInBytes = if (limit == 0) {
    +    val childStats = child.stats(conf)
    +    if (limit == 0) {
           // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
           // (product of children).
    -      1
    +      Statistics(
    +        sizeInBytes = 1,
    +        rowCount = Some(0),
    +        isBroadcastable = childStats.isBroadcastable)
         } else {
    -      (limit: Long) * output.map(a => a.dataType.defaultSize).sum
    +      // The output row count of LocalLimit should be the sum of row count from each partition, but
    +      // since the partition number is not available here, we just use statistics of the child
    +      // except column stats, because we don't know the distribution after a limit operation
    --- End diff --
    
    but I think the max/min should still be corrected?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16696
  
    Thanks! Merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org