You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by marmbrus <gi...@git.apache.org> on 2014/07/08 21:56:59 UTC

[GitHub] spark pull request: [SPARK-2393][SQL] Prototype implementation of ...

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1238#discussion_r14676669
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -26,6 +26,17 @@ import org.apache.spark.sql.catalyst.trees
     abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
       self: Product =>
     
    +  protected class Estimates {
    +    lazy val childrenEstimations = children.map(_.estimates)
    +    lazy val cardinality: Long = childrenEstimations.map(_.cardinality).sum
    +    lazy val sizeInBytes: Long = childrenEstimations.map(_.sizeInBytes).sum
    --- End diff --
    
    A few notes here:
     - I think the most important thing about any default values that we set is that they work together correctly.  For example, we probably want the default size for RDDs to be large enough that they do not get broadcast using the default value.  For this reason we probably want to have one central location where defaults are set (probably in SQLConf).
     - Since RDDs are immutable, it may be possible to augment things so that we gather simple statistics the first time the RDD is queried, perhaps using an accumulator.  This however does not need to be done in this PR.  I would make a note of it though.
     - Regarding joins: for inner joins if one side is actually zero then the join will yield no results.  For outer joins the maximum possible output cardinality is the cardinality of the non-empty relation.  However, zero is kind of a weird estimate as if it is wrong it will really throw things off.  For this reason maybe we should consider 1 to be the smallest possible cardinality estimate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---