Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/25 11:43:25 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new issue, #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

mingmwang opened a new issue, #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372

   **Describe the bug**
   https://github.com/apache/arrow-datafusion/blob/58b43f5c0b629be49a3efa0e37052ec51d9ba3fe/datafusion/core/src/physical_plan/filter.rs#L173-L193
   
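   When the predicate selectivity cannot be calculated, `FilterExec` should return `input_stats` instead of empty (unknown) statistics.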
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1327404807

   Also, is there any reason that `total_byte_size` is not adjusted according to the computed predicate selectivity?
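
To make the question concrete, here is a minimal sketch of scaling both the row count and the byte size by one selectivity estimate. The `Stats` struct and `scale_by_selectivity` function are hypothetical stand-ins for illustration, not DataFusion's actual types or code, and the sketch assumes the average row width is unchanged by the filter.

```rust
/// Simplified, hypothetical stand-in for a plan node's statistics.
/// `None` means "unknown".
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
}

/// Scale a single known quantity by a selectivity, rounding to the nearest integer.
fn scale(value: usize, selectivity: f64) -> usize {
    ((value as f64) * selectivity).round() as usize
}

/// Apply a selectivity in [0, 1] to both the row count and the byte size.
/// Assumption: the average row width is unchanged by the filter, so the
/// byte size shrinks by the same factor as the row count.
pub fn scale_by_selectivity(input: &Stats, selectivity: f64) -> Stats {
    // Guard against estimates that fall outside the valid range.
    let s = selectivity.clamp(0.0, 1.0);
    Stats {
        num_rows: input.num_rows.map(|v| scale(v, s)),
        total_byte_size: input.total_byte_size.map(|v| scale(v, s)),
    }
}
```

Unknown inputs stay unknown here: a `None` field is never turned into a number.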




[GitHub] [arrow-datafusion] mingmwang commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1327375234

   @isidentical 




[GitHub] [arrow-datafusion] mingmwang commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1327403395

   I think it is better to propagate it, or we can have a default predicate selectivity (many other DB systems take this approach).
   Otherwise the statistics estimation system will be very brittle: higher-level operators cannot derive their stats.
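
The three options in this thread can be sketched side by side as follows. All names here (`UnknownSelectivityPolicy`, `estimate_output_rows`) are hypothetical and exist only for illustration; this is not how `FilterExec` is implemented.

```rust
/// What to do when the filter's selectivity cannot be estimated.
/// Hypothetical type, not part of DataFusion.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum UnknownSelectivityPolicy {
    /// Give up: the output row count is unknown.
    ReturnUnknown,
    /// Pass the input row count through unchanged (selectivity 1.0).
    PropagateInput,
    /// Assume a fixed default selectivity, e.g. 0.9.
    AssumeDefault(f64),
}

/// Estimate the number of rows a filter emits.
/// `known_selectivity` is `Some` when the predicate could be analyzed.
pub fn estimate_output_rows(
    input_rows: Option<usize>,
    known_selectivity: Option<f64>,
    policy: UnknownSelectivityPolicy,
) -> Option<usize> {
    // Without an input row count there is nothing to scale.
    let rows = input_rows?;
    let selectivity = match (known_selectivity, policy) {
        (Some(s), _) => s,
        (None, UnknownSelectivityPolicy::ReturnUnknown) => return None,
        (None, UnknownSelectivityPolicy::PropagateInput) => 1.0,
        (None, UnknownSelectivityPolicy::AssumeDefault(d)) => d,
    };
    let s = selectivity.clamp(0.0, 1.0);
    Some(((rows as f64) * s).round() as usize)
}
```

With `ReturnUnknown`, a single unanalyzable filter leaves the row count unknown at that node, which is the brittleness described above. The other two policies keep an estimate flowing to parent operators, at the cost of accuracy.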




[GitHub] [arrow-datafusion] isidentical commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
isidentical commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1327428444

   > I think it is better to propagate it, or we can have a default predicate selectivity(many other DB systems take this approach).
   
   I'd personally lean towards not propagating, given how unreliable the input statistics are when we have no idea what the filter does to them. The second option (setting a default selectivity) is somewhat better, since it at least assumes the filter has some effect, and might be a viable alternative (I would love to hear the figures/conditions other DB systems use for this, and maybe we can do something similar).




[GitHub] [arrow-datafusion] mingmwang commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1331849333

   @jackwener 
   Do you know whether Doris has a default value for filter selectivity?




[GitHub] [arrow-datafusion] isidentical commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
isidentical commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1327428667

   > And is there any reason that the total_byte_size is not adjusted accordingly based on the computed predicate selectivity?
   
   Created #4374!




[GitHub] [arrow-datafusion] isidentical commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
isidentical commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1327386404

   Without being able to estimate the selectivity, falling back to assuming that the filter always selects every row feels worse than simply returning unknown statistics (which was the default even before we worked on the filter selectivity analysis). Is there a particular case where propagating `input_stats` would be useful?




[GitHub] [arrow-datafusion] mingmwang commented on issue #4372: FilterExec should not return Non Statistics when it can not calculate the predicate selectivity

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4372:
URL: https://github.com/apache/arrow-datafusion/issues/4372#issuecomment-1331871103

   @isidentical 
   
   In Presto/Trino, it looks like the default is 0.9:
   
   ````
   booleanProperty(
           DEFAULT_FILTER_FACTOR_ENABLED,
           "use a default filter factor for unknown filters in a filter node",
           optimizerConfig.isDefaultFilterFactorEnabled(),
           false),
   
   ````
   https://github.com/trinodb/trino/blob/0f71007ecb480384a9c443ba883f4bc4d896df83/core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java#L90-L93
   
   In SparkSQL, it looks like the default is 1.0:
   
   ````
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
       condition match {
         case And(cond1, cond2) =>
           val percent1 = calculateFilterSelectivity(cond1, update).getOrElse(1.0)
           val percent2 = calculateFilterSelectivity(cond2, update).getOrElse(1.0)
           Some(percent1 * percent2)
   
         case Or(cond1, cond2) =>
           val percent1 = calculateFilterSelectivity(cond1, update = false).getOrElse(1.0)
           val percent2 = calculateFilterSelectivity(cond2, update = false).getOrElse(1.0)
           Some(percent1 + percent2 - (percent1 * percent2))
   
         // Not-operator pushdown
         case Not(And(cond1, cond2)) =>
           calculateFilterSelectivity(Or(Not(cond1), Not(cond2)), update = false)
   
         // Not-operator pushdown
         case Not(Or(cond1, cond2)) =>
           calculateFilterSelectivity(And(Not(cond1), Not(cond2)), update = false)
   
         // Collapse two consecutive Not operators which could be generated after Not-operator pushdown
         case Not(Not(cond)) =>
           calculateFilterSelectivity(cond, update = false)
   
         // The foldable Not has been processed in the ConstantFolding rule
         // This is a top-down traversal. The Not could be pushed down by the above two cases.
         case Not(l @ Literal(null, _)) =>
           calculateSingleCondition(l, update = false).map(boundProbability(_))
   
         case Not(cond) =>
           calculateFilterSelectivity(cond, update = false) match {
             case Some(percent) => Some(1.0 - percent)
             case None => None
           }
   
         case _ =>
           calculateSingleCondition(condition, update).map(boundProbability(_))
       }
     }
   ````
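
For readers more at home in Rust, the combination rules in the Spark snippet above can be sketched roughly as below. This is a simplified illustration, not a port: the `Predicate` enum and `selectivity` function are hypothetical, leaf estimation is replaced by a pre-supplied `Option<f64>`, and the `update` flag and null-literal case are omitted.

```rust
/// Hypothetical predicate tree. A leaf carries its estimated selectivity,
/// or `None` when it cannot be estimated.
#[derive(Debug, Clone)]
pub enum Predicate {
    Leaf(Option<f64>),
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

/// Combine selectivities, treating an unknown operand of AND/OR as 1.0.
pub fn selectivity(p: &Predicate) -> Option<f64> {
    match p {
        Predicate::Leaf(s) => s.map(|v| v.clamp(0.0, 1.0)),
        // Independence assumption: P(a AND b) = P(a) * P(b).
        Predicate::And(a, b) => {
            let pa = selectivity(a).unwrap_or(1.0);
            let pb = selectivity(b).unwrap_or(1.0);
            Some(pa * pb)
        }
        // Inclusion-exclusion: P(a OR b) = P(a) + P(b) - P(a) * P(b).
        Predicate::Or(a, b) => {
            let pa = selectivity(a).unwrap_or(1.0);
            let pb = selectivity(b).unwrap_or(1.0);
            Some(pa + pb - pa * pb)
        }
        Predicate::Not(inner) => match &**inner {
            // De Morgan: NOT (a AND b) = (NOT a) OR (NOT b).
            Predicate::And(a, b) => selectivity(&Predicate::Or(
                Box::new(Predicate::Not(a.clone())),
                Box::new(Predicate::Not(b.clone())),
            )),
            // De Morgan: NOT (a OR b) = (NOT a) AND (NOT b).
            Predicate::Or(a, b) => selectivity(&Predicate::And(
                Box::new(Predicate::Not(a.clone())),
                Box::new(Predicate::Not(b.clone())),
            )),
            // Double negation cancels out.
            Predicate::Not(c) => selectivity(c),
            // NOT of a leaf: complement if known, otherwise still unknown.
            leaf => selectivity(leaf).map(|v| 1.0 - v),
        },
    }
}
```

Note that a lone unknown leaf still yields `None` in this sketch. The 1.0 fallback only applies when an unknown operand sits inside an AND or OR.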
   

