Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/12 19:21:29 UTC

[GitHub] [arrow-datafusion] isidentical opened a new issue, #3813: Never fallback to cartesian product for join estimation when we know the min/max values for columns

isidentical opened a new issue, #3813:
URL: https://github.com/apache/arrow-datafusion/issues/3813

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   `distinct_count` is usually expensive to compute, so some platforms that write Parquet files omit it from the metadata section. We should be able to estimate the join cardinality without it before falling back to the cartesian product.
   
   **Describe the solution you'd like**
   Since we already require min/max values to be present, we should be able to compute `min(num_left_rows - (num_nulls or 0), scalar_range(left_stats.min, left_stats.max))` as an alternative distinct count, treating a missing `num_nulls` as 0.
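   
   A minimal sketch of that fallback, assuming integer-valued columns. The `ColumnStats` struct, its field names, and the `estimate_distinct_count` helper are hypothetical and only illustrate the formula; they are not DataFusion's actual statistics API. `scalar_range` is assumed here to mean the number of integers in the closed interval `[min, max]`.
   
   ```rust
   /// Hypothetical per-column statistics (illustrative names, not DataFusion's API).
   #[derive(Debug, Clone, Copy)]
   struct ColumnStats {
       min: i64,
       max: i64,
       null_count: Option<u64>,
       distinct_count: Option<u64>,
   }
   
   /// Number of distinct integers in the closed interval [min, max].
   /// Returns None for an inverted range; saturates at u64::MAX on overflow.
   fn scalar_range(min: i64, max: i64) -> Option<u64> {
       if max < min {
           return None;
       }
       Some(max.abs_diff(min).saturating_add(1))
   }
   
   /// Use the exact distinct count when the metadata has one; otherwise bound it
   /// by min(non-null rows, width of the value range).
   fn estimate_distinct_count(num_rows: u64, stats: &ColumnStats) -> Option<u64> {
       if let Some(exact) = stats.distinct_count {
           return Some(exact);
       }
       // A missing null count is treated as zero nulls.
       let non_null_rows = num_rows.saturating_sub(stats.null_count.unwrap_or(0));
       let range = scalar_range(stats.min, stats.max)?;
       Some(non_null_rows.min(range))
   }
   ```
   
   The result is an upper bound rather than an exact count: a column cannot hold more distinct values than it has non-null rows, and an integer column cannot hold more distinct values than its range admits.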
   
   **Describe alternatives you've considered**
   None.
   
   **Additional context**
   Original discussion was here https://github.com/apache/arrow-datafusion/pull/3787#discussion_r992751749
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #3813: Never fallback to cartesian product for join estimation when we know the min/max values for columns

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #3813:
URL: https://github.com/apache/arrow-datafusion/issues/3813#issuecomment-1276643214

   I think it might be better to give up in that case?
   
   There is also this presentation about optimizing join order when no statistics are available (an approach that also seems to work well for DuckDB). We could see whether we can reuse some of these ideas:
   
   https://www.youtube.com/watch?v=aNRoR0Z3SzU

