Posted to github@arrow.apache.org by "ozankabak (via GitHub)" <gi...@apache.org> on 2023/04/03 17:28:53 UTC

[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #5754: Improving optimizer performance by eliminating unnecessary sort and distribution passes, add more SymmetricHashJoin improvements

ozankabak commented on code in PR #5754:
URL: https://github.com/apache/arrow-datafusion/pull/5754#discussion_r1156250272


##########
datafusion/common/src/config.rs:
##########
@@ -280,6 +280,10 @@ config_namespace! {
         /// using the provided `target_partitions` level
         pub repartition_joins: bool, default = true
 
+        /// Should DataFusion allow symmetric hash joins for unbounded data sources even when
+        /// its inputs do not have any ordering or filtering
+        pub allow_symmetric_joins_without_pruning: bool, default = true

Review Comment:
   SHJ will always produce correct results, but without pruning it will use twice as much memory as a regular hash join (assuming the inputs are the same size) for no gain except pipelining.
   
   Some more explanation about this option: it is not always possible to detect with 100% accuracy whether pruning can occur. The system may conclude that pruning is not possible when it actually is. Therefore, one would enable this option if one has a priori knowledge that the data would indeed lend itself to pruning.
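
   To make the memory remark concrete, here is a toy sketch in plain Rust. It is not DataFusion's implementation: `Row`, `hash_join`, and `symmetric_hash_join` are hypothetical names used only for illustration, and "memory" is approximated by the number of rows held in hash tables. A classic hash join buffers only the build side, while a symmetric hash join that never prunes ends up buffering both sides.

```rust
use std::collections::HashMap;

/// A row is (join key, payload). Hypothetical type for this sketch.
type Row = (u32, u32);

/// Classic hash join: buffer the whole build side, then stream the probe side.
/// Returns (matched pairs, rows buffered in hash tables).
fn hash_join(build: &[Row], probe: &[Row]) -> (usize, usize) {
    let mut table: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(key, payload) in build {
        table.entry(key).or_default().push(payload);
    }
    let buffered = build.len();

    let mut matches = 0;
    for &(key, _) in probe {
        // Probe rows are never stored; they only look up the build table.
        matches += table.get(&key).map_or(0, |rows| rows.len());
    }
    (matches, buffered)
}

/// Symmetric hash join without pruning: rows from both sides arrive
/// interleaved. Each row first probes the other side's table, then is
/// inserted into its own table and stays there, since nothing is evicted.
/// Returns (matched pairs, rows buffered in hash tables).
fn symmetric_hash_join(left: &[Row], right: &[Row]) -> (usize, usize) {
    let mut left_table: HashMap<u32, Vec<u32>> = HashMap::new();
    let mut right_table: HashMap<u32, Vec<u32>> = HashMap::new();
    let mut matches = 0;
    let mut buffered = 0;

    for i in 0..left.len().max(right.len()) {
        if let Some(&(key, payload)) = left.get(i) {
            matches += right_table.get(&key).map_or(0, |rows| rows.len());
            left_table.entry(key).or_default().push(payload);
            buffered += 1;
        }
        if let Some(&(key, payload)) = right.get(i) {
            matches += left_table.get(&key).map_or(0, |rows| rows.len());
            right_table.entry(key).or_default().push(payload);
            buffered += 1;
        }
    }
    (matches, buffered)
}
```

   For two inputs of three rows each, both functions report the same number of matches, but `hash_join` buffers 3 rows while `symmetric_hash_join` buffers 6. The symmetric variant can emit matches before either input is exhausted, which is the pipelining gain mentioned above.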


