Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/14 23:35:44 UTC

[GitHub] [spark] skambha commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals

URL: https://github.com/apache/spark/pull/24593#issuecomment-492448836
 
 
   Thanks everyone for your comments.  (@BryanCutler , @JoshRosen , @viirya, @gatorsmile , @mgaido91 , @dilipbiswal )  
   
   I'd like to summarize my understanding of the discussion so far, along with my comments. 
   
   In general, the optimization discussed here — evaluating a deterministic UDF only once when all of its inputs are literals — makes sense. 
   
   **Optimization:** 
   Two ways of implementing the optimization have been discussed: 
   1. A new optimizer rule, ‘DeterministicLiteralUDF’ (i.e., the changes in this PR). It can be disabled via the existing property spark.sql.optimizer.excludedRules.
   2. Make ScalaUDF foldable (as suggested by @JoshRosen), so that it is consumed by the existing ConstantFolding optimizer rule. A config property could also be used to enable or disable it.
   
   Having this as a separate optimizer rule lets users exclude it through the existing framework property spark.sql.optimizer.excludedRules.  
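   To make the two options concrete, here is a toy, self-contained sketch of the folding idea. All names in it (`Expr`, `Lit`, `Udf`, `constantFold`) are illustrative only; the real change would operate on Catalyst's `ScalaUDF` node and the `ConstantFolding` rule, not on these types:

   ```scala
   // Toy expression tree standing in for Catalyst expressions.
   sealed trait Expr
   case class Lit(value: Int) extends Expr
   case class Udf(f: Seq[Int] => Int, deterministic: Boolean, children: Seq[Expr]) extends Expr {
     // Mirrors option 2: a UDF node is foldable only when it is deterministic
     // and every one of its inputs is itself foldable.
     def foldable: Boolean = deterministic && children.forall {
       case _: Lit => true
       case u: Udf => u.foldable
     }
   }

   // Mirrors what ConstantFolding would do with a foldable UDF:
   // evaluate it once at optimization time and replace it with a literal.
   def constantFold(e: Expr): Expr = e match {
     case u @ Udf(f, _, children) if u.foldable =>
       Lit(f(children.map(constantFold).collect { case Lit(v) => v }))
     case other => other
   }
   ```

   For example, `constantFold(Udf(_.sum, true, Seq(Lit(1), Lit(2))))` reduces to `Lit(3)`, while the same tree with `deterministic = false` is left untouched.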
   
   **Concerns:** 
   1. Today, Spark UDFs are deterministic by **default**. Users may not be aware of this because the flag is set implicitly. 
   2. This optimization relies on the UDF's ‘deterministic’ property together with all inputs being literals. Since the property defaults to true, users may not fully understand the implications of their UDF being treated as deterministic, so we should not apply the optimization silently in that case.
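   The risk behind concern 2 can be illustrated with a small, self-contained sketch (plain Scala, not the Spark API): a UDF whose result actually varies per call, but which keeps the default deterministic flag, would be evaluated once and its single result reused for every row:

   ```scala
   import scala.util.Random

   // Hypothetical example: a UDF whose result varies per call, but which
   // the user has (implicitly) left flagged as deterministic.
   val rng = new Random(42)                       // seeded only for reproducibility
   val mislabeledUdf: () => Int = () => rng.nextInt(1000)

   // Without the optimization, the UDF body runs once per row:
   val perRow = Seq.fill(3)(mislabeledUdf())

   // With literal folding applied to a UDF flagged deterministic, it runs
   // once at optimization time and that single value is reused for all rows:
   val foldedValue = mislabeledUdf()
   val withFolding = Seq.fill(3)(foldedValue)
   ```

   `withFolding` always holds three copies of one value, whereas `perRow` will generally differ across rows — exactly the behavior change a user would hit if their UDF is not truly deterministic.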
   
   **To address the concerns:** 
   - For 1: I can open a new issue to discuss whether UDFs should be nondeterministic by default, so that users must consciously mark them deterministic. 
   
   - For 2: implement the optimization as a new rule gated behind a config flag that is **disabled** by default. This addresses concern 2.  
   
   
   **Documentation:** 
   There is general agreement that we should document this in the migration guide, so that users are aware of the optimization and the implications behind it. 
   
   **Questions:** 
   Q1: Can we move forward with this PR behind a flag that disables the optimization by default?
   Q2: Do we want to hold off on this optimization until UDFs are marked nondeterministic by default? 
   ———
   
   Please let me know. Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org