Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/13 23:17:48 UTC

[GitHub] [spark] skambha commented on a change in pull request #24593: [SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals

skambha commented on a change in pull request #24593: [SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals
URL: https://github.com/apache/spark/pull/24593#discussion_r283573248
 
 

 ##########
 File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
 ##########
 @@ -892,7 +893,7 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
       assert(msg.contains("Cannot save interval data type into external storage."))
 
       msg = intercept[AnalysisException] {
-        spark.udf.register("testType", () => new IntervalData())
+        spark.udf.register("testType", udf(() => new IntervalData()).asNondeterministic())
 
 Review comment:
  Thanks for the question.  This test was changed for the following reason.
   
   ```
   msg = intercept[AnalysisException] {
     spark.udf.register("testType", () => new IntervalData())
     sql("select testType()").write.format("avro").mode("overwrite").save(tempDir)
   }.getMessage
   assert(msg.toLowerCase(Locale.ROOT)
     .contains(s"avro data source does not support calendarinterval data type."))
   ```
   
   
   This is the **original** test case.  It tests an error code path in the datasource, and it triggers that codepath by calling a udf that returns an IntervalData.  However, IntervalData and its corresponding UDT do not implement the serialize or deserialize methods.
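   
   For context, here is a minimal sketch of a UDT in that shape (hypothetical code, not the actual test classes; `UserDefinedType` is `private[spark]`, so this only compiles inside Spark's own source tree):
   
   ```
   import org.apache.spark.sql.types._
   
   // A user class whose UDT deliberately supports neither serialize nor
   // deserialize, so any attempt to materialize a value of it throws.
   @SQLUserDefinedType(udt = classOf[IntervalUDT])
   class IntervalData extends Serializable
   
   class IntervalUDT extends UserDefinedType[IntervalData] {
     override def sqlType: DataType = CalendarIntervalType
     override def serialize(obj: IntervalData): Any =
       throw new UnsupportedOperationException("serialize is not supported")
     override def deserialize(datum: Any): IntervalData =
       throw new UnsupportedOperationException("deserialize is not supported")
     override def userClass: Class[IntervalData] = classOf[IntervalData]
   }
   ```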
   
   With the new optimizer rule in this PR, the udf is evaluated during the optimization phase if it is deterministic and all of its inputs are literals.  Here both conditions hold, so the optimizer tries to evaluate the udf; since serialize is not implemented for this UDT, that evaluation fails.  As a result we get a different error than the one this test case was trying to exercise.
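   
   To make the mechanics concrete, here is a minimal sketch of such a rule (hypothetical names, not the exact code in this PR), mirroring how `ConstantFolding` replaces foldable expressions with literals:
   
   ```
   import org.apache.spark.sql.catalyst.expressions.{EmptyRow, Literal, ScalaUDF}
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.catalyst.rules.Rule
   
   // Evaluate a deterministic ScalaUDF whose inputs are all literals exactly
   // once, at optimization time, and fold the result into the plan as a Literal.
   object EvalDeterministicUdfOnce extends Rule[LogicalPlan] {
     override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
       case udf: ScalaUDF if udf.deterministic && udf.children.forall(_.isInstanceOf[Literal]) =>
         Literal.create(udf.eval(EmptyRow), udf.dataType)
     }
   }
   ```
   
   With a rule like this in place, evaluating `testType()` at optimization time calls `IntervalUDT.serialize`, which throws before the Avro check is ever reached.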
   
   To keep the test case exercising the **original error codepath**, I have changed the udf to be non-deterministic.  This is done by this line:
   
   `spark.udf.register("testType", udf(() => new IntervalData()).asNondeterministic())`
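   
   In context, the updated test reads roughly like this (a sketch, reusing the names above):
   
   ```
   msg = intercept[AnalysisException] {
     // asNondeterministic() marks the udf non-deterministic, so the optimizer
     // leaves the call to be evaluated at runtime, where the Avro sink rejects
     // the CalendarInterval type as the original test intended.
     spark.udf.register("testType", udf(() => new IntervalData()).asNondeterministic())
     sql("select testType()").write.format("avro").mode("overwrite").save(tempDir)
   }.getMessage
   ```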
   
   Hope this helps.  

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org