You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/08 07:48:39 UTC

[GitHub] [spark] cloud-fan commented on pull request #24559: [SPARK-27658][SQL] Add FunctionCatalog API

cloud-fan commented on pull request #24559:
URL: https://github.com/apache/spark/pull/24559#issuecomment-774945050


   @rdblue Thanks for writing up the design doc! This is a very important and useful feature, and the `UnboundFunction` seems like a very interesting idea. It allows function overload (for different input schema, people can return different `BoundFunction`), but I'm wondering how it can suggest Spark to add Cast. For example, if a function accepts int type input, but the actual input is byte type.
   
   Another point is we should think of the final generated java code when invoking UDF. With whole-stage-codegen (the default case), the input values are actually java variables in the generated java code. It means we need to build an `InternalRow` before invoking the new UDF, which is very inefficient and is even worse than the current Spark Scala/Java UDF. Also, the type parameter of the return type has perf issues because of primitive type boxing.
   
   My rough idea is
   ```
   interface ScalarFunction {
     StructType[] expectedInputTypes();
     DataType returnType();
   }
   
   class MyScalaFunction implements ScalarFunction {
     StructType[] expectedInputTypes() { // ... allows int and string }
     DataType returnType() { return IntegerType; }
   
     int call(int arg) { return String.valueOf(arg).length(); }
     int call(UTF8String arg) { return arg.length(); }
   }
   ```
   The analyzer will bind the UDF with actual input types (add implicit cast if needed), and check if the `call` method exits
    for certain input/return types via reflection. Then in whole-stage-codegen, we just call the `call` method with certain type of inputs, and assign the result to a java variable. No need to build `InternalRow`, no boxing overhead, but no compile-time type safety (analyzer can still catch errors).
   
   cc @viirya @maropu @kiszk @rednaxelafx 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org