Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 02:17:12 UTC

[GitHub] [arrow-datafusion] liukun4515 edited a comment on pull request #1881: add udf/udaf plugin

liukun4515 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057594304


   > > > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udfs. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest community version, I need to do extra merge work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to change the version number of the datafusion dependency in my code and then recompile the udf dynamic libraries. I believe this is a friendlier way for those who actually use ballista as a computing engine.
   > > > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > > > 
   > > > 1. People who use datafusion generally develop their own computing engines on top of datafusion. In this case, they often do not need udf plugins: they can put the udf code into their own computing engines and decide for themselves when to call register_udf to register the udf with datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > > > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the datafusion source code, so directly modifying the source code of ballista and datafusion is very difficult for them. They may update the version of ballista frequently, and modifying ballista's or datafusion's source code means that each upgrade requires merging code and recompiling, which is a very big burden for them. In particular, it should be pointed out that there is currently no way for a udf to work in ballista, because serializing and deserializing a udf requires knowing its concrete implementation, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious: users only need to maintain their own udf code and do not need to pay attention to code changes in ballista or datafusion. In ballista, we can serialize a udf by its name, and then deserialize it by looking the name up with `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are performed through the trait `UDFPlugin` (a minimal sketch follows this list); ballista does not need to know who implemented the plugin.
   > > > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified, as these are for those who use datafusion rather than ballista. So I think I should modify the code and move the plugin mod into the ballista crate instead of leaving it in datafusion.
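   > > > 
   > > > To make the shape of this concrete, here is a minimal sketch of such a plugin trait. Only `get_scalar_udf_by_name` is taken from the description above; the `udf_names` helper and the module paths (which vary across datafusion versions) are illustrative assumptions:
   > > > 
   > > > ```
   > > > use datafusion::error::Result;
   > > > use datafusion::physical_plan::udf::ScalarUDF;
   > > > 
   > > > /// Sketch of a UDF plugin trait: ballista resolves a UDF by the
   > > > /// name recorded in the serialized plan, without knowing which
   > > > /// dynamic library implements it.
   > > > pub trait UDFPlugin: Send + Sync {
   > > >     /// Return the scalar UDF registered under `fun_name`.
   > > >     fn get_scalar_udf_by_name(&self, fun_name: &str) -> Result<ScalarUDF>;
   > > > 
   > > >     /// Illustrative helper: list the UDF names this plugin provides,
   > > >     /// so they can be registered in the ExecutionContext up front.
   > > >     fn udf_names(&self) -> Result<Vec<String>>;
   > > > }
   > > > ```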
   > > > 
   > > > Thanks a lot, can you give me more advice on these?
   > > 
   > > 
   > > For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own `main` function which registers desired UDF/UDAF in the global execution context.
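   > > 
   > > A rough sketch of that approach (the `ballista_executor::run_with_context` entry point below is hypothetical; the point is only that the binary is yours, so UDFs can be registered before ballista ever sees the context):
   > > 
   > > ```
   > > use datafusion::prelude::ExecutionContext;
   > > 
   > > #[tokio::main]
   > > async fn main() -> Result<(), Box<dyn std::error::Error>> {
   > >     // Build the context that will later be used to (de)serialize plans.
   > >     let mut ctx = ExecutionContext::new();
   > > 
   > >     // Register custom UDFs/UDAFs here, e.g.:
   > >     // ctx.register_udf(my_udf());
   > > 
   > >     // Hypothetical entry point: hand the pre-configured context to
   > >     // ballista instead of building ballista from source.
   > >     ballista_executor::run_with_context(ctx).await?;
   > >     Ok(())
   > > }
   > > ```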
   > 
   > Ugh, I always thought that ballista was an out-of-the-box computing engine like presto/impala, not a computing library, so I didn't expect that using ballista would mean depending on the ballista crate and defining your own main function. Of course, for those who want to develop their own computing engine based on ballista, this is indeed a good way, and it means the udf plugin does not need to live in the ballista crate: they can maintain the udf plugin in their own projects, load udf plugins in their own main function, and then register them in the global ExecutionContext. When serializing and deserializing a LogicalPlan, the implementation of a udf can then be found through the ExecutionContext that is passed in:
   > 
   > ```
   > fn try_into_logical_plan(
   >     &self,
   >     ctx: &ExecutionContext,
   >     extension_codec: &dyn LogicalExtensionCodec,
   > ) -> Result<LogicalPlan, BallistaError>;
   > ```
   > 
   > But I'm still not quite sure: is ballista an out-of-the-box compute engine like presto/impala, or is it a library that others depend on to implement their own computing engine, like datafusion?
   
   I agree with your opinion.
   Ballista is a distributed compute engine like spark and others.
   Users who want to use udfs should not need to recompile the code.
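   
   For reference, registering a scalar udf with datafusion looks roughly like this (module paths follow the datafusion 7.x layout and may differ in other versions; the `plus_one` function is just an example):
   
   ```
   use std::sync::Arc;
   
   use datafusion::arrow::array::{ArrayRef, Int64Array};
   use datafusion::arrow::datatypes::DataType;
   use datafusion::error::Result;
   use datafusion::physical_plan::functions::make_scalar_function;
   use datafusion::prelude::*;
   
   fn main() -> Result<()> {
       // An example udf that adds one to an Int64 column.
       let plus_one = |args: &[ArrayRef]| {
           let input = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
           let result: Int64Array = input.iter().map(|v| v.map(|x| x + 1)).collect();
           Ok(Arc::new(result) as ArrayRef)
       };
   
       let udf = create_udf(
           "plus_one",                // name used in SQL and in serialized plans
           vec![DataType::Int64],     // argument types
           Arc::new(DataType::Int64), // return type
           Volatility::Immutable,
           make_scalar_function(plus_one),
       );
   
       // After registration the udf is callable from SQL,
       // e.g. SELECT plus_one(a) FROM t
       let mut ctx = ExecutionContext::new();
       ctx.register_udf(udf);
       Ok(())
   }
   ```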

