You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/25 06:57:30 UTC

[GitHub] [arrow-datafusion] gaojun2048 opened a new pull request #1881: add udf/udaf plugin

gaojun2048 opened a new pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881


   Now we cannot use UDF and UDAF in ballista because ballista cannot know how to serialize and deserialize UDF / UDAF.
   We are using Trino. Referring to the practice of Trino, we can realize the plug-in of UDF through the way of rust dynamic library. In this way, ballista and datafusion only need to know the plug-in interface of UDF, and they can work without knowing the specific implementation of UDF.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1068866173


   > @Ted-Jiang @alamb
   > 
   > In fact, after the external UDF/UDAF is registered with datafusion's `ExecutionContext` via the `register_udf` function, As long as we know the name of UDF/UDAF, we can obtain the implementation of UDF/UDAF from ExecutionContext in the following ways to complete the deserialization of UDF/UDAF.
   > 
   > ```
   > impl FunctionRegistry for ExecutionContext {
   >     fn udfs(&self) -> HashSet<String> {
   >         self.state.lock().udfs()
   >     }
   > 
   >     fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>> {
   >         self.state.lock().udf(name)
   >     }
   > 
   >     fn udaf(&self, name: &str) -> Result<Arc<AggregateUDF>> {
   >         self.state.lock().udaf(name)
   >     }
   > }
   > ```
   > 
   > So we can put name in proto when we serialize UDF/UDAF, and then we can deserialize UDF/UDAF using this name and ExecutionContext. But that need add a function like this:
   > 
   > ```
   > fn from_proto_to_logicalplan(logical_plan: &protobuf::LogicalExprNode, ctx: &ExecutionContext) -> Result<LogicalPlan, Error>
   > ```
   > 
   > There is no doubt that it is also a good way to add a ballista-plugin crate and make both datafusion-proto and ballista-core dependent on ballista-plugin. This approach may require fewer changes to the code.
   
   I like this approach. In our project we plan on building a custom ballista binary so we can register our `ExecutionContext` at startup. I think using the `ExecutionContext` rather than the plugin manager directly in the serde should support both approaches. 
   
   In general I think the design with respect to plugins is that they are loaded into the `ExecutionContext` (or `SessionContext` when that works is merged) at startup and then internally the context is used. This should support both use cases as well as allow users to build Ballista without the plugin manager if they do not need it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1076730813


   cc @mingmwang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1079954431


   @liukun4515  @alamb  @thinkharderdev  Everything is ok. And the udaf test success now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1050674214


   Because some code that I didn't change caused the clippy check failed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057139376


   > Yes. from this issue https://github.com/rust-lang/rfcs/issues/600 I sea Rust doesn’t have a stable ABI, meaning different compiler versions can generate incompatible code. For these reasons, the UDF plug-in must be compiled using the same version of rustc as datafusion.
   
   That makes sense -- it might help to add a comment to the source code explaining that rationale (so that future readers understand as well)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1056580279


   > > 
   > 
   > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.
   > 
   > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > 
   > 1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deserializ
 ation udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
   > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   > 
   > Thanks a lot, can you give me more advice on these?
   
   I sea in this pr : https://github.com/apache/arrow-datafusion/pull/1887 the scalar_function and aggregate_function deserialization and serialization is move to `datafusion-serialization`. In `datafusion-serialization` we need use udf plugin to serialization and deserialization udf. So, Where should we put plugin mod？


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

realno commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057639270


   > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   
   > 💯 agree here too
   
   @alamb @gaojun2048 this is an interesting point, could you explain a bit more?
   
   I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the computer core wouldn't it make sense to have udf support in DF? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] liukun4515 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057597808


   > Thank you for your efforts @gaojun2048 . I skimmed through the code again and I think if we move the plugin manager to ballista it would be good to go from my perspective 👍 .
   > 
   > cc @andygrove @thinkharderdev @edrevo @matthewmturner @liukun4515 and @realno (I am sorry if you work together or already know about this work)
   
   Maybe I need to take time to look at this.
   I can finish reviewing this today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065850733


   @alamb  @thinkharderdev  @Ted-Jiang  @liukun4515  
   
   For the latest problem, I found that we have added `datafusion-proto` crate, in which the serialization and deserialization of LogicalPlan are processed. Now I have move the UDF plugin to ballista, but the `datafusion-proto` crate can not dependency `ballista-core`. So the udf/udaf can not deserialization. Any good suggestions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065883511


   > So the udf/udaf can not deserialization. Any good suggestions?
   
   IMO, this PR add the ability for load external udf/udaf. 
   FYI, the built in udf/udaf  have fixed in #1988 for ballista, if wrong, please correct me.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r816848645



##########
File path: datafusion/build.rs
##########
@@ -0,0 +1,21 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+fn main() {
+    let version = rustc_version::version().unwrap();
+    println!("cargo:rustc-env=RUSTC_VERSION={}", version);
+}

Review comment:
       Can you explain the need for getting the rustc version? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057356012


   > > 
   > 
   > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.
   > 
   > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > 
   > 1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deserializ
 ation udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
   > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   > 
   > Thanks a lot, can you give me more advice on these?
   
   For what it's worth, with the changes in https://github.com/apache/arrow-datafusion/pull/1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own `main` function which registers desired UDF/UDAF in the global execution context. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057142655


   > Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. ... I believe this is a more friendly way for those who actually use ballista as a computing engine.
   
   I agree and this makes sense
   
   > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   
   💯  agree here too
   
   
   > I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   
   I think the idea of moving the plugin module into ballista makes a lot of sense to me
   
   > Thanks a lot, can you give me more advice on these?
   
   Thank you for your clear explination and justification 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] liukun4515 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

liukun4515 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058887462


   > Thanks to all the friends who participated in the discussion, from everyone's feedback, the `udf plugin` is a very useful feature for ballista. I will modify the code in the next day or two to migrate the `udf plugin` mod to the ballista-core crate (because `ballista-core/src/serde/logical_plan/from_proto.rs` and `physical_plan/from_proto` .rs`needs to use`udf plugin` to deserialize udf). Thanks!
   
   Thanks for your work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mingmwang commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

mingmwang commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1085503799


   > > It should take a FuntionRegisrty now (which will be a TaskContext) at runtime. I think we should use that since we can setup the TaskContext with any preloaded functions
   > 
   > Ok. I will update code and use TaskContext to serialization and deserialization UDF
   
   Yes, there are several changes to SessionContext in those days.  The Executor does not have a global SessionContext now.
   You can have your UDF Plugin Manager load all the dynamic UDFs/UDAFs to Executor's member. I had added a TOTO note .
   
   ````
   impl Executor {
       /// Create a new executor instance
       pub fn new(
           metadata: ExecutorRegistration,
           work_dir: &str,
           runtime: Arc<RuntimeEnv>,
       ) -> Self {
           Self {
               metadata,
               work_dir: work_dir.to_owned(),
               // TODO add logic to dynamically load UDF/UDAFs libs from files
               scalar_functions: HashMap::new(),
               aggregate_functions: HashMap::new(),
               runtime,
           }
       }
   }
   
   ````
   
   In Ballista Scheduler side, there is no global SessionContext either, SessionContext is created on users' requests.
   You can add the  UDF Plugin Manager to Ballista SchedulerServer, when the new session context was created, you can 
   call the  register the UDF/UDAFs to the created session context.
   
   
   ````
   /// Create a DataFusion session context that is compatible with Ballista Configuration
   pub fn create_datafusion_context(
       config: &BallistaConfig,
       session_builder: SessionBuilder,
   ) -> Arc<SessionContext> {
       let config = SessionConfig::new()
           .with_target_partitions(config.default_shuffle_partitions())
           .with_batch_size(config.default_batch_size())
           .with_repartition_joins(config.repartition_joins())
           .with_repartition_aggregations(config.repartition_aggregations())
           .with_repartition_windows(config.repartition_windows())
           .with_parquet_pruning(config.parquet_pruning());
       let session_state = session_builder(config);
       Arc::new(SessionContext::with_state(session_state))
       /// Add logic to register UDF/UDFS to context.
   }
   
   ````
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1068866939


   @gaojun2048 Not sure how soon you will be able to get back to this, but if you want I can work on a PR to do the UDF/UDAF deserialization using the `ExecutionContext`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065850733


   @alamb  @thinkharderdev  @Ted-Jiang  @liukun4515  
   
   For the latest problem, I found that we have added `datafusion-proto` crate, in which the serialization and deserialization of LogicalPlan are processed. Now I have move the UDF plugin to ballista, but the `datafusion-proto` crate can not dependency `ballista-core`. So the udf/udaf can not deserialization. Any good suggestions?
   
   More information about this can see in the build error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r834903280



##########
File path: ballista/rust/core/src/serde/physical_plan/from_proto.rs
##########
@@ -81,6 +83,41 @@ impl TryFrom<&protobuf::PhysicalExprNode> for Arc<dyn PhysicalExpr> {
                         .to_owned(),
                 ));
             }
+            ExprType::ScalarUdfProtoExpr(e) => {
+                if let Some(udf_plugin_manager) = get_udf_plugin_manager("") {

Review comment:
       @thinkharderdev 
   I added TODO and I plan to finish physical plan serialization and deserialization through SessionContext in the next PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1083063047


   > > Awesome!
   > 
   > 
   > 
   > Today I tried to resolve the conflict, but I found it very difficult. In `AsExecutionPlan.try_into_physical_plan` `SessionContext` is removed. So I can not serialization and deserialization UDF with SessionContext. So I will update my code and serialization and deserialization UDF with `udf_plugin`.
   
   It should take a FuntionRegisrty now (which will be a TaskContext) at runtime. I think we should use that since we can setup the TaskContext with any preloaded functions 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1076243748


   Hi, @Ted-Jiang  @alamb  @thinkharderdev  @liukun4515  
   
   I have completed all the development and testing of the UDF plugin, but now the udaf test will get stuck
   
   <img width="1347" alt="image" src="https://user-images.githubusercontent.com/32193458/159686486-ae21f83e-36d5-4d39-8238-fd4a8c121350.png">
   
   
   This PR has lasted for a long time. There are many code changes in the middle, and I see many people discussing this issue, so I hope to merge the PR first. I will fix the problem after the prud test is completed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058403165


   > > If I understand correctly, you meant, udf is need in DF, but the mechanism to register udf during runtime is only needed in Ballista, correct? If so then I fully agree.
   > Yes, That's what I mean.
   
   This is also my understanding


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

realno edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057639270


   > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   
   > 💯 agree here too
   
   @alamb @gaojun2048 this is an interesting point, could you explain a bit more?
   
   I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1051573273


   > The clippy check failed because package fuzz-utils. But I don't modify this package.
   > 
   > <img alt="image" width="1439" src="https://user-images.githubusercontent.com/32193458/155689278-72fdd5bd-1b98-44b2-8914-ca6299e5dcd0.png">
   
   fixed in #1880 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1056084581

Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.

In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion.
3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.

Thanks a lot, can you give me more advice on these?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1056084581

Thank you for your advice @alamb .
Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.

Thanks a lot, can you give me more advice on these?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058406981


   I filed https://github.com/apache/arrow-datafusion/issues/1916 to discuss the direction of Ballista, FWIW


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r833635693



##########
File path: datafusion/src/physical_plan/udf.rs
##########
@@ -42,10 +45,102 @@ pub fn create_physical_expr(
         .map(|e| e.data_type(input_schema))
         .collect::<Result<Vec<_>>>()?;
 
-    Ok(Arc::new(ScalarFunctionExpr::new(
-        &fun.name,
-        fun.fun.clone(),
-        coerced_phy_exprs,
-        (fun.return_type)(&coerced_exprs_types)?.as_ref(),
-    )))
+    Ok(Arc::new(ScalarUDFExpr {
+        fun: fun.clone(),
+        name: fun.name.clone(),
+        args: coerced_phy_exprs.clone(),
+        return_type: (fun.return_type)(&coerced_exprs_types)?.as_ref().clone(),
+    }))
+}
+
+/// Physical expression of a UDF.
+/// argo engine add

Review comment:
       I don't understand what "argo engine add" means here




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1085716579






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065873229


   >  So the udf/udaf can not deserialization. Any good suggestions
   
   What about making a `ballista-plugin` crate (that depends on ballista core as well as datafusion-proto)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1082566772


   > Awesome!
   
   Today I tried to resolve the conflict, but I found it very difficult. In `AsExecutionPlan.try_into_physical_plan` `SessionContext` is removed. So I can not serialization and deserialization UDF with SessionContext. So I will update my code and serialization and deserialization UDF with `udf_plugin`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1085051261


   > > It should take a FuntionRegisrty now (which will be a TaskContext) at runtime. I think we should use that since we can setup the TaskContext with any preloaded functions
   > 
   > Ok. I will update code and use TaskContext to serialization and deserialization UDF
   
   Hi @gaojun2048. I had to implement this on our fork for our project so I went ahead and PR'd it here https://github.com/apache/arrow-datafusion/pull/2130. Hope that helps!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r817890437



##########
File path: datafusion-examples/examples/simple_udf_plugin.rs
##########
@@ -0,0 +1,142 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion::arrow::array::{ArrayRef, Int64Array, ListArray};
+use datafusion::arrow::datatypes::{DataType, Field, Int64Type};
+use datafusion::declare_udf_plugin;
+use datafusion::error::DataFusionError;
+use datafusion::error::Result;
+use datafusion::physical_plan::functions::{make_scalar_function, Signature, Volatility};
+use datafusion::physical_plan::udaf::AggregateUDF;
+use datafusion::physical_plan::udf::ScalarUDF;
+use datafusion::plugin::udf::UDFPlugin;
+use datafusion_expr::{ReturnTypeFunction, ScalarFunctionImplementation};
+use lazy_static::lazy_static;
+use std::any::Any;
+use std::sync::Arc;
+
+/// this examples show how to implements a udf plugin

Review comment:
       ```suggestion
   /// this examples show how to implements a udf plugin for Ballista
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] realno commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

realno commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057737445


   > Yes, it is important and required for DF to support udf. But for those who use DF, it is not necessary to support the udf plugin to dynamically load udf.
   
   If I understand correctly, you meant, udf is need in DF, but the mechanism to register udf during runtime is only needed in Ballista, correct? If so then I fully agree.
   
   > I don’t know if my understanding is wrong. I always think that DF is just a computing library, which cannot be directly deployed in production. 
   
   Currently DF does more than just a computing library, it has full support for SQL parsing, dataframe API and planners - it is a complete single node compute engine. Ballista still uses most of the fundamental implementation from DF. The current effort splitting it to separate creates  is a step towards making it just a library, I am curious what the direction the community is thinking here @alamb . If this is where we are going I think we can think about merging the SQL, DataFrame, ExecutionContext and Planner logic between DF and Ballista. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] liukun4515 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058887462


   > Thanks to all the friends who participated in the discussion, from everyone's feedback, the `udf plugin` is a very useful feature for ballista. I will modify the code in the next day or two to migrate the `udf plugin` mod to the ballista-core crate (because `ballista-core/src/serde/logical_plan/from_proto.rs` and `physical_plan/from_proto` .rs`needs to use`udf plugin` to deserialize udf). Thanks!
   
   Thanks for your works.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058421054


   > > > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.
   > > > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > > > 
   > > > 1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > > > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deseri
 alization udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
   > > > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   > > > 
   > > > Thanks a lot, can you give me more advice on these?
   > > 
   > > 
   > > For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own `main` function which registers desired UDF/UDAF in the global execution context.
   > 
   > Ugh, I always thought that ballista was an out-of-the-box computing engine, like presto/impala, not a computing library, so I don't quite understand that using ballista also requires dependency ballista and defines its own main function. Of course, for those who want to develop their own computing engine based on ballista, this is indeed a good way, which means that the udf plugin does not need to be placed in the ballista crate, because they can maintain the udf plugin in their own projects, and Load udf plugins in their own defined main function and then register them in the global ExecutionContext. When serializing and deserializing LogicalPlan, the implementation of udf can be found through the incoming ExecutionContext.
   > 
   > ```
   > 
   > fn try_into_logical_plan(
   >         &self,
   >         ctx: &ExecutionContext,
   >         extension_codec: &dyn LogicalExtensionCodec,
   >     ) -> Result<LogicalPlan, BallistaError>;
   > ```
   > 
   > But I'm still not quite sure, ballista is an out-of-the-box compute engine like presto/impala. Or is it just a dependent library for someone else to implement their own computing engine like datafusion?
   
   I think ideally Ballista will be an out-of-the-box compute engine, but currently we are not there yet. But either way I think being able to dynamically load UDFs is very valuable. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065889841


   @Ted-Jiang  @alamb 
   
   In fact, after the external UDF/UDAF is registered with datafusion's `ExecutionContext` via the `register_udf` function, As long as we know the name of UDF/UDAF, we can obtain the implementation of UDF/UDAF from ExecutionContext in the following ways to complete the deserialization of UDF/UDAF. 
   
   ```
   impl FunctionRegistry for ExecutionContext {
       fn udfs(&self) -> HashSet<String> {
           self.state.lock().udfs()
       }
   
       fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>> {
           self.state.lock().udf(name)
       }
   
       fn udaf(&self, name: &str) -> Result<Arc<AggregateUDF>> {
           self.state.lock().udaf(name)
       }
   }
   ```
   
   So we can put name in proto when we serialize UDF/UDAF, and then we can deserialize UDF/UDAF using this name and ExecutionContext.
   But that need add a function like this:
   
   ```
   fn from_proto_to_logicalplan(logical_plan: &protobuf::LogicalExprNode, ctx: &ExecutionContext) -> Result<LogicalPlan, Error>
   ```
   
   There is no doubt that it is also a good way to add a ballista-plugin crate and make both datafusion-proto and ballista-core dependent on ballista-plugin. This approach may require fewer changes to the code.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065889841


   In fact, after the external UDF/UDAF is registered with datafusion's `ExecutionContext` via the `register_udf` function, As long as we know the name of UDF/UDAF, we can obtain the implementation of UDF/UDAF from ExecutionContext in the following ways to complete the deserialization of UDF/UDAF. 
   
   ```
   impl FunctionRegistry for ExecutionContext {
       fn udfs(&self) -> HashSet<String> {
           self.state.lock().udfs()
       }
   
       fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>> {
           self.state.lock().udf(name)
       }
   
       fn udaf(&self, name: &str) -> Result<Arc<AggregateUDF>> {
           self.state.lock().udaf(name)
       }
   }
   ```
   
   So we can put name in proto when we serialize UDF/UDAF, and then we can deserialize UDF/UDAF using this name and ExecutionContext.
   But that need add a function like this:
   
   ```
   fn from_proto_to_logicalplan(logical_plan: &protobuf::LogicalExprNode, ctx: &ExecutionContext) -> Result<LogicalPlan, Error>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r834108696



##########
File path: ballista/rust/core/src/serde/physical_plan/from_proto.rs
##########
@@ -81,6 +83,41 @@ impl TryFrom<&protobuf::PhysicalExprNode> for Arc<dyn PhysicalExpr> {
                         .to_owned(),
                 ));
             }
+            ExprType::ScalarUdfProtoExpr(e) => {
+                if let Some(udf_plugin_manager) = get_udf_plugin_manager("") {

Review comment:
       This should be resolved from the `SessionContext` right?

##########
File path: ballista/rust/core/src/plugin/plugin_manager.rs
##########
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+use crate::error::{BallistaError, Result};
+use libloading::Library;
+use log::info;
+use std::collections::HashMap;
+use std::io;
+use std::sync::{Arc, Mutex};
+use walkdir::{DirEntry, WalkDir};
+
+use crate::plugin::{
+    PluginDeclaration, PluginEnum, PluginRegistrar, CORE_VERSION, RUSTC_VERSION,
+};
+use once_cell::sync::OnceCell;
+
+/// To prevent the library from being loaded multiple times, we use once_cell defines a Arc<Mutex<GlobalPluginManager>>
+/// Because datafusion is a library, not a service, users may not need to load all plug-ins in the process.
+/// So fn global_plugin_manager return Arc<Mutex<GlobalPluginManager>>. In this way, users can load the required library through the load method of GlobalPluginManager when needed
+static INSTANCE: OnceCell<Arc<Mutex<GlobalPluginManager>>> = OnceCell::new();
+
+/// global_plugin_manager
+pub fn global_plugin_manager(
+    plugin_path: &str,
+) -> &'static Arc<Mutex<GlobalPluginManager>> {
+    INSTANCE.get_or_init(move || unsafe {
+        let mut gpm = GlobalPluginManager::default();
+        gpm.load(plugin_path).unwrap();
+        Arc::new(Mutex::new(gpm))
+    })
+}
+
+#[derive(Default)]
+/// manager all plugin_type's plugin_manager
+pub struct GlobalPluginManager {
+    /// every plugin need a plugin registrar
+    pub plugin_managers: HashMap<PluginEnum, Box<dyn PluginRegistrar>>,
+
+    /// loaded plugin files
+    pub plugin_files: Vec<String>,
+}
+
+impl GlobalPluginManager {
+    /// # Safety
+    /// find plugin file from `plugin_path` and load it .
+    unsafe fn load(&mut self, plugin_path: &str) -> Result<()> {
+        if "".eq(plugin_path) {
+            return Ok(());
+        }
+        // find library file from udaf_plugin_path
+        info!("load plugin from dir:{}", plugin_path);
+
+        let plugin_files = self.get_all_plugin_files(plugin_path)?;
+
+        for plugin_file in plugin_files {
+            let library = Library::new(plugin_file.path()).map_err(|e| {
+                BallistaError::IoError(io::Error::new(
+                    io::ErrorKind::Other,
+                    format!("load library error: {}", e),
+                ))
+            })?;
+
+            let library = Arc::new(library);
+
+            let dec = library.get::<*mut PluginDeclaration>(b"plugin_declaration\0");
+            if dec.is_err() {
+                info!(
+                    "not found plugin_declaration in the library: {}",
+                    plugin_file.path().to_str().unwrap()
+                );
+                continue;
+            }
+
+            let dec = dec.unwrap().read();
+
+            // ersion checks to prevent accidental ABI incompatibilities
+            if dec.rustc_version != RUSTC_VERSION || dec.core_version != CORE_VERSION {
+                return Err(BallistaError::IoError(io::Error::new(
+                    io::ErrorKind::Other,
+                    "Version mismatch",
+                )));
+            }
+
+            let plugin_enum = (dec.plugin_type)();
+            let curr_plugin_manager = match self.plugin_managers.get_mut(&plugin_enum) {
+                None => {
+                    let plugin_manager = plugin_enum.init_plugin_manager();
+                    self.plugin_managers.insert(plugin_enum, plugin_manager);
+                    self.plugin_managers.get_mut(&plugin_enum).unwrap()
+                }
+                Some(manager) => manager,
+            };
+            curr_plugin_manager.load(library)?;
+            self.plugin_files
+                .push(plugin_file.path().to_str().unwrap().to_string());
+        }
+
+        Ok(())
+    }
+
+    /// get all plugin file in the dir
+    fn get_all_plugin_files(&self, plugin_path: &str) -> io::Result<Vec<DirEntry>> {
+        let mut plugin_files = Vec::new();
+        for entry in WalkDir::new(plugin_path).into_iter().filter_map(|e| {
+            let item = e.unwrap();
+            // every file only load once
+            if self
+                .plugin_files
+                .contains(&item.path().to_str().unwrap().to_string())
+            {
+                return None;
+            }
+
+            let file_type = item.file_type();
+            if !file_type.is_file() {
+                return None;
+            }
+
+            if let Some(path) = item.path().extension() {
+                if let Some(suffix) = path.to_str() {
+                    if suffix == "dylib" || suffix == "so" || suffix == "dll" {
+                        info!(
+                            "load plugin from library file:{}",
+                            item.path().to_str().unwrap()
+                        );
+                        println!(
+                            "load plugin from library file:{}",
+                            item.path().to_str().unwrap()
+                        );

Review comment:
       Is this a duplicate of the info log?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r815268308



##########
File path: datafusion/src/plugin/udf.rs
##########
@@ -0,0 +1,151 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+use crate::physical_plan::udaf::AggregateUDF;
+use crate::physical_plan::udf::ScalarUDF;
+use crate::plugin::plugin_manager::global_plugin_manager;
+use crate::plugin::{Plugin, PluginEnum, PluginRegistrar};
+use datafusion_common::{DataFusionError, Result};
+use libloading::{Library, Symbol};
+use std::any::Any;
+use std::collections::HashMap;
+use std::io;
+use std::sync::Arc;
+
+/// 定义udf插件，udf的定义方需要实现该trait

Review comment:
       🙈




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Jimexist commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r815446309



##########
File path: datafusion/build.rs
##########
@@ -0,0 +1,21 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+fn main() {
+    let version = rustc_version::version().unwrap();
+    println!("cargo:rustc-env=RUSTC_VERSION={}", version);
+}

Review comment:
       not sure this is something we'd add for datafusion




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1056084581

In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deserializat
ion udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.

Thanks a lot, can you give me more advice on these?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] liukun4515 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057594304


   > > > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.
   > > > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > > > 
   > > > 1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > > > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deseri
 alization udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
   > > > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   > > > 
   > > > Thanks a lot, can you give me more advice on these?
   > > 
   > > 
   > > For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own `main` function which registers desired UDF/UDAF in the global execution context.
   > 
   > Ugh, I always thought that ballista was an out-of-the-box computing engine, like presto/impala, not a computing library, so I don't quite understand that using ballista also requires dependency ballista and defines its own main function. Of course, for those who want to develop their own computing engine based on ballista, this is indeed a good way, which means that the udf plugin does not need to be placed in the ballista crate, because they can maintain the udf plugin in their own projects, and Load udf plugins in their own defined main function and then register them in the global ExecutionContext. When serializing and deserializing LogicalPlan, the implementation of udf can be found through the incoming ExecutionContext.
   > 
   > ```
   > 
   > fn try_into_logical_plan(
   >         &self,
   >         ctx: &ExecutionContext,
   >         extension_codec: &dyn LogicalExtensionCodec,
   >     ) -> Result<LogicalPlan, BallistaError>;
   > ```
   > 
   > But I'm still not quite sure, ballista is an out-of-the-box compute engine like presto/impala. Or is it just a dependent library for someone else to implement their own computing engine like datafusion?
   
   Agree with your option.
   The ballista is a distributed computed engine like spark and others.
   Users who want to use the udf and just don't need to recompile the codes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058415493


   If i'm understanding everything correctly i think theres an important point worth making on how `datafusion` on its own can be used as a system, similar to spark.   Specifically, the `datafusion-python` bindings i believe would fall in the category of system (@alamb i believe you had that view as well - please correct me if im wrong or if things have changed since then).
   
   on my side, im planning to use `datafusion-python` until size of data / etc require using ballista.  with that in mind, my hope is that `datafusion-python` can have first class system like support - for example runtime registration of UDFs.  I know there is existing support for UDF/UDAF in `datafusion-python` although it isnt clear to me how the changes here could impact that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058403165


   > > If I understand correctly, you meant, udf is need in DF, but the mechanism to register udf during runtime is only needed in Ballista, correct? If so then I fully agree.
   
   > Yes, That's what I mean.
   
   This is also my understanding


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1081705181


   > The datafusion changes look good to me. Thank you very much @gaojun2048. Can someone please review the Ballista changes?
   > 
   > Perhaps @liukun4515 @mingmwang @thinkharderdev has the time and expertise?
   
   I can review today


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1059471984


   Thank you @gaojun2048  for taking the feedback and making changes. I appreciate that building consensus took some non trivial effort in this case. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065883511


   > So the udf/udaf can not deserialization. Any good suggestions?
   
   IMO, this PR add the ability for load external udf/udaf. 
   FYI, the built in udf/udaf  have fixed in #1988 for ballista, if wrong, please correct me.
   
   If this use for external udf/udaf.  
   >making a ballista-plugin crate
   
   +1  this is a very awesome work !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1050674214


   The clippy check failed because package fuzz-utils. But I don't modify this package.
   
   <img width="1439" alt="image" src="https://user-images.githubusercontent.com/32193458/155689278-72fdd5bd-1b98-44b2-8914-ca6299e5dcd0.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r815449602



##########
File path: datafusion/build.rs
##########
@@ -0,0 +1,21 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+fn main() {
+    let version = rustc_version::version().unwrap();
+    println!("cargo:rustc-env=RUSTC_VERSION={}", version);
+}

Review comment:
       Add build.rs  file  can help me easy to get the rustc_version compiled by the project .
   Of course, I can modify ` pub static RUSTC_VERSION: &str = env! ("RUSTC_VERSION");`  In `datafusion / src / plugin / mod.rs`  Through `rustc_version::version().unwrap()` to get the rustc_version at compile time.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1056252014


   > 
   `Can you explain the need for getting the rustc version?`
   https://github.com/apache/arrow-datafusion/pull/1881#discussion_r816848645
   
   Yes. from this issue https://github.com/rust-lang/rfcs/issues/600 I sea Rust doesn’t have a stable ABI, meaning different compiler versions can generate incompatible code. For these reasons, the UDF plug-in must be compiled using the same version of rustc as datafusion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1085311262


   > #2130
   
   Ok, Can I submit the plugin related code first, regardless of the serialization and deserialization parts of UDF?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1085348962


   @thinkharderdev  I push a sub PR of this PR. please help me review.
   
   https://github.com/apache/arrow-datafusion/pull/2131


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1050674214


   The clippy check failed because some code that is not changed by me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057550703


   > > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.
   > > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > > 
   > > 1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deserial
 ization udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
   > > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   > > 
   > > Thanks a lot, can you give me more advice on these?
   > 
   > For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own `main` function which registers desired UDF/UDAF in the global execution context.
   
   Ugh, I always thought that ballista was an out-of-the-box computing engine, like presto/impala, not a computing library, so I don't quite understand that using ballista also requires dependency ballista and defines its own main function. Of course, for those who want to develop their own computing engine based on ballista, this is indeed a good way, which means that the udf plugin does not need to be placed in the ballista crate, because they can maintain the udf plugin in their own projects, and Load udf plugins in their own defined main function and then register them in the global ExecutionContext. When serializing and deserializing LogicalPlan, the implementation of udf can be found through the incoming ExecutionContext.
   
   But I'm still not quite sure, ballista is an out-of-the-box compute engine like presto/impala. Or is it just a dependent library for someone else to implement their own computing engine like datafusion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058759101


   > > Yes. from this issue [rust-lang/rfcs#600](https://github.com/rust-lang/rfcs/issues/600) I sea Rust doesn’t have a stable ABI, meaning different compiler versions can generate incompatible code. For these reasons, the UDF plug-in must be compiled using the same version of rustc as datafusion.
   > 
   > That makes sense -- it might help to add a comment to the source code explaining that rationale (so that future readers understand as well)
   
   Ok, I will add a comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1076261320


   > Hi, @Ted-Jiang @alamb @thinkharderdev @liukun4515
   > 
   > I have completed all the development and testing of the UDF plugin, but now the udaf test will get stuck
   > 
   > <img alt="image" width="1347" src="https://user-images.githubusercontent.com/32193458/159686486-ae21f83e-36d5-4d39-8238-fd4a8c121350.png">
   > 
   > This PR has lasted for a long time. There are many code changes in the middle, and I see many people discussing this issue, so I hope to merge the PR first. I will fix the problem after the prud test is completed.
   
   Sorry, I will update code and resolve conflicts later


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] liukun4515 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

liukun4515 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057594304


   > > > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.
   > > > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > > > 
   > > > 1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > > > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deseri
 alization udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
   > > > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   > > > 
   > > > Thanks a lot, can you give me more advice on these?
   > > 
   > > 
   > > For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own `main` function which registers desired UDF/UDAF in the global execution context.
   > 
   > Ugh, I always thought that ballista was an out-of-the-box computing engine, like presto/impala, not a computing library, so I don't quite understand that using ballista also requires dependency ballista and defines its own main function. Of course, for those who want to develop their own computing engine based on ballista, this is indeed a good way, which means that the udf plugin does not need to be placed in the ballista crate, because they can maintain the udf plugin in their own projects, and Load udf plugins in their own defined main function and then register them in the global ExecutionContext. When serializing and deserializing LogicalPlan, the implementation of udf can be found through the incoming ExecutionContext.
   > 
   > ```
   > 
   > fn try_into_logical_plan(
   >         &self,
   >         ctx: &ExecutionContext,
   >         extension_codec: &dyn LogicalExtensionCodec,
   >     ) -> Result<LogicalPlan, BallistaError>;
   > ```
   > 
   > But I'm still not quite sure, ballista is an out-of-the-box compute engine like presto/impala. Or is it just a dependent library for someone else to implement their own computing engine like datafusion?
   
   Agree with your opinion.
   The ballista is a distributed computed engine like spark and others.
   Users who want to use the udf and just don't need to recompile the codes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1083160223


   > It should take a FuntionRegisrty now (which will be a TaskContext) at runtime. I think we should use that since we can setup the TaskContext with any preloaded functions
   
   Ok. I will update code and use TaskContext to serialization and deserialization UDF


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on a change in pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on a change in pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#discussion_r834903280



##########
File path: ballista/rust/core/src/serde/physical_plan/from_proto.rs
##########
@@ -81,6 +83,41 @@ impl TryFrom<&protobuf::PhysicalExprNode> for Arc<dyn PhysicalExpr> {
                         .to_owned(),
                 ));
             }
+            ExprType::ScalarUdfProtoExpr(e) => {
+                if let Some(udf_plugin_manager) = get_udf_plugin_manager("") {

Review comment:
       I added TODO and I plan to finish physical plan serialization and deserialization through SessionContext in the next PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1079236920


   Hi @gaojun2048  -- I think github has been having some issues: https://www.githubstatus.com/history
   
   I re kicked off the jobs here: https://github.com/apache/arrow-datafusion/actions/runs/2040619251 
   
   Hopefully they will complete this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065889841


   In fact, after the external UDF/UDAF is registered with datafusion's `ExecutionContext` via the `register_udf` function, `LogicalPlan` can be deserialized via ExecutionContext 
   
   ```
   impl FunctionRegistry for ExecutionContext {
       fn udfs(&self) -> HashSet<String> {
           self.state.lock().udfs()
       }
   
       fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>> {
           self.state.lock().udf(name)
       }
   
       fn udaf(&self, name: &str) -> Result<Arc<AggregateUDF>> {
           self.state.lock().udaf(name)
       }
   }
   ```
   get the implementation of UDF/UDAF. But that need add a function like this:
   
   ```
   fn from_proto_to_logicalplan(logical_plan: &protobuf::LogicalExprNode, ctx: &ExecutionContext) -> Result<LogicalPlan, Error>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065883511


   > So the udf/udaf can not deserialization. Any good suggestions?
   
   IMO, this PR add the ability for load external udf/udaf. 
   FYI, the built in udf/udaf  have fixed in #1988 for ballista, if wrong, please correct me.
   
   If this use for external udf/udaf.  
   >making a ballista-plugin crate
   
   +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1068870286


   > In general I think the design with respect to plugins is that they are loaded into the ExecutionContext (or SessionContext when that works is merged) at startup and then internally the context is used
   
   Yes, I think so.
   
   > @gaojun2048 Not sure how soon you will be able to get back to this, but if you want I can work on a PR to do the UDF/UDAF deserialization using the `ExecutionContext`.
   
   That`s great. If so, I will modify this PR, delete the changes related to UDF/UDAF deserialization, and only retain the function of load udf_plugin and register UDF/UDAF to ExecutionContext.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057550703


   > > Thank you for your advice @alamb . Yes, the udf plugin is designed for those who use Ballista as a computing engine, but do not want to modify the source code of ballista. We use ballista in production and we need ballista to be able to use our custom udf. As a user of ballista, I am reluctant to modify the source code of ballista directly, because it means that I need to recompile ballista myself, and in the future, when I want to upgrade ballista to the latest version of the community, I need to do more merges work. If I use the udf plugin, I only need to maintain the custom udf code. When I upgrade the version of ballista, I only need to modify the version number of the datafusion dependency in the code, and then recompile these udf dynamic libraries. I believe this is a more friendly way for those who actually use ballista as a computing engine.
   > > In my opinion, people who use datafusion and people who use ballista are different people, and the udf plugin is more suitable for ballista than datafusion.
   > > 
   > > 1. People who use datafusion generally develop their own computing engines on the basis of datafusion. In this case, they often do not need udf plugins. They only need to put the udf code into their own computing engines, and they decide for themselves. When to call register_udf to register udf into datafusion. If needed, they can handle the serialization and deserialization of custom UDFs in their own computing engine to achieve distributed scheduling.
   > > 2. People who use ballista generally only use ballista as a computing engine. They often do not have a deep understanding of the source code of datafusion. It is very difficult to directly modify the source code of ballista and datafusion. They may update the version of ballista frequently, and modifying the source code of ballista's datafusion means that each upgrade requires merge code and recompile, which is a very big burden for them. In particular, it should be pointed out that there is no way for udf to work in ballista now, because serialization and deserialization of udf need to know the specific implementation of udf, which cannot be achieved without modifying the source code of ballista and datafusion. The role of the udf plugin in this case is very obvious. They only need to maintain their own udf code and do not need to pay attention to the code changes of ballista's datafusion. And In ballista, we can serialization the udf with the udf's name, And then we deserial
 ization udf use the udf's name `get_scalar_udf_by_name(&self, fun_name: &str)`. These operations are completed through the trail `UDFPlugin`. Ballista does not need to know who has implemented the UDF plugin.
   > > 3. I don't think scalar_functions and aggregate_functions in ExecutionContext need to be modified as these are for those who use datafusion but not ballista. So I think I should modify the code and migrate the plugin mod into the ballista crate instead of staying in datafusion.
   > > 
   > > Thanks a lot, can you give me more advice on these?
   > 
   > For what it's worth, with the changes in #1677 you wouldn't actually have to build Ballista from source or modify the ballista source. You can just use the ballista crate dependency and define your own `main` function which registers desired UDF/UDAF in the global execution context.
   
   Ugh, I always thought that ballista was an out-of-the-box computing engine, like presto/impala, not a computing library, so I don't quite understand that using ballista also requires dependency ballista and defines its own main function. Of course, for those who want to develop their own computing engine based on ballista, this is indeed a good way, which means that the udf plugin does not need to be placed in the ballista crate, because they can maintain the udf plugin in their own projects, and Load udf plugins in their own defined main function and then register them in the global ExecutionContext. When serializing and deserializing LogicalPlan, the implementation of udf can be found through the incoming ExecutionContext.
   ```
   
   fn try_into_logical_plan(
           &self,
           ctx: &ExecutionContext,
           extension_codec: &dyn LogicalExtensionCodec,
       ) -> Result<LogicalPlan, BallistaError>;
   ```
   But I'm still not quite sure, ballista is an out-of-the-box compute engine like presto/impala. Or is it just a dependent library for someone else to implement their own computing engine like datafusion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1051572336


   I have meet the same issue, Thanks for your work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1051573375


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang edited a comment on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

Ted-Jiang edited a comment on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065883511


   > So the udf/udaf can not deserialization. Any good suggestions?
   
   IMO, this PR add the ability for load external udf/udaf. 
   FYI, the built in udf/udaf  have fixed in #1988 for ballista, if wrong, please correct me.
   
   If this use for external udf/udaf.  
   >making a ballista-plugin 
   
   +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1065886927


   > #1988
   
   `FYI, the built in udf/udaf have fixed in https://github.com/apache/arrow-datafusion/issues/1988 for ballista, if wrong, please correct me.`
   
   I think there are only two kinds of functions. One is build_ In, one is UDF (customized by users outside the system). https://github.com/apache/arrow-datafusion/issues/1988  fix the function that build_in datafusion but is missing in ballista.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1068870286


   > @gaojun2048 Not sure how soon you will be able to get back to this, but if you want I can work on a PR to do the UDF/UDAF deserialization using the `ExecutionContext`.
   
   That`s great. If so, I will modify this PR, delete the changes related to UDF/UDAF deserialization, and only retain the function of load udf_plugin and register UDF/UDAF to ExecutionContext.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1078964823


   @alamb  CI get stuck . Can you help me retry?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057773271


   > If I understand correctly, you meant, udf is need in DF, but the mechanism to register udf during runtime is only needed in Ballista, correct? If so then I fully agree.
   
   Yes, That's what I mean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057712514

> I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF?

I don’t know if my understanding is wrong. I always think that DF is just a computing library, which cannot be directly deployed in production. Those who use DF will use DF as a dependency of the project and then develop their computing engine based on DF. For example, Ballista is a distributed computing engine developed based on DF. Ballista is a mature computing engine just like Presto/spark. People who use Ballista only need to download and deploy Ballista to their machines to start the ballista service. They rarely care about how Ballista is implemented, so a A udf plugin that supports dynamic loading allows these people to define their own udf functions without modifying Ballista's source code.

```
I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF?
```

Yes, it is important and required for DF to support udf. But for those who use DF, it is not necessary to support the `udf plugin` to dynamically load udf. Because for people who use DF as a dependency to develop their own calculation engine, such as Ballista. Imagine one, if Ballista and DF are not under the same repository, but two separate projects, as a Ballista developer, I need to add my own udf to meet my special analysis needs. What I'm most likely to do is to manage my own udf, such as writing the implementation of udf directly in the Ballista crate. Or add a `udf plugin` to Ballista like this pr, which supports dynamic loading of udfs developed by Ballista users (not Ballista developers). Then I decide when to call the `register_udf` method of the DF to register these udfs in the `ExecutionContext` so that the DF can be used for calculation. Of course, we can directly put the udf plugin in DF, but this feature is not necessary for DF, and doing so will make the `register
_udf` method look redundant, but make the design of DF's udf not easy to understand.

So I would say that the people who need the `udf plugin` the most are those who use Ballista as a full-fledged computing engine, and they just download and deploy Ballista. They don't modify the source code of Ballista and DF because that would mean a better understanding of Ballista and DF. And once the source code of Ballista and DF is modified, it means that they need to invest more cost to merge and build when upgrading Ballista. But now if the user just downloads and deploys Ballista for use, there is no way for the user to register his udf into the DF. The core goal of the udf plugin is to provide an opportunity for those udfs that have not been compiled into the project to be discovered and registered in DF.

Finally, if we define Ballista's goal as a distributed implementation of datafusion, a library that needs to be used as a dependency of other projects, rather than a distributed computing engine (like presto/spark) that can be directly downloaded and deployed and used. It seems to me that the udf plugin is not necessary, because the core goal of the udf plugin is to provide an opportunity for those udfs that have not been compiled into the project to be discovered and registered in DF. Those projects that use ballista as depencency can manage their own udf and decide when to register their udf into DF.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] gaojun2048 commented on pull request #1881: add udf/udaf plugin

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on pull request #1881:
URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1058763901


   Thanks to all the friends who participated in the discussion, from everyone's feedback, the `udf plugin` is a very useful feature for ballista. I will modify the code in the next day or two to migrate the `udf plugin` mod to the ballista-core crate (because `ballista-core/src/serde/logical_plan/from_proto.rs` and `physical_plan/from_proto` .rs` needs to use `udf plugin` to deserialize udf).
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org