You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "epsio-banay (via GitHub)" <gi...@apache.org> on 2023/06/08 10:26:25 UTC

[GitHub] [arrow-datafusion] epsio-banay opened a new pull request, #6601: Prioritize UDF over scalar built-in function in case of function name…

epsio-banay opened a new pull request, #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601

   # Which issue does this PR close?
   
   Closes #6588.
   
   # Rationale for this change
   This PR lets a user override the built-in implementation of scalar functions (such as abs, floor, etc) using the ScalarUdf feature.
   We are currently only using the logical planner of Datafusion, this feature would allow us and others to have more control over the built-in scalar functions.
   It should also be inline with Datafusion's intention to be customizable.
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   Described in user-facing changes below.
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   Yes
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   Yes.
   
   a new field called "prioritize_udf" inside SqlParserOptions and ParserOptions.
   It controls whether DataFusion should prioritize UDF over built-in scalar function in case of function name conflict.
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#issuecomment-1587640759

   Thanks again @epsio-banay 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] epsio-banay commented on a diff in pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "epsio-banay (via GitHub)" <gi...@apache.org>.

epsio-banay commented on code in PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#discussion_r1225772209


##########
datafusion/common/src/config.rs:
##########
@@ -191,6 +191,8 @@ config_namespace! {
         /// MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, MsSQL, ClickHouse, BigQuery, and Ansi.
         pub dialect: String, default = "generic".to_string()
 
+        /// Should DataFusion prioritize UDF over built-in scalar function in case of function name conflict.

Review Comment:
   Done.
   Thanks for responding so fast, this PR got much more simple.
   I'll also edit the PR description - remove the described user facing changes 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] epsio-banay commented on a diff in pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "epsio-banay (via GitHub)" <gi...@apache.org>.

epsio-banay commented on code in PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#discussion_r1225772209


##########
datafusion/common/src/config.rs:
##########
@@ -191,6 +191,8 @@ config_namespace! {
         /// MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, MsSQL, ClickHouse, BigQuery, and Ansi.
         pub dialect: String, default = "generic".to_string()
 
+        /// Should DataFusion prioritize UDF over built-in scalar function in case of function name conflict.

Review Comment:
   Done.
   Thanks for responding so fast, this PR got much more simple.
   I'll also edit the PR description - remove the described config flag



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] epsio-banay commented on pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "epsio-banay (via GitHub)" <gi...@apache.org>.

epsio-banay commented on PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#issuecomment-1586891902

   > I think this branch has a logical merge conflict with some of the changes on master. If I applied this diff locally the tests passed:
   > 
   > ```diff
   > diff --git a/datafusion/core/tests/sql/udf.rs b/datafusion/core/tests/sql/udf.rs
   > index 0ecd5d0fd..eb76c3417 100644
   > --- a/datafusion/core/tests/sql/udf.rs
   > +++ b/datafusion/core/tests/sql/udf.rs
   > @@ -185,7 +185,7 @@ async fn scalar_udf_override_built_in_scalar_function() -> Result<()> {
   >  
   >      let batch = RecordBatch::try_new(
   >          Arc::new(schema.clone()),
   > -        vec![Arc::new(Int32Array::from(vec![-100]))],
   > +        vec![Arc::new(Int32Array::from_slice([-100]))],
   >      )?;
   >      let ctx = SessionContext::new();
   >  
   > ```
   
   rebased and fixed:)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] epsio-banay commented on a diff in pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "epsio-banay (via GitHub)" <gi...@apache.org>.

epsio-banay commented on code in PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#discussion_r1225772235


##########
datafusion/sql/tests/integration_test.rs:
##########
@@ -125,12 +127,78 @@ fn parse_ident_normalization() {
             ParserOptions {
                 parse_float_as_decimal: false,
                 enable_ident_normalization,
+                prioritize_udf: false,
             },
         );
         assert_eq!(expected, format!("{plan:?}"));
     }
 }
 
+#[test]
+fn parse_prioritize_udf() {
+    // Create a UDF that has the same name as a builtin function (abs)
+    let return_type: ReturnTypeFunction =
+        Arc::new(move |_| Ok(Arc::new(DataType::Int64)));
+    let fun: ScalarFunctionImplementation =
+        Arc::new(move |_| Ok(ColumnarValue::Scalar(ScalarValue::Int64(Some(1)))));
+    let scalar_udf = Arc::new(ScalarUDF::new(
+        "abs",
+        &Signature::uniform(1, vec![DataType::Float32], Volatility::Stable),
+        &return_type,
+        &fun,
+    ));
+
+    // Test data
+    let test_data = [

Review Comment:
   Done in scalar_udf_override_built_in_scalar_function



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb merged PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.

ozankabak commented on code in PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#discussion_r1225504017


##########
datafusion/common/src/config.rs:
##########
@@ -191,6 +191,8 @@ config_namespace! {
         /// MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, MsSQL, ClickHouse, BigQuery, and Ansi.
         pub dialect: String, default = "generic".to_string()
 
+        /// Should DataFusion prioritize UDF over built-in scalar function in case of function name conflict.

Review Comment:
   Agreed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#discussion_r1223089700


##########
datafusion/sql/tests/integration_test.rs:
##########
@@ -125,12 +127,78 @@ fn parse_ident_normalization() {
             ParserOptions {
                 parse_float_as_decimal: false,
                 enable_ident_normalization,
+                prioritize_udf: false,
             },
         );
         assert_eq!(expected, format!("{plan:?}"));
     }
 }
 
+#[test]
+fn parse_prioritize_udf() {
+    // Create a UDF that has the same name as a builtin function (abs)
+    let return_type: ReturnTypeFunction =
+        Arc::new(move |_| Ok(Arc::new(DataType::Int64)));
+    let fun: ScalarFunctionImplementation =
+        Arc::new(move |_| Ok(ColumnarValue::Scalar(ScalarValue::Int64(Some(1)))));
+    let scalar_udf = Arc::new(ScalarUDF::new(
+        "abs",
+        &Signature::uniform(1, vec![DataType::Float32], Volatility::Stable),
+        &return_type,
+        &fun,
+    ));
+
+    // Test data
+    let test_data = [

Review Comment:
   Rather than checking the plan here, what do you think about simply running the query and verifying the output?
   
   For example, you could run the query
   
   ```
   SELECT ABS(-100)
   ```
   
   And if it returns `1` you know the UDF is called, and if it returns `100` you know the built in was called



##########
datafusion/common/src/config.rs:
##########
@@ -191,6 +191,8 @@ config_namespace! {
         /// MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, MsSQL, ClickHouse, BigQuery, and Ansi.
         pub dialect: String, default = "generic".to_string()
 
+        /// Should DataFusion prioritize UDF over built-in scalar function in case of function name conflict.

Review Comment:
   I actually don't think a config flag is needed -- if a UDF is registered with the same name as a built in, it will not be callable (as you note in the issue / PR description). Therefore I think it is safe to assume that the if a user registers a function with the same name as a built in they want the new function to have precidence



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#issuecomment-1584589943

   Marking as draft as this PR has feedback and is awaiting a response (I am trying to keep the list of PRs needing review clear)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #6601: Prioritize UDF over scalar built-in function in case of function name…

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #6601:
URL: https://github.com/apache/arrow-datafusion/pull/6601#issuecomment-1586356543

   I think this branch has a logical merge conflict with  some of the changes on master. If I applied this diff locally the tests passed:
   
   ```diff
   diff --git a/datafusion/core/tests/sql/udf.rs b/datafusion/core/tests/sql/udf.rs
   index 0ecd5d0fd..eb76c3417 100644
   --- a/datafusion/core/tests/sql/udf.rs
   +++ b/datafusion/core/tests/sql/udf.rs
   @@ -185,7 +185,7 @@ async fn scalar_udf_override_built_in_scalar_function() -> Result<()> {
    
        let batch = RecordBatch::try_new(
            Arc::new(schema.clone()),
   -        vec![Arc::new(Int32Array::from(vec![-100]))],
   +        vec![Arc::new(Int32Array::from_slice([-100]))],
        )?;
        let ctx = SessionContext::new();
    
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org