Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/03/14 15:35:09 UTC

[GitHub] [arrow-datafusion] comphead opened a new issue, #5600: [DISCUSSION] Add separate crate to cover spark builtin functions

comphead opened a new issue, #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   This discussion is to collect community thoughts on implementing the Spark builtin functions listed at https://spark.apache.org/docs/3.2.0/api/sql/
   
   We increasingly face requests to implement Spark functions, and which system matters more depends on the person's or company's stack: some treat Spark compatibility as more important, while for others it is the opposite and PG compatibility is the priority.
   
   The builtin function lists of Postgres and Spark are, expectedly, not the same. There can also be rarer and worse cases where the function name is the same but the signature and/or return type differs.
   
   The goal of this discussion is to figure out how to organize DF while keeping compatibility with major systems like Spark, Postgres, and perhaps others.
   
   **Describe the solution you'd like**
   @alamb in https://github.com/apache/arrow-datafusion/issues/5568#issuecomment-1466608119 proposed creating an extensible crate for Spark functions, or even a separate subproject, so that PG users have the option to exclude Spark functions.
   
   
   **Describe alternatives you've considered**
   Not doing this
   
   **Additional context**
   Created after https://github.com/apache/arrow-datafusion/issues/5568#issuecomment-1466608119


Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [arrow-datafusion]

Posted by "Omega359 (via GitHub)" <gi...@apache.org>.
Omega359 commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1921858313

   I would like to recommend that each function crate also have a `benches` directory, and that the `Cargo.toml` for those crates have criterion set up as a dev dependency.
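
   For illustration, a bench in such a crate might look like the following minimal criterion skeleton (the file name, function name, and workload are hypothetical; it assumes criterion as a dev dependency and a bench target declared with `harness = false` in the crate's `Cargo.toml`):
   
   ```rust
   // benches/to_unix_timestamp.rs (hypothetical bench file in a function crate)
   use criterion::{criterion_group, criterion_main, Criterion};
   
   fn bench_to_unix_timestamp(c: &mut Criterion) {
       c.bench_function("to_unix_timestamp", |b| {
           // A real bench would prepare the input array once, outside the
           // closure, and invoke the function's implementation inside it.
           b.iter(|| {
               // ... evaluate the function on the prepared input ...
           });
       });
   }
   
   criterion_group!(benches, bench_to_unix_timestamp);
   criterion_main!(benches);
   ```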


[GitHub] [arrow-datafusion] comphead commented on issue #5600: [DISCUSSION] Add separate crate to cover spark builtin functions

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1470977590

   @alamb you are right, we can't make builtin functions optional; that would be more than unexpected. 
   
   The wonderful example #5568 (implement a `to_unixtime` function) shows how the same functionality is achieved differently in each engine:
   Spark: `to_unix_timestamp(now())`
   PG: `extract(epoch from now())`
   Trino: `select to_unixtime(now())`
   
   I'm not even sure we need extra crates now. My personal feeling is that Trino is trying to adopt all possible syntaxes. 
   Perhaps we go the same way and just keep extending the list of builtin functions, and in case of conflicts (if any) the PG signature/return type would be dominant.
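
   As a sketch of what "keep extending the list" could look like, the same implementation might be registered under both the Spark and the Trino name. This is illustrative only; it assumes a recent DataFusion API where `create_udf` takes the return type by value (older releases took an `Arc<DataType>`), and the function body is left as a placeholder:
   
   ```rust
   use std::sync::Arc;
   use datafusion::arrow::datatypes::{DataType, TimeUnit};
   use datafusion::logical_expr::{ScalarFunctionImplementation, Volatility};
   use datafusion::prelude::{create_udf, SessionContext};
   
   fn register_unix_timestamp_names(ctx: &SessionContext) {
       // One shared implementation: seconds since the Unix epoch.
       let fun: ScalarFunctionImplementation = Arc::new(|_args| {
           todo!("cast the timestamp argument to Int64 epoch seconds")
       });
       // Register the same implementation under the Spark and Trino names.
       for name in ["to_unix_timestamp", "to_unixtime"] {
           ctx.register_udf(create_udf(
               name,
               vec![DataType::Timestamp(TimeUnit::Nanosecond, None)],
               DataType::Int64,
               Volatility::Immutable,
               Arc::clone(&fun),
           ));
       }
   }
   ```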
   


Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1922233428

   Here is a proposal of how to extract / organize the functions: https://github.com/apache/arrow-datafusion/issues/9100


Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [arrow-datafusion]

Posted by "Omega359 (via GitHub)" <gi...@apache.org>.
Omega359 commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1921882744

   > But I am not sure how well the vision and reality match (as in does anyone use datafusion without them)
   
   Well, I can say that for my current use case, crypto and array functions are something I'm unlikely to need. Everyone's use case is different, of course, but it's a data point.


[GitHub] [arrow-datafusion] alamb commented on issue #5600: [DISCUSSION] Add separate crate to cover spark builtin functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1469703505

   I also agree (unsurprisingly) that a separate crate for libraries of functions would be valuable. 
   
   Overall, my ideal would be keeping DataFusion set up as a core plus a bunch of extension points, such that different sets of functionality can be assembled for whatever use case people want.
   
   For example, there are already a bunch of functions that in theory should be "optional" via feature flags:
   
   https://github.com/apache/arrow-datafusion/blob/6bfede0e1e1a09bd06ea0166e01c0ad4834a1721/datafusion/physical-expr/Cargo.toml#L35-L42
   
   But I am not sure how well the vision and reality match (as in does anyone use datafusion without them)
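
   As a toy illustration of that shape (the module layout and registration function below are made up for the example, not DataFusion's actual code), a whole function library can be compiled out behind a cargo feature:
   
   ```rust
   // A function library that disappears entirely when its feature is off.
   #[cfg(feature = "crypto_expressions")]
   mod crypto {
       pub fn register() {
           // register md5 / sha2 / ... implementations here
       }
   }
   
   /// Registers every function library that was compiled in.
   pub fn register_all() {
       // When the feature is disabled, both this call and the module
       // above are removed at compile time, so users who never need
       // crypto functions don't pay for the extra dependencies.
       #[cfg(feature = "crypto_expressions")]
       crypto::register();
   }
   ```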


[GitHub] [arrow-datafusion] yjshen commented on issue #5600: [DISCUSSION] Add separate crate to cover spark builtin functions

Posted by "yjshen (via GitHub)" <gi...@apache.org>.
yjshen commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1468994048

   Separating Spark functions into a special crate seems reasonable but supporting Spark UDFs requires significant effort. This is because many UDFs in Spark are designed to be compatible with Hive and handle corner cases differently than other databases like PG. These corner cases increase the workload of integrating Spark/Hive with DataFusion.
   
   When developing Blaze, we must first compare the implementations of both engines, or port their tests, to ensure identical semantics before handing a UDF off to DataFusion for execution.
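
   For a concrete flavor of such a corner case: Spark (with ANSI mode off) evaluates integer division by zero to NULL, while PG raises an error. A ported parity test might pin the Spark-observed result and run it against DataFusion, along these lines (a hypothetical sketch; whether stock DataFusion passes it is exactly what such a test surfaces):
   
   ```rust
   use datafusion::arrow::array::Array;
   use datafusion::prelude::SessionContext;
   
   #[tokio::test]
   async fn division_by_zero_matches_spark() -> datafusion::error::Result<()> {
       let ctx = SessionContext::new();
       // Spark (non-ANSI) returns NULL for 1 / 0; PG errors out instead.
       let batches = ctx.sql("SELECT 1 / 0 AS v").await?.collect().await?;
       assert!(batches[0].column(0).is_null(0), "expected Spark semantics: NULL");
       Ok(())
   }
   ```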


[GitHub] [arrow-datafusion] comphead commented on issue #5600: [DISCUSSION] Add separate crate to cover spark builtin functions

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1483362561

   @alamb do we need more discussion?
   We can try to adopt syntaxes from major engines like Spark, PG, etc.; this would make migration to DF easier.
   The downside, however, is that supporting the same functionality in different implementations leads to code growth, and probably more time spent on maintenance.


Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1921160267

   BTW, https://github.com/apache/arrow-datafusion/issues/8045 tracks the work to break apart the function libraries into smaller pieces. Once this is done, I think making a crate of Spark-compatible functions will be pretty straightforward.


[GitHub] [arrow-datafusion] andygrove commented on issue #5600: [DISCUSSION] Add separate crate to cover spark builtin functions

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1468348803

   It seems reasonable to me to have separate Spark and PG implementations of functions. This would make it more attractive for users wanting to migrate from Spark to DataFusion.
   
   I like the idea of having a new `datafusion-spark` crate with anything related to Spark compatibility and integration (such as Spark SQL dialect and Spark built-in functions). I wonder if @yjshen has any thoughts on this since he built a Spark + DataFusion integration in Blaze.


[GitHub] [arrow-datafusion] alamb commented on issue #5600: [DISCUSSION] Add separate crate to cover spark builtin functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5600:
URL: https://github.com/apache/arrow-datafusion/issues/5600#issuecomment-1483798905

   So my opinion on this matter is that ideally DataFusion should be an extensible engine, so that people using it can pick whatever parts they want to use and replace what they don't with their own implementations.
   
   DataFusion includes a bunch of pre-built functionality (like the mostly PG-compatible functions, parquet / json / etc. readers, a memory catalog, etc.) in order to get people started, so they can focus on extending whatever is most important for their use case.
   
   So I think it would be great to have a separate crate with "spark compatible functions" (and maybe the same could be done for a "postgres compatible functions" crate). I think the BuiltInFunction thing is not required long term, and it would be better if all functions could behave the same as user defined functions.
   
   Then the question becomes "where is that crate's code stored". It is probably fine for it to live in the main datafusion repo initially, and if it gets too unwieldy we could break it into its own repo or something.
   
   But the ability to customize the available functions is, I think, key.
   
   

