Posted to dev@spark.apache.org by Petar Zečević <pe...@gmail.com> on 2018/11/20 10:10:57 UTC

Array indexing functions

Hi,
I implemented two array functions that are useful to us and I wonder if you think it would be useful to add them to the distribution. The functions are used for filtering arrays based on indexes:

array_allpositions (named after array_position) - takes an array column and a value and returns an array of the indexes of the elements equal to that value.

array_select - takes an array column and an array of indexes and returns a subset of the array based on the provided indexes.
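
To make the intended semantics concrete, here is a minimal plain-Python sketch of the two functions as described above. It is not the proposed Spark implementation. It assumes 1-based indexes (by analogy with array_position), a null result for null input, and a null element for an out-of-range index; none of these details are stated in this message.

```python
from typing import Any, List, Optional, Sequence


def array_allpositions(arr: Optional[Sequence[Any]], value: Any) -> Optional[List[int]]:
    """Return the indexes of all elements of arr equal to value.

    Assumption: indexes are 1-based, by analogy with array_position.
    Assumption: a null (None) array or value gives a null result.
    """
    if arr is None or value is None:
        return None
    return [i for i, x in enumerate(arr, start=1) if x == value]


def array_select(arr: Optional[Sequence[Any]],
                 indexes: Optional[Sequence[int]]) -> Optional[List[Any]]:
    """Return the elements of arr found at the given indexes, in index order.

    Assumption: indexes are 1-based.
    Assumption: an out-of-range index yields None for that slot.
    """
    if arr is None or indexes is None:
        return None
    return [arr[i - 1] if 1 <= i <= len(arr) else None for i in indexes]


# Filter one array by the positions at which another array matches a value.
flags = ["x", "y", "x", "z"]
values = [10, 20, 30, 40]
positions = array_allpositions(flags, "x")
selected = array_select(values, positions)
print(positions, selected)
```

Used together, the two functions filter one array by the positions at which a second array matches a value, which is the index-based filtering this message refers to.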

If you agree with this addition I can create a JIRA ticket and a pull request.

-- 
Petar Zečević

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Array indexing functions

Posted by Petar Zečević <pe...@gmail.com>.

Hi,
yes, these are implemented just like native functions in sql.functions, with code generation, so whole-stage codegen should apply.

Regarding plan optimization, I am not sure how these would be taken into account in the existing rules, except maybe for filter pushdown.

Petar


Alessandro Solimando <al...@gmail.com> writes:

> Hi Petar,
> I have implemented similar functions a few times through ad-hoc UDFs in the past, so +1 from me.
>
> Can you elaborate a bit more on how you implemented those functions in practice? Are they UDFs or "native" functions like those in the sql.functions package?
>
> I am asking because I wonder if/how Catalyst can take those functions into account for producing more optimized plans, maybe you or someone else in the list can clarify this.
>
> Best regards,
> Alessandro

Re: Array indexing functions

Posted by Alessandro Solimando <al...@gmail.com>.

Hi Petar,
I have implemented similar functions a few times through ad-hoc UDFs in the
past, so +1 from me.

Can you elaborate a bit more on how you implemented those functions in
practice? Are they UDFs or "native" functions like those in the
sql.functions package?

I am asking because I wonder if/how Catalyst can take those functions into
account for producing more optimized plans, maybe you or someone else in
the list can clarify this.

Best regards,
Alessandro


Re: Array indexing functions

Posted by Petar Zečević <pe...@gmail.com>.

Hi,
as far as I know these are not standard functions.

Writing UDFs is easy, but only in Java and Scala are they as efficient as built-in functions. In Python, data still has to be moved and converted to and from Arrow, and that makes a difference in performance. That was the motivation behind these two functions.

I'd object to the rule of not implementing functions not found anywhere else, but there seems to be a consensus around this, so I'll just close the JIRA.

Thanks,
Petar


Sean Owen <sr...@gmail.com> writes:

> Is it standard SQL or implemented in Hive? Because UDFs are so relatively easy in Spark we don't need tons of builtins like an RDBMS does. 

Re: Array indexing functions

Posted by Sean Owen <sr...@gmail.com>.

Is it standard SQL or implemented in Hive? Because UDFs are so relatively
easy in Spark we don't need tons of builtins like an RDBMS does.

On Tue, Feb 5, 2019, 7:43 AM Petar Zečević <petar.zecevic@gmail.com> wrote:

>
> Hi everybody,
> I finally created the JIRA ticket and the pull request for the two array
> indexing functions:
> https://issues.apache.org/jira/browse/SPARK-26826
>
> Can any of the committers please check it out?
>
> Thanks,
> Petar

Re: Array indexing functions

Posted by Petar Zečević <pe...@gmail.com>.

Hi everybody,
I finally created the JIRA ticket and the pull request for the two array indexing functions:
https://issues.apache.org/jira/browse/SPARK-26826

Can any of the committers please check it out?

Thanks,
Petar

