Posted to dev@calcite.apache.org by Debajyoti Roy <ne...@gmail.com> on 2021/03/04 06:11:44 UTC

mapPartitions

Hi All,

For operators like filter, join, union, aggregate, and window, the
logical RelNode choices are obvious within the
org.apache.calcite.rel.logical package.

However, for more complex applications that require operations like
mapPartitions, flatMapGroupsWithState, etc., what would be some choices for
logical RelNode and relational expression classes in Apache Calcite
(independent of engine)?

What is a good way to model operators that are not traditionally relational
and hence do not exist in Calcite (looking for hooks / extension points /
roadmaps)?

Thanks in advance for any pointers. (Disclaimer: I am new to Calcite.)
Debajyoti Roy

Re: mapPartitions

Posted by Rui Wang <am...@apache.org>.
TableFunction is powerful: its input can be a table. So within the function
you can partition the table and apply a computation over each partition. The
only question is what output you expect. If you expect a per-partition result
(so the TableFunction returns a table in this case), then you will likely
need a partition id column in the output table.
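
A minimal Python sketch of that idea, to make the shape of the output
concrete. It models a table function as a plain function from a list of rows
to a list of rows. The name partitioned_table_function and its parameters are
hypothetical and are not part of Calcite's API; the sample data is invented.

```python
from collections import defaultdict


def partitioned_table_function(rows, key, compute):
    """Hypothetical table function: table in, table out.

    rows    -- input table as a list of dicts
    key     -- function returning the partition id for a row
    compute -- function mapping one partition (list of rows) to output rows
    """
    # Split the input table into partitions.
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)

    output = []
    for partition_id, part in sorted(partitions.items()):
        for out_row in compute(part):
            # Tag every output row so the result stays attributable
            # to the partition that produced it.
            output.append({"partition_id": partition_id, **out_row})
    return output


tweets = [
    {"user": "a", "tweet": "good"},
    {"user": "b", "tweet": "bad"},
    {"user": "a", "tweet": "great"},
]

# One summary row per partition: the number of tweets it contains.
result = partitioned_table_function(
    tweets,
    key=lambda r: r["user"],
    compute=lambda part: [{"tweet_count": len(part)}],
)
```

Here compute may return any number of rows per partition, which is why the
partition id has to travel with each output row.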


-Rui


Re: mapPartitions

Posted by Debajyoti Roy <ne...@gmail.com>.
Yes, there is definitely some similarity to GROUP BY.

What is this used for?
https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/logical/LogicalTableFunctionScan.html
Can I model a mapPartitions T -> U as a LogicalTableFunctionScan?


Re: mapPartitions

Posted by Rui Wang <am...@apache.org>.
I feel like mapPartitions can be implemented as a SELECT + GROUP BY, where
the GROUP BY partitions the data and the per-partition computation is then
handled by the SELECT.
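
A small sketch of this suggestion, run through Python's built-in sqlite3
module so it is self-contained. The table name, column names, and data are
assumptions made up for illustration; SQLite stands in for any SQL engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user_id TEXT, score REAL)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("a", 1.0), ("a", 0.5), ("b", 0.25)],
)

# GROUP BY forms the partitions; the aggregates in the SELECT list are the
# per-partition computation. Each partition collapses to exactly one row.
rows = conn.execute(
    "SELECT user_id, COUNT(*), AVG(score) "
    "FROM tweets GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()
```

Note that GROUP BY returns one row per group, so this form covers only
per-partition computations that reduce a partition to a single row.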

-Rui


Re: mapPartitions

Posted by Debajyoti Roy <ne...@gmail.com>.
Thanks again Julian.

Since mapPartitions is really a specialized map, would it be best to model
it as a SELECT (similar to functions inside an expression)? That is, barring
cases where h > h' and mapPartitions acts like a filter.


Re: mapPartitions

Posted by Julian Hyde <jh...@gmail.com>.
SQL has equivalents of many functional programming idioms:

  * map is SELECT 
  * filter is WHERE
  * flatMap is similar to CROSS APPLY

That said, SQL’s strength is that its operations are not optimized for any
particular physical organization of data (e.g. working on sorted or
partitioned data). mapPartitions belongs to the category of operations that
do depend on physical organization. Of course, a physical implementation of
one of SQL’s logical operators might use mapPartitions.
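
The three correspondences listed above can be sketched in Python, with the
rough SQL equivalent in a comment next to each. The rows and column names are
invented for illustration, and the SQL in the comments is indicative only.

```python
rows = [
    {"name": "ann", "tags": ["x", "y"]},
    {"name": "bob", "tags": []},
    {"name": "cy", "tags": ["z"]},
]

# map ~ SELECT UPPER(name) FROM rows
# Exactly one output row per input row.
mapped = [r["name"].upper() for r in rows]

# filter ~ SELECT * FROM rows WHERE name <> 'bob'
# Zero or one output row per input row.
filtered = [r for r in rows if r["name"] != "bob"]

# flatMap ~ SELECT name, tag FROM rows CROSS APPLY <tags of this row>
# Zero or more output rows per input row; rows with no tags vanish.
flat_mapped = [(r["name"], tag) for r in rows for tag in r["tags"]]
```

None of the three says anything about how the rows are physically laid out,
which is the property the paragraph above describes.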

Julian







Re: mapPartitions

Posted by Debajyoti Roy <ne...@gmail.com>.
Thanks for the responses, adding some more color below.

Spark's API adopted concepts from the functional programming paradigm (map,
filter, flatMap, ...) into data processing. Spark also added several
relational operators like join, union, select, etc. However, there are
certain APIs that are really hard to model in terms of standard relational
operators. Let me take mapPartitions as one example.

mapPartitions( T -> U ):
An input of w columns and h rows can turn into an output with a different
number of columns (w' != w) and a different number of rows (h' != h). Since
processing happens per partition, this API is a good fit for vectorized
operations with a heavy initialization cost, e.g. batch inference.
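
A minimal pure-Python sketch of these semantics, assuming (as in Spark) that
the user function maps an iterator of rows to an iterator of rows. The helper
map_partitions, the function score_partition, and the sample rows are
hypothetical stand-ins and do not call Spark.

```python
def map_partitions(partitions, fn):
    """Apply fn once per partition; fn maps an iterator of rows
    to an iterator of rows."""
    return [list(fn(iter(part))) for part in partitions]


init_calls = []


def score_partition(rows):
    # Stand-in for expensive setup (e.g. loading a model).
    # It runs once per partition, not once per row.
    init_calls.append(1)
    threshold = 0.5
    for tweet_id, text, score in rows:
        # The output has a different width (2 columns instead of 3)
        # and may have fewer rows than the input.
        if score >= threshold:
            yield (tweet_id, "positive")


partitions = [
    [(1, "a", 0.9), (2, "b", 0.1)],
    [(3, "c", 0.7)],
]
result = map_partitions(partitions, score_partition)
```

With three input rows in two partitions, the setup runs twice and two rows
come out, each with two columns instead of three.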

In terms of relational models, mapPartitions can be modeled just like a
function inside an expression operator. However, there are interesting
cases, e.g. when h > h', mapPartitions starts to feel like a filter. Are
there other challenges and opportunities for the planner and optimizer,
given that mapPartitions is definitely NOT like any other function inside an
expression? For example:

SELECT name, address, mapPartitions(id, tweet, '{threshold: 0.5}',
'sentiment_analysis_4', 10000) FROM my_twitter_data...

So what is a better usage example for mapPartitions expressed as SQL? I am
really struggling with that part, and I agree with Julian that it is the key.

Regards,
Debajyoti Roy


Re: mapPartitions

Posted by Julian Hyde <jh...@gmail.com>.
I searched for mapPartitions and flatMapGroupsWithState, and it looks as if you are talking about Apache Spark operations. Can you give some examples of typical queries that would use these operations?

It’s possible that these operations accomplish things that are not possible in the relational model; but I think it’s more likely that these are algorithms that can implement relational operations such as windowed aggregate functions. If you give some examples, we can see whether they can be expressed in SQL or relational algebra.

Julian




Re: mapPartitions

Posted by Rui Wang <am...@apache.org>.
Well, I think the expected approach is to translate other operations into
relational operators yourself ;-)

I think Calcite won't be the place to have extensions for such translation.
The main concern is that those non-relational operations are "non-standard".

-Rui
