Posted to dev@flink.apache.org by Márton Balassi <ba...@gmail.com> on 2015/06/01 16:10:51 UTC

[DISCUSS] Consolidate method naming between the batch and streaming API

Looking at the DataSet and DataStream APIs, Aljoscha and I have come to the
conclusion that a few methods provide the same functionality but are named
differently. These are the following:

   1. rebalance (batch) / distribute (streaming): Rebalances the data sent
   to the downstream operators, thus distributing it equally.
   2. partitionByHash, partitionCustom (batch) / partitionBy (streaming):
   Partitioning has only recently been exposed in the streaming API and is
   not as refined as in the batch one. The streaming partitionBy is actually
   partitionByHash.
   3. union (batch) / merge, connect (streaming): The streaming merge does
   a union of two streams of the same type. Connect is conceptually
   different: it provides a way of sharing state between two streams with
   potentially different types without mapping them to a common type and then
   merging them. This saves latency and an ugly mapping. The latency cost
   can be offset by proper operator chaining, but the ugly mapping would
   remain if we did not have connect.
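
To illustrate the difference between items 1 and 2, here is a minimal
plain-Python sketch. It is a toy model and not Flink code: the function names
only mirror the API methods discussed above, "channels" stand in for the
parallel downstream operator instances, and round-robin assignment is assumed
as the rebalancing strategy.

```python
def rebalance(records, num_channels):
    """Round-robin: the i-th record goes to channel i % num_channels."""
    channels = [[] for _ in range(num_channels)]
    for i, record in enumerate(records):
        channels[i % num_channels].append(record)
    return channels


def partition_custom(records, partitioner, key, num_channels):
    """Send each record to the channel chosen by a user-supplied partitioner."""
    channels = [[] for _ in range(num_channels)]
    for record in records:
        channels[partitioner(key(record), num_channels)].append(record)
    return channels


def partition_by_hash(records, key, num_channels):
    """Hash partitioning is custom partitioning with a hash-based partitioner."""
    return partition_custom(
        records, lambda k, n: hash(k) % n, key, num_channels
    )
```

In this model rebalance ignores the record contents and only evens out the
load, while partition_by_hash guarantees that records with the same key end up
in the same channel.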

To consolidate the naming, I would suggest the following:

   1. Rename streaming distribute to rebalance.
   2. Rename streaming partitionBy to partitionByHash and file a JIRA issue
   for custom partitioning support in streaming.
   3. Rename streaming merge to union; leave streaming connect as it is.
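
The distinction between union and connect in item 3 can be sketched in plain
Python as follows. This is a toy model under stated assumptions, not the Flink
API: ThresholdFilter, on_measurement and on_control are hypothetical names, and
the two inputs are consumed in a fixed alternating order, whereas in a real
streaming system the arrival order is not deterministic.

```python
from itertools import zip_longest

_END = object()  # sentinel marking an exhausted input


def union(first, second):
    """Merge two streams whose elements have the same type into one stream."""
    return list(first) + list(second)


class ThresholdFilter:
    """Two-input operator: int measurements on one input, str commands on the other."""

    def __init__(self):
        self.threshold = 0  # state shared by both inputs

    def on_measurement(self, value):
        # Forward only measurements that reach the current threshold.
        return [value] if value >= self.threshold else []

    def on_control(self, command):
        # Commands such as "min=4" update the shared state and emit nothing.
        self.threshold = int(command.split("=")[1])
        return []


def connect(measurements, controls, operator):
    """Feed two differently typed streams into one stateful operator."""
    output = []
    for control, measurement in zip_longest(controls, measurements, fillvalue=_END):
        if control is not _END:
            output.extend(operator.on_control(control))
        if measurement is not _END:
            output.extend(operator.on_measurement(measurement))
    return output
```

With union alone, the measurements and the commands would first have to be
wrapped in a common type before they could be merged, which is the mapping the
proposal refers to; connect keeps both inputs in their own types.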

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Posted by Szabó Péter <ne...@gmail.com>.

Great proposal! We should use consistent naming for the two APIs.

Peter

2015-06-01 21:11 GMT+02:00 Márton Balassi <ba...@gmail.com>:

> @Fabian: I hope that this is the complete list, correct me if I am wrong. :)
>
> I am opening a small PR with the changes on top of Aljoscha's one that
> exposes the streaming partitioning then.
>
> On Mon, Jun 1, 2015 at 6:01 PM, Stephan Ewen <se...@apache.org> wrote:
>
> > +1
> >
> > Good list and choices, Marton!
> >
> > On Mon, Jun 1, 2015 at 5:45 PM, Fabian Hueske <fh...@gmail.com> wrote:
> >
> > > Thanks for bringing up this point!
> > >
> > > +1 for the renaming.
> > > @Marton: Is this a "complete" list, i.e., did you go through both APIs
> > > or might there be more methods that are semantically identical but named
> > > differently?
> > >
> > > 2015-06-01 17:31 GMT+02:00 Gyula Fóra <gy...@apache.org>:
> > >
> > > > +1 for the changes proposed by Marton (before the release)
> > > >
> > > > Aljoscha Krettek <al...@apache.org> wrote (on Mon, Jun 1, 2015,
> > > > 16:32):
> > > >
> > > > > Yes, these renamings make sense. The partitionBy() is not yet in the
> > > > > master for streaming, though.
> > > > >
> > > > > On Mon, Jun 1, 2015 at 4:10 PM, Márton Balassi <
> > > balassi.marton@gmail.com
> > > > >
> > > > > wrote:
> > > > > > Looking at the DataSet and DataStream APIs we have come to the
> > > > conclusion
> > > > > > with Aljoscha that there are a few methods that although
> providing
> > > the
> > > > > same
> > > > > > functionality are named differently. These are the following:
> > > > > >
> > > > > >    1.  rebalance (batch) / distribute (streaming): Rebalances the
> > > data
> > > > > sent
> > > > > >    to the downstream operators thus equally distributing it.
> > > > > >    2. partitionByHash, partitionCustom (batch) / partitionBy
> > > > (streaming):
> > > > > >    Partitioning has just recently been exposed in the streaming
> API
> > > and
> > > > > is not
> > > > > >    as refined as the batch one. The streaming partitionBy is
> > actually
> > > > > >    partitionByHash.
> > > > > >    3. Union (batch) / merge, connect (streaming): The streaming
> > merge
> > > > > does
> > > > > >    a union of two streams with the same type. Connect is
> > conceptually
> > > > > >    different, it provides a way of sharing state between two
> > streams
> > > > with
> > > > > >    potentially different types without mapping them to a common
> > type
> > > > and
> > > > > then
> > > > > >    merging them. This saves latency and an ugly mapping. The
> former
> > > > > advantage
> > > > > >    can be offset by proper operator chaining, the second one
> would
> > > > > remain if
> > > > > >    we did not have connect.
> > > > > >
> > > > > > To consolidate the naming I would suggest the following:
> > > > > >
> > > > > >    1. Rename streaming distribute to rebalance.
> > > > > >    2. Rename streaming partitionBy to partitionByHash and file
> JIRA
> > > for
> > > > > >    custom partitioning support for streaming.
> > > > > >    3. Rename streaming merge to union, leave streaming connect as
> > it
> > > > is.
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Posted by Márton Balassi <ba...@gmail.com>.

@Fabian: I hope that this is the complete list, correct me if I am wrong. :)

I am opening a small PR with the changes on top of Aljoscha's one that
exposes the streaming partitioning then.

On Mon, Jun 1, 2015 at 6:01 PM, Stephan Ewen <se...@apache.org> wrote:

> +1
>
> Good list and choices, Marton!
>
> On Mon, Jun 1, 2015 at 5:45 PM, Fabian Hueske <fh...@gmail.com> wrote:
>
> > Thanks for bringing up this point!
> >
> > +1 for the renaming.
> > @Marton: Is this a "complete" list, i.e., did you go through both APIs or
> > might there be more methods that are semantically identical but named
> > differently?
> >
> > 2015-06-01 17:31 GMT+02:00 Gyula Fóra <gy...@apache.org>:
> >
> > > +1 for the changes proposed by Marton (before the release)
> > >
> > > Aljoscha Krettek <al...@apache.org> wrote (on Mon, Jun 1, 2015,
> > > 16:32):
> > >
> > > > Yes, these renamings make sense. The partitionBy() is not yet in the
> > > > master for streaming, though.

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Posted by Stephan Ewen <se...@apache.org>.

+1

Good list and choices, Marton!

On Mon, Jun 1, 2015 at 5:45 PM, Fabian Hueske <fh...@gmail.com> wrote:

> Thanks for bringing up this point!
>
> +1 for the renaming.
> @Marton: Is this a "complete" list, i.e., did you go through both APIs or
> might there be more methods that are semantically identical but named
> differently?
>
> 2015-06-01 17:31 GMT+02:00 Gyula Fóra <gy...@apache.org>:
>
> > +1 for the changes proposed by Marton (before the release)
> >
> > Aljoscha Krettek <al...@apache.org> wrote (on Mon, Jun 1, 2015,
> > 16:32):
> >
> > > Yes, these renamings make sense. The partitionBy() is not yet in the
> > > master for streaming, though.

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Posted by Fabian Hueske <fh...@gmail.com>.

Thanks for bringing up this point!

+1 for the renaming.
@Marton: Is this a "complete" list, i.e., did you go through both APIs or
might there be more methods that are semantically identical but named
differently?

2015-06-01 17:31 GMT+02:00 Gyula Fóra <gy...@apache.org>:

> +1 for the changes proposed by Marton (before the release)
>
> Aljoscha Krettek <al...@apache.org> wrote (on Mon, Jun 1, 2015,
> 16:32):
>
> > Yes, these renamings make sense. The partitionBy() is not yet in the
> > master for streaming, though.

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Posted by Gyula Fóra <gy...@apache.org>.

+1 for the changes proposed by Marton (before the release)

Aljoscha Krettek <al...@apache.org> wrote (on Mon, Jun 1, 2015,
16:32):

> Yes, these renamings make sense. The partitionBy() is not yet in the
> master for streaming, though.

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Posted by Aljoscha Krettek <al...@apache.org>.

Yes, these renamings make sense. The partitionBy() is not yet in the
master for streaming, though.
