Posted to dev@beam.apache.org by Stas Levin <st...@gmail.com> on 2017/01/08 14:06:04 UTC

splitIntoBundles vs. generateInitialSplits

Hi,

A short terminology question regarding "bundle", and
particularly splitIntoBundles vs. generateInitialSplits.

In *BoundedSource* we have:
List<? extends BoundedSource<T>> *splitIntoBundles*(...)

In *UnboundedSource* we have:
List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
*generateInitialSplits*(...)

I was wondering if the names were intentionally made different, i.e. "into
bundles" vs "into splits"?
In a way these two methods carry out a very similar task. Would it be
reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
(strikethrough due to "initial" not being applicable in the case of bounded
sources)

Regards,
Stas

Re: splitIntoBundles vs. generateInitialSplits

Posted by Etienne Chauchot <ec...@gmail.com>.

Hi all,

Just to report back to the mailing list: we reached a consensus in the PR on
using the name "split()", so I'll update the PR.

Best

Etienne


On 13/04/2017 at 13:35, Jean-Baptiste Onofré wrote:
> Hi,
>
> Thanks for that. I'm going to review it, and it's not a big deal for the
> IO PRs (easy to rebase and update).
>
> I would have preferred simply split(), but I'm OK with splitIntoSubSources().
>
> Regards
> JB
>

Re: splitIntoBundles vs. generateInitialSplits

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi,

Thanks for that. I'm going to review it, and it's not a big deal for the IO PRs
(easy to rebase and update).

I would have preferred simply split(), but I'm OK with splitIntoSubSources().

Regards
JB

On 04/13/2017 01:31 PM, Etienne Chauchot wrote:
> Hi all,
>
> It seems that we have an agreement on the name, and as the stable release date
> is coming soon, I opened two simple PRs to rename both methods to
> splitIntoSubSources as suggested below. One is for the Beam code base and the
> other is for the website. I did not change the Python SDK.
>
> https://github.com/apache/beam/pull/2523
>
> https://github.com/apache/beam-site/pull/210
>
> If these PRs are merged, sorry for the other open PRs (including IO PRs) that
> use the old names :)
>
> Best
>
> Etienne
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: splitIntoBundles vs. generateInitialSplits

Posted by Etienne Chauchot <ec...@gmail.com>.

Hi all,

It seems that we have an agreement on the name, and as the stable
release date is coming soon, I opened two simple PRs to rename both methods
to splitIntoSubSources as suggested below. One is for the Beam code
base and the other is for the website. I did not change the Python SDK.

https://github.com/apache/beam/pull/2523

https://github.com/apache/beam-site/pull/210

If these PRs are merged, sorry for the other open PRs (including IO
PRs) that use the old names :)

Best

Etienne

On 11/01/2017 at 16:26, Stas Levin wrote:
> Eugene, that makes a lot of sense to me.
>
> Do you think it's worth filing a Jira ticket?
>
> On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> I agree that the methods are named somewhat confusingly, and ideally would
> be named the same. Both of the names miss some aspect of the underlying
> concept.
>
> The underlying concept is to split the source into smaller sub-sources which,
> if you read all of them, would have read the same data as the original one.
> "splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
> false in streaming, and only partially true in batch (I'm talking about the
> Dataflow runner).
> "generateInitialSplits" assumes that this splitting happens only
> "initially", i.e. at job startup time. This is currently true in practice
> for all existing runners, but it doesn't have to be - we could conceivably
> call it again at some point during the job if we see that some of the
> sub-sources are still too large.
>
> The analogous method in Splittable DoFn (
> https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
> there are no restrictions in the source API, only sources.
>
> Perhaps both should be called simply "split", or "splitIntoSubSources".
>
> On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <st...@gmail.com> wrote:
>
>> Definitely seems like the formatting got lost in translation, sorry about
>> that :)
>>
>> I guess both cases (methods) create splits, which are essentially a list
>> of bounded/unbounded source instances, each responsible for reading
>> certain segments (physical or otherwise) of the data.
>>
>> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <si...@google.com.invalid>
>> wrote:
>>
>>> hi!
>>>
>>> I think your strikethrough got lost due to this being a text-only email
>>> list. To make sure, I think you're asking the following:
>>> "would it be reasonable to think of splitIntoBundles as generateSplits?"
>>> (ie, you strikethrough'd Initial)
>>>
>>> They are very similar and I definitely also think of them as occupying
>>> the same niche. I'll let someone else who was around for naming discuss
>>> whether it was intentional or not. Conceptually, the way that bounded vs
>>> streaming are handled means that they are doing slightly different
>>> things: a bounded source is really kind of creating physical chunks of
>>> the data, whereas the streaming source is creating conceptual divisions
>>> of the data that will be used later. I'm not sure that's worth the
>>> confusion caused by the differences.
>>>
>>> One thing to clarify - splitIntoBundles does have an "Initial" aspect to
>>> it. I don't believe there is a publicly defined/written down order the
>>> Sources & Reader methods are called in, but a runner trying to get
>>> efficiency would be able to use splitIntoBundles during job startup to
>>> be able to split up the work before creating readers rather than after
>>> creating readers and waiting to use splitAtFraction.
>>>
>>> S
>>>
>>> On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <st...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> A short terminology question regarding "bundle", and
>>>> particularly splitIntoBundles vs. generateInitialSplits.
>>>>
>>>> In *BoundedSource* we have:
>>>> List<? extends BoundedSource<T>> *splitIntoBundles*(...)
>>>>
>>>> In *UnboundedSource* we have:
>>>> List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
>>>> *generateInitialSplits*(...)
>>>>
>>>> I was wondering if the names were intentionally made different, i.e.
>>>> "into bundles" vs "into splits"?
>>>> In a way these two methods carry out a very similar task. Would it be
>>>> reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
>>>> (strikethrough due to "initial" not being applicable in the case of
>>>> bounded sources)
>>>>
>>>> Regards,
>>>> Stas
>>>>
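
For readers of the archive, the contract Eugene describes in the quoted
history above (splitting a source yields sub-sources that, read together,
return the same data as the original) can be sketched roughly as follows.
This is only a toy over a range of longs, not the Beam Source API:
RangeSource, its split(long) method, the desiredSize parameter and readAll
are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

  /** Toy bounded source (hypothetical, not a Beam class): yields the longs in [start, end). */
  static final class RangeSource {
    final long start; // inclusive
    final long end;   // exclusive

    RangeSource(long start, long end) {
      if (start > end) {
        throw new IllegalArgumentException("start must be <= end");
      }
      this.start = start;
      this.end = end;
    }

    /** Splits this source into consecutive sub-sources of at most desiredSize elements each. */
    List<RangeSource> split(long desiredSize) {
      if (desiredSize <= 0) {
        throw new IllegalArgumentException("desiredSize must be positive");
      }
      List<RangeSource> subSources = new ArrayList<>();
      long s = start;
      while (s < end) {
        // Clamp the last sub-source to the end of the range.
        long next = (end - s <= desiredSize) ? end : s + desiredSize;
        subSources.add(new RangeSource(s, next));
        s = next;
      }
      return subSources;
    }

    /** Reads every element of this source. */
    List<Long> read() {
      List<Long> out = new ArrayList<>();
      for (long i = start; i < end; i++) {
        out.add(i);
      }
      return out;
    }
  }

  /** Reads all sub-sources in order and concatenates what they return. */
  static List<Long> readAll(List<RangeSource> sources) {
    List<Long> out = new ArrayList<>();
    for (RangeSource source : sources) {
      out.addAll(source.read());
    }
    return out;
  }

  public static void main(String[] args) {
    RangeSource original = new RangeSource(0, 10);
    List<RangeSource> subSources = original.split(4); // [0,4), [4,8), [8,10)
    // The invariant from the thread: the sub-sources together read what the original reads.
    System.out.println(readAll(subSources).equals(original.read())); // prints "true"
  }
}
```

Nothing in this sketch says when split is called or how sub-sources map to
bundles, which is the reason given in the thread for preferring a neutral
name over "splitIntoBundles" or "generateInitialSplits".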

Re: splitIntoBundles vs. generateInitialSplits

Posted by Stas Levin <st...@apache.org>.

Indeed, take a look at https://issues.apache.org/jira/browse/BEAM-1272.

On Tue, Mar 21, 2017 at 8:20 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> It makes sense.
>
> Regards
> JB
>
> On 03/20/2017 11:14 PM, Ismaël Mejía wrote:
> > This is a forgotten one. Stas, did you create a JIRA about this one? I
> > think this change should also be tagged as First version release,
> > because this is an API change and can break stuff if we do it later
> > on.
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: splitIntoBundles vs. generateInitialSplits

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

It makes sense.

Regards
JB

On 03/20/2017 11:14 PM, Ismaël Mejía wrote:
> This is a forgotten one. Stas, did you create a JIRA about this one? I
> think this change should also be tagged as First version release,
> because this is an API change and can break stuff if we do it later
> on.
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: splitIntoBundles vs. generateInitialSplits

Posted by Ismaël Mejía <ie...@apache.org>.

This is a forgotten one. Stas, did you create a JIRA about this one? I
think this change should also be tagged as First version release,
because this is an API change and can break stuff if we do it later
on.

On Wed, Jan 11, 2017 at 4:30 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> Hi Eugene and Stas,
>
> Just back from a couple of days off and jumping into this discussion.
>
> I agree with Stas: it's worth creating a Jira about that. The only
> "semantic" difference is unbounded vs bounded source, but the behavior is
> the same.
>
> Regards
> JB
>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

Re: splitIntoBundles vs. generateInitialSplits

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Eugene and Stas,

Just back from a couple of days off and jumping into this discussion.

I agree with Stas: it's worth creating a Jira about that. The only
"semantic" difference is unbounded vs bounded source, but the behavior
is the same.

Regards
JB

On 01/11/2017 04:26 PM, Stas Levin wrote:
> Eugene, that makes a lot of sense to me.
>
> Do you think it's worth filing a Jira ticket?
>
> On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> I agree that the methods are named somewhat confusingly, and ideally would
> be named the same. Both of the names miss some aspect of the underlying
> concept.
>
> The underlying concept is to split the source into smaller sub-sources which,
> if you read all of them, would have read the same data as the original one.
> "splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
> false in streaming, and only partially true in batch (I'm talking about the
> Dataflow runner).
> "generateInitialSplits" assumes that this splitting happens only
> "initially", i.e. at job startup time. This is currently true in practice
> for all existing runners, but it doesn't have to be - we could conceivably
> call it again at some point during the job if we see that some of the
> sub-sources are still too large.
>
> The analogous method in Splittable DoFn (
> https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
> there are no restrictions in source API, only sources.
>
> Perhaps both should be called simply "split", or "splitIntoSubSources".
>
> On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <st...@gmail.com> wrote:
>
>> Definitely seems like the formatting got lost in translation, sorry about
>> that :)
>>
>> I guess both cases (methods) create splits, which are essentially a list of
>> bounded/unbounded source instances, each responsible for reading certain
>> segments (physical or otherwise) of the data.
>>
>> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <si...@google.com.invalid>
>> wrote:
>>
>>> hi!
>>>
>>> I think your strikethrough got lost due to this being a text-only email
>>> list. To make sure, I think you're asking the following:
>>> " would it be reasonable to think of splitIntoBundles as generateSplits? "
>>> (ie, you strikethrough'd Initial)
>>>
>>> They are very similar and I definitely also think of them as occupying the
>>> same niche. I'll let someone else who was around for naming discuss whether
>>> it was intentional or not. Conceptually, the way that bounded vs streaming
>>> are handled means that they are doing slightly different things: a bounded
>>> source is really kind of creating physical chunks of the data, whereas the
>>> streaming source is creating conceptual divisions of the data that will be
>>> used later. I'm not sure that's worth the confusion caused by the
>>> differences.
>>>
>>> One thing to clarify - splitIntoBundles does have an "Initial" aspect to
>>> it. I don't believe there is a publicly defined/written down order the
>>> Sources & Reader methods are called in, but a runner trying to get
>>> efficiency would be able to use splitIntoBundles during job startup to be
>>> able to split up the work before creating readers rather than after
>>> creating readers and waiting to use splitAtFraction.
>>>
>>> S
>>>
>>> On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <st...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> A short terminology question regarding "bundle", and
>>>> particularly splitIntoBundles vs. generateInitialSplits.
>>>>
>>>> In *BoundedSource* we have:
>>>> List<? extends BoundedSource<T>> *splitIntoBundles*(...)
>>>>
>>>> In *UnboundedSource* we have:
>>>> List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
>>>> *generateInitialSplits*(...)
>>>>
>>>> I was wondering if the names were intentionally made different, i.e. "into
>>>> bundles" vs "into splits"?
>>>> In a way these two methods carry out a very similar task, would it be
>>>> reasonable to think of *splitIntoBundles *as *generate*Initial*Splits? *
>>>> (strikethrough due to "initial" not being applicable in the case of bounded
>>>> sources)
>>>>
>>>> Regards,
>>>> Stas
>>>>
>>>
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: splitIntoBundles vs. generateInitialSplits

Posted by Stas Levin <st...@gmail.com>.

Eugene, that makes a lot of sense to me.

Do you think it's worth filing a Jira ticket?

On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

I agree that the methods are named somewhat confusingly, and ideally would
be named the same. Both of the names miss some aspect of the underlying
concept.

The underlying concept is to split the source into smaller sub-sources which,
if you read all of them, would have read the same data as the original one.
"splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
false in streaming, and only partially true in batch (I'm talking about the
Dataflow runner).
"generateInitialSplits" assumes that this splitting happens only
"initially", i.e. at job startup time. This is currently true in practice
for all existing runners, but it doesn't have to be - we could conceivably
call it again at some point during the job if we see that some of the
sub-sources are still too large.

The analogous method in Splittable DoFn (
https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
there are no restrictions in source API, only sources.

Perhaps both should be called simply "split", or "splitIntoSubSources".

On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <st...@gmail.com> wrote:

> Definitely seems like the formatting got lost in translation, sorry about
> that :)
>
> I guess both cases (methods) create splits, which are essentially a list of
> bounded/unbounded source instances, each responsible for reading certain
> segments (physical or otherwise) of the data.
>
> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <si...@google.com.invalid>
> wrote:
>
> > hi!
> >
> > I think your strikethrough got lost due to this being a text-only email
> > list. To make sure, I think you're asking the following:
> > " would it be reasonable to think of splitIntoBundles as generateSplits? "
> > (ie, you strikethrough'd Initial)
> >
> > They are very similar and I definitely also think of them as occupying the
> > same niche. I'll let someone else who was around for naming discuss whether
> > it was intentional or not. Conceptually, the way that bounded vs streaming
> > are handled means that they are doing slightly different things: a bounded
> > source is really kind of creating physical chunks of the data, whereas the
> > streaming source is creating conceptual divisions of the data that will be
> > used later. I'm not sure that's worth the confusion caused by the
> > differences.
> >
> > One thing to clarify - splitIntoBundles does have an "Initial" aspect to
> > it. I don't believe there is a publicly defined/written down order the
> > Sources & Reader methods are called in, but a runner trying to get
> > efficiency would be able to use splitIntoBundles during job startup to be
> > able to split up the work before creating readers rather than after
> > creating readers and waiting to use splitAtFraction.
> >
> > S
> >
> > On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <st...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > A short terminology question regarding "bundle", and
> > > particularly splitIntoBundles vs. generateInitialSplits.
> > >
> > > In *BoundedSource* we have:
> > > List<? extends BoundedSource<T>> *splitIntoBundles*(...)
> > >
> > > In *UnboundedSource* we have:
> > > List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> > > *generateInitialSplits*(...)
> > >
> > > I was wondering if the names were intentionally made different, i.e. "into
> > > bundles" vs "into splits"?
> > > In a way these two methods carry out a very similar task, would it be
> > > reasonable to think of *splitIntoBundles *as *generate*Initial*Splits? *
> > > (strikethrough due to "initial" not being applicable in the case of bounded
> > > sources)
> > >
> > > Regards,
> > > Stas
> > >
> >
>

Re: splitIntoBundles vs. generateInitialSplits

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.

I agree that the methods are named somewhat confusingly, and ideally would
be named the same. Both of the names miss some aspect of the underlying
concept.

The underlying concept is to split the source into smaller sub-sources which,
if you read all of them, would have read the same data as the original one.
"splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
false in streaming, and only partially true in batch (I'm talking about the
Dataflow runner).
"generateInitialSplits" assumes that this splitting happens only
"initially", i.e. at job startup time. This is currently true in practice
for all existing runners, but it doesn't have to be - we could conceivably
call it again at some point during the job if we see that some of the
sub-sources are still too large.

The analogous method in Splittable DoFn (
https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
there are no restrictions in source API, only sources.

Perhaps both should be called simply "split", or "splitIntoSubSources".
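
To make the sub-source idea concrete, here is a minimal sketch, assuming a
hypothetical RangeSource that reads the integers in [start, end). It is plain
Java rather than Beam's BoundedSource or UnboundedSource API, and the class
name, the split(desiredSize) signature and the read() helper are all
illustrative assumptions. The only property it tries to show is the one
described above: reading every sub-source yields the same data as reading the
original.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a source over the integers in [start, end). Not a Beam class. */
final class RangeSource {
    final long start; // inclusive
    final long end;   // exclusive

    RangeSource(long start, long end) {
        if (start > end) {
            throw new IllegalArgumentException("start must be <= end");
        }
        this.start = start;
        this.end = end;
    }

    /** Splits this source into consecutive sub-sources of at most desiredSize elements each. */
    List<RangeSource> split(long desiredSize) {
        if (desiredSize <= 0) {
            throw new IllegalArgumentException("desiredSize must be positive");
        }
        List<RangeSource> subSources = new ArrayList<>();
        for (long s = start; s < end; s += desiredSize) {
            subSources.add(new RangeSource(s, Math.min(s + desiredSize, end)));
        }
        return subSources;
    }

    /** Reads every element of this source, in order. */
    List<Long> read() {
        List<Long> out = new ArrayList<>();
        for (long i = start; i < end; i++) {
            out.add(i);
        }
        return out;
    }
}
```

Nothing in this sketch ties split() to job startup or to a bundle: a
sub-source is itself a RangeSource, so it could in principle be split again
later if it turned out to be too large.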

On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <st...@gmail.com> wrote:

> Definitely seems like the formatting got lost in translation, sorry about
> that :)
>
> I guess both cases (methods) create splits, which are essentially a list of
> bounded/unbounded source instances, each responsible for reading certain
> segments (physical or otherwise) of the data.
>
> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <si...@google.com.invalid>
> wrote:
>
> > hi!
> >
> > I think your strikethrough got lost due to this being a text-only email
> > list. To make sure, I think you're asking the following:
> > " would it be reasonable to think of splitIntoBundles as generateSplits? "
> > (ie, you strikethrough'd Initial)
> >
> > They are very similar and I definitely also think of them as occupying the
> > same niche. I'll let someone else who was around for naming discuss whether
> > it was intentional or not. Conceptually, the way that bounded vs streaming
> > are handled means that they are doing slightly different things: a bounded
> > source is really kind of creating physical chunks of the data, whereas the
> > streaming source is creating conceptual divisions of the data that will be
> > used later. I'm not sure that's worth the confusion caused by the
> > differences.
> >
> > One thing to clarify - splitIntoBundles does have an "Initial" aspect to
> > it. I don't believe there is a publicly defined/written down order the
> > Sources & Reader methods are called in, but a runner trying to get
> > efficiency would be able to use splitIntoBundles during job startup to be
> > able to split up the work before creating readers rather than after
> > creating readers and waiting to use splitAtFraction.
> >
> > S
> >
> > On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <st...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > A short terminology question regarding "bundle", and
> > > particularly splitIntoBundles vs. generateInitialSplits.
> > >
> > > In *BoundedSource* we have:
> > > List<? extends BoundedSource<T>> *splitIntoBundles*(...)
> > >
> > > In *UnboundedSource* we have:
> > > List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> > > *generateInitialSplits*(...)
> > >
> > > I was wondering if the names were intentionally made different, i.e. "into
> > > bundles" vs "into splits"?
> > > In a way these two methods carry out a very similar task, would it be
> > > reasonable to think of *splitIntoBundles *as *generate*Initial*Splits? *
> > > (strikethrough due to "initial" not being applicable in the case of bounded
> > > sources)
> > >
> > > Regards,
> > > Stas
> > >
> >
>

Re: splitIntoBundles vs. generateInitialSplits

Posted by Stas Levin <st...@gmail.com>.

Definitely seems like the formatting got lost in translation, sorry about
that :)

I guess both cases (methods) create splits, which are essentially a list of
bounded/unbounded source instances, each responsible for reading certain
segments (physical or otherwise) of the data.

On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <si...@google.com.invalid>
wrote:

> hi!
>
> I think your strikethrough got lost due to this being a text-only email
> list. To make sure, I think you're asking the following:
> " would it be reasonable to think of splitIntoBundles as generateSplits? "
> (ie, you strikethrough'd Initial)
>
> They are very similar and I definitely also think of them as occupying the
> same niche. I'll let someone else who was around for naming discuss whether
> it was intentional or not. Conceptually, the way that bounded vs streaming
> are handled means that they are doing slightly different things: a bounded
> source is really kind of creating physical chunks of the data, whereas the
> streaming source is creating conceptual divisions of the data that will be
> used later. I'm not sure that's worth the confusion caused by the
> differences.
>
> One thing to clarify - splitIntoBundles does have an "Initial" aspect to
> it. I don't believe there is a publicly defined/written down order the
> Sources & Reader methods are called in, but a runner trying to get
> efficiency would be able to use splitIntoBundles during job startup to be
> able to split up the work before creating readers rather than after
> creating readers and waiting to use splitAtFraction.
>
> S
>
> On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <st...@gmail.com> wrote:
>
> > Hi,
> >
> > A short terminology question regarding "bundle", and
> > particularly splitIntoBundles vs. generateInitialSplits.
> >
> > In *BoundedSource* we have:
> > List<? extends BoundedSource<T>> *splitIntoBundles*(...)
> >
> > In *UnboundedSource* we have:
> > List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> > *generateInitialSplits*(...)
> >
> > I was wondering if the names were intentionally made different, i.e. "into
> > bundles" vs "into splits"?
> > In a way these two methods carry out a very similar task, would it be
> > reasonable to think of *splitIntoBundles *as *generate*Initial*Splits? *
> > (strikethrough due to "initial" not being applicable in the case of bounded
> > sources)
> >
> > Regards,
> > Stas
> >
>

Re: splitIntoBundles vs. generateInitialSplits

Posted by Stephen Sisk <si...@google.com.INVALID>.

hi!

I think your strikethrough got lost due to this being a text-only email
list. To make sure, I think you're asking the following:
" would it be reasonable to think of splitIntoBundles as generateSplits? "
(ie, you strikethrough'd Initial)

They are very similar and I definitely also think of them as occupying the
same niche. I'll let someone else who was around for naming discuss whether
it was intentional or not. Conceptually, the way that bounded vs streaming
are handled means that they are doing slightly different things: a bounded
source is really kind of creating physical chunks of the data, whereas the
streaming source is creating conceptual divisions of the data that will be
used later. I'm not sure that's worth the confusion caused by the
differences.

One thing to clarify - splitIntoBundles does have an "Initial" aspect to
it. I don't believe there is a publicly defined/written down order the
Sources & Reader methods are called in, but a runner trying to get
efficiency would be able to use splitIntoBundles during job startup to be
able to split up the work before creating readers rather than after
creating readers and waiting to use splitAtFraction.
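
As a rough illustration of the second path, splitting after a reader already
exists, here is a minimal sketch assuming a hypothetical RangeReader over
[start, end). It is plain Java, not Beam's reader API: the class, its
readNext() method and the long[] return value are assumptions made for the
example. The point is that a split on a running reader can only hand off work
that has not been consumed yet, whereas an up-front split has no such
constraint.

```java
import java.util.NoSuchElementException;

/** Hypothetical reader over the integers in [start, end). Not a Beam class. */
final class RangeReader {
    private final long start; // inclusive
    private long end;         // exclusive; shrinks when work is handed off
    private long position;    // next element to be returned

    RangeReader(long start, long end) {
        this.start = start;
        this.end = end;
        this.position = start;
    }

    boolean hasNext() {
        return position < end;
    }

    long readNext() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return position++;
    }

    /**
     * Tries to hand off everything at or after the given fraction of the current range.
     * Returns the residual range as {start, end}, or null if the split point has
     * already been consumed or would leave no residual.
     */
    long[] splitAtFraction(double fraction) {
        long splitPos = start + (long) Math.ceil(fraction * (end - start));
        if (splitPos < position || splitPos >= end) {
            return null; // too late, or nothing left to give away
        }
        long[] residual = {splitPos, end};
        end = splitPos; // this reader now stops at the split point
        return residual;
    }
}
```

A runner that splits up front never hits the "too late" case, which is one
way to read the "Initial" aspect mentioned above.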

S

On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <st...@gmail.com> wrote:

> Hi,
>
> A short terminology question regarding "bundle", and
> particularly splitIntoBundles vs. generateInitialSplits.
>
> In *BoundedSource* we have:
> List<? extends BoundedSource<T>> *splitIntoBundles*(...)
>
> In *UnboundedSource* we have:
> List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> *generateInitialSplits*(...)
>
> I was wondering if the names were intentionally made different, i.e. "into
> bundles" vs "into splits"?
> In a way these two methods carry out a very similar task, would it be
> reasonable to think of *splitIntoBundles *as *generate*Initial*Splits? *
> (strikethrough due to "initial" not being applicable in the case of bounded
> sources)
>
> Regards,
> Stas
>