Posted to dev@pulsar.apache.org by Sanjeev Kulkarni <sa...@gmail.com> on 2020/05/14 05:34:44 UTC

[DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Hi all,

The current interfaces for sources in Pulsar IO are geared towards
streaming sources, where data is available on a continuous basis. There
are many data sources where data is not available in a
continuous/streaming fashion, but rather arrives periodically or in spurts.
This set of 'Batch Sources' has a set of common characteristics that
might warrant framework-level support in Pulsar IO.

Jerry and I have jotted down the ideas around this in PIP-65. Please
review it and let us know what you think.

https://github.com/apache/pulsar/wiki/PIP-65:-Adapting-Pulsar-IO-Sources-to-support-Batch-Sources

Thanks!

Re: [DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Posted by Sanjeev Kulkarni <sa...@gmail.com>.

Hi Sijie,
As described in the proposal, there will be no change (either in the API or
in the runtime) for the existing streaming sources at all. The changes
proposed by this PIP are:
1. Adding an explicit API for writing batch sources.
2. Providing an executor-based implementation for the above.
In this sense, this is very similar to what exists for Window Functions.
The core function API and runtime remained the same; only a new API was
added to simplify the window function user's experience.
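
To make the two points above concrete, here is a minimal, self-contained sketch of what an explicit batch-source API plus an executor-style driver could look like. Everything in it (SketchBatchSource, InMemoryFileSource, runOnce) is a hypothetical name chosen for illustration; it is not the interface defined in PIP-65, and it models tasks as plain strings purely for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

public class BatchSourceSketch {

    /** Hypothetical batch-source contract (illustrative only, not the PIP-65 API). */
    public interface SketchBatchSource<T> {
        /** Discovery phase: emit one opaque task per unit of work. */
        void discover(Consumer<String> taskSink);

        /** Get ready to read the records that belong to a single task. */
        void prepare(String task);

        /** Next record of the prepared task, or null once the task is drained. */
        T readNext();
    }

    /** Toy source: every "file" is a task, every line in it is a record. */
    public static class InMemoryFileSource implements SketchBatchSource<String> {
        private final Map<String, List<String>> files;
        private final Queue<String> current = new ArrayDeque<>();

        public InMemoryFileSource(Map<String, List<String>> files) {
            this.files = files;
        }

        @Override
        public void discover(Consumer<String> taskSink) {
            files.keySet().forEach(taskSink);
        }

        @Override
        public void prepare(String task) {
            current.clear();
            current.addAll(files.get(task));
        }

        @Override
        public String readNext() {
            return current.poll();
        }
    }

    /** Executor-style driver: run discovery once, then drain each task in order. */
    public static <T> List<T> runOnce(SketchBatchSource<T> source) {
        Queue<String> tasks = new ArrayDeque<>();
        source.discover(tasks::add);

        List<T> records = new ArrayList<>();
        while (!tasks.isEmpty()) {
            source.prepare(tasks.poll());
            for (T r = source.readNext(); r != null; r = source.readNext()) {
                records.add(r);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.csv", List.of("a1", "a2"));
        files.put("b.csv", List.of("b1"));
        System.out.println(runOnce(new InMemoryFileSource(files)));
    }
}
```

The point of the sketch is only that the connector author writes one class, while the discover/execute loop lives in a driver supplied by the framework; a streaming source would not need to change.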


On Wed, May 20, 2020 at 6:19 PM Sijie Guo <gu...@gmail.com> wrote:

> Hi Jerry,
>
> I understand the concerns. I think it falls into a broader discussion of
> function composition.
>
> I am fine with the current proposal, but I would prefer that we don't
> introduce a lot of specialized code in the runtime just to handle this use
> case. It would be better if we could reuse the existing function framework,
> because it will be easier to implement such multiple-function functionality
> with a more general function composition approach.
>
> - Sijie

Re: [DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Posted by Sanjeev Kulkarni <sa...@gmail.com>.

Hi Devin,
The complexity here is orchestrating multiple functions that notionally
form the same connector. Thus stopping a batch connector would be
equivalent to stopping two functions, with the corresponding complexity of
dealing with a failure in between the two calls. The same goes for other calls.
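
As a rough illustration of that failure window, the sketch below stops a "connector" made of two functions through a fake admin client. FakeAdmin, stopConnector and the function names are hypothetical stand-ins invented for this example, not Pulsar APIs; the only thing it is meant to show is that the second call can fail after the first has already taken effect.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TwoFunctionStopSketch {

    public enum State { RUNNING, STOPPED }

    /** Hypothetical stand-in for an admin client; not a real Pulsar API. */
    public static class FakeAdmin {
        public final Map<String, State> functions = new LinkedHashMap<>();
        public String failOn; // name of the function whose stop call should fail

        public void stop(String name) {
            if (name.equals(failOn)) {
                throw new IllegalStateException("stop failed for " + name);
            }
            functions.put(name, State.STOPPED);
        }
    }

    /**
     * Stops the connector by stopping both functions in turn.
     * The two calls are independent, so the operation is not atomic.
     */
    public static boolean stopConnector(FakeAdmin admin, String discoverFn, String workerFn) {
        try {
            admin.stop(discoverFn);
            admin.stop(workerFn);
            return true;
        } catch (IllegalStateException e) {
            // The connector may now be half stopped; the caller has to reconcile.
            return false;
        }
    }

    public static void main(String[] args) {
        FakeAdmin admin = new FakeAdmin();
        admin.functions.put("discover", State.RUNNING);
        admin.functions.put("worker", State.RUNNING);
        admin.failOn = "worker";

        boolean stopped = stopConnector(admin, "discover", "worker");
        System.out.println("stopped=" + stopped + " state=" + admin.functions);
    }
}
```

With a single batch-source abstraction there is only one entity to stop, so this partial state cannot arise from the admin call itself.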

On Thu, May 21, 2020 at 7:41 PM Devin Bost <de...@gmail.com> wrote:

> I apologize for not fully understanding the context here, but is the
> concern about using the existing function architecture that two
> sequential operations in a function flow would need to be synchronous with
> respect to transactions, so as to avoid the race conditions and parallelism
> issues that could result from them being transactionally independent
> of one another?
>
> Perhaps there's a larger use case here that could be represented with a
> pattern.
>
> --
> Devin G. Bost

Re: [DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Posted by Devin Bost <de...@gmail.com>.

I apologize for not fully understanding the context here, but is the
concern about using the existing function architecture that two
sequential operations in a function flow would need to be synchronous with
respect to transactions, so as to avoid the race conditions and parallelism
issues that could result from them being transactionally independent
of one another?

Perhaps there's a larger use case here that could be represented with a
pattern.

--
Devin G. Bost

On Wed, May 20, 2020, 7:19 PM Sijie Guo <gu...@gmail.com> wrote:

> Hi Jerry,
>
> I understand the concerns. I think it falls into a broader discussion of
> function composition.
>
> I am fine with the current proposal, but I would prefer that we don't
> introduce a lot of specialized code in the runtime just to handle this use
> case. It would be better if we could reuse the existing function framework,
> because it will be easier to implement such multiple-function functionality
> with a more general function composition approach.
>
> - Sijie

Re: [DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Posted by Sijie Guo <gu...@gmail.com>.

Hi Jerry,

I understand the concerns. I think it falls into a broader discussion of
function composition.

I am fine with the current proposal, but I would prefer that we don't
introduce a lot of specialized code in the runtime just to handle this use
case. It would be better if we could reuse the existing function framework,
because it will be easier to implement such multiple-function functionality
with a more general function composition approach.

- Sijie

On Wed, May 20, 2020 at 3:54 PM Jerry Peng <je...@gmail.com>
wrote:

> Hi Sijie,
>
> We have considered a two-stage function as a way to implement a "batch"
> source. However, because there are two independent functions, it adds
> complexity to management, especially when there are failures. The two
> functions would need to be submitted and registered in an atomic fashion,
> which cannot be guaranteed in the current framework. Moreover, all CRUD
> operations for the "batch" source would also need to be atomic across the
> two functions. The other approach is to implement logic that runs two
> separate instance types for a single function. One type of function
> instance would run the "discovery" phase, and the other type would read the
> tasks from the discovery phase and execute them. That approach, though, is
> very similar to the proposed approach.
>
> I think the most important thing right now is to make sure the interface
> for the batch source is appropriate, since it's hard to change in the
> future. How the execution works in the backend can always be
> modified/optimized in the future.

Re: [DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Posted by Jerry Peng <je...@gmail.com>.

Hi Sijie,

We have considered a two-stage function as a way to implement a "batch"
source. However, because there are two independent functions, it adds
complexity to management, especially when there are failures. The two
functions would need to be submitted and registered in an atomic fashion,
which cannot be guaranteed in the current framework. Moreover, all CRUD
operations for the "batch" source would also need to be atomic across the
two functions. The other approach is to implement logic that runs two
separate instance types for a single function. One type of function
instance would run the "discovery" phase, and the other type would read the
tasks from the discovery phase and execute them. That approach, though, is
very similar to the proposed approach.

I think the most important thing right now is to make sure the interface
for the batch source is appropriate, since it's hard to change in the
future. How the execution works in the backend can always be
modified/optimized in the future.

On Wed, May 20, 2020 at 3:28 PM Sijie Guo <gu...@gmail.com> wrote:

> Hi Sanjeev,
>
> Just a couple of thoughts here. It seems to me that the BatchSource API is
> a bit complicated and that the same result can be achieved using the
> existing functions framework.
>
> - BatchSourceTrigger: can be implemented using a one-instance function
> that discovers the batch source tasks and returns them, so that the
> discovered tasks are published to its output topic.
> - BatchSource: can be implemented using a function that receives the
> batch source tasks and executes them.
>
> So it seems that this can be achieved using the existing framework by
> combining two functions. It seems that we can achieve this with a much
> clearer approach and keep the function & connector API relatively simple
> and consistent. Thoughts?
>
> - Sijie

Re: [DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Posted by Sijie Guo <gu...@gmail.com>.

Hi Sanjeev,

Just a couple of thoughts here. It seems to me that the BatchSource API is
a bit complicated and that the same result can be achieved using the
existing functions framework.

- BatchSourceTrigger: can be implemented using a one-instance function
that discovers the batch source tasks and returns them, so that the
discovered tasks are published to its output topic.
- BatchSource: can be implemented using a function that receives the
batch source tasks and executes them.

So it seems that this can be achieved using the existing framework by
combining two functions. It seems that we can achieve this with a much
clearer approach and keep the function & connector API relatively simple
and consistent. Thoughts?
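
The two-function composition described above can be sketched roughly as follows. This is an in-memory illustration only: a BlockingQueue stands in for the intermediate task topic, and discover and executeAll are hypothetical names, not Pulsar Functions APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class TwoFunctionCompositionSketch {

    /** Stage 1 (single instance): discover tasks and publish them to the "topic". */
    public static void discover(List<String> available, BlockingQueue<String> taskTopic) {
        taskTopic.addAll(available);
    }

    /** Stage 2: consume tasks from the "topic" and execute each one. */
    public static List<String> executeAll(BlockingQueue<String> taskTopic,
                                          Function<String, String> execute) {
        List<String> results = new ArrayList<>();
        String task;
        // poll() returns null when the queue is empty, which ends the loop.
        while ((task = taskTopic.poll()) != null) {
            results.add(execute.apply(task));
        }
        return results;
    }

    public static void main(String[] args) {
        BlockingQueue<String> taskTopic = new LinkedBlockingQueue<>();
        discover(List.of("t1", "t2"), taskTopic);
        System.out.println(executeAll(taskTopic, t -> "done:" + t));
    }
}
```

In this shape the two stages are only coupled through the queue, which is what keeps the API surface small; it is also why each stage has its own lifecycle.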

- Sijie

On Wed, May 20, 2020 at 8:33 AM Sanjeev Kulkarni <sa...@gmail.com>
wrote:

> Pinging the community about this. Would love feedback on this.
> Thanks!

Re: [DISCUSS] PIP-65: Adapting Pulsar IO Sources to support Batch Sources

Posted by Sanjeev Kulkarni <sa...@gmail.com>.

Pinging the community about this. Would love feedback on this.
Thanks!

On Wed, May 13, 2020 at 10:34 PM Sanjeev Kulkarni <sa...@gmail.com>
wrote:

> Hi all,
>
> The current interfaces for sources in Pulsar IO are geared towards
> streaming sources, where data is available on a continuous basis. There
> are many data sources where data is not available in a
> continuous/streaming fashion, but rather arrives periodically or in spurts.
> This set of 'Batch Sources' has a set of common characteristics that
> might warrant framework-level support in Pulsar IO.
>
> Jerry and I have jotted down the ideas around this in PIP-65. Please
> review it and let us know what you think.
>
>
> https://github.com/apache/pulsar/wiki/PIP-65:-Adapting-Pulsar-IO-Sources-to-support-Batch-Sources
>
> Thanks!
>