You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Jagadish Venkatraman <ja...@gmail.com> on 2016/08/30 19:09:18 UTC

[DISCUSS] End of Stream Feature in Samza

Currently, Samza works with streaming input sources like Kafka topics. This
proposal will build an idea of 'end-of-stream' into Samza to support data
sources that are bounded - like HDFS files, snapshots on disk etc.

Proposal:

https://issues.apache.org/jira/secure/attachment/12825119/ProposalforEndofStreaminSamza.pdf

This is tracked in SAMZA-974.

-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: [DISCUSS] End of Stream Feature in Samza

Posted by Jagadish Venkatraman <ja...@gmail.com>.
Hi Navina,

*>> after it has exhausted all messages from the source, it will generate a
special SENTINEL envelope that will not have any key or message and only
contain a special offset. Is that a valid understanding?*

The table in the doc already specifies how the sentinel envelope is
constructed, and how individual fields are populated. I've re-worded some
parts of the document to make it explicit and address your feedback.
Hopefully, it makes it clearer.


*>> I haven't reviewed the RB yet and I would really appreciate if you
can explain the behavior in the design doc.*

I've updated the design doc with a section that has a couple of examples on
restart scenarios.

Thanks,
Jagadish


On Thu, Sep 1, 2016 at 10:35 AM, Navina Ramesh <nramesh@linkedin.com.invalid
> wrote:

> It means that the message is not delivered to the "*StreamTask
> implementation*" ie. user code. Yes, we are creating a contract between
> SystemConsumer and Samza - " the SystemConsumer implementation will
> generate an IncomingMessageEnvelope by invoking buildEndOfStream()"
> >> Ok. Perhaps I didn't frame my question right. I understand that the
> sentinel message will not be delivered to the StreamTask. It is not clear
> why the "last message" in the source is not delivered to the task. I think
> you meant to say that the contract with the SystemConsumer is -  after it
> has exhausted all messages from the source, it will generate a special
> SENTINEL envelope that will not have any key or message and only contain a
> special offset. Is that a valid understanding?
>
> "Detecting" end of stream from the source should be simply parsing the
> offset from the message that the consumer returns during poll() and
> checking if it is end-of-stream. I'm happy to add this to the document.
> >> That sounds good. I think it will be useful to clearly describe the
> contract expected from a SystemConsumer that needs to provide a bounded
> data stream.
>
> Nope, It does not. However, the container would not terminate (if you
> have a source that has not reached end of stream yet)
> >> Awesome!
>
> No, we do not. The dance in the AsyncRunLoop state's machine / flow
> control in https://reviews.apache.org/r/51346/ guarantees that
> end-of-stream SSPs are not polled any more.
> >> I haven't reviewed the RB yet and I would really appreciate if you can
> explain the behavior in the design doc. I will get to your RB soon
> (hopefully). Having a comprehensive design doc will also help you update
> the documentation later on.
>
> Thanks for solving this problem so graciously. It is a big step for Samza!
>
> Cheers!
> Navina
>
> On Thu, Sep 1, 2016 at 5:04 AM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
> > Thanks Jagadish! This is great!
> > Can you share some thoughts/opinions on how this feature relates on using
> > punctuations (at some point) in Samza? I mean do you think that using
> > punctuated streams could be seen as a generalization of this problem? And
> > if so, could this feature be used later on as a building block for
> > implementing punctuations into Samza?
> >
> >
> > Best,
> >
> > Renato M.
> >
> > 2016-09-01 1:37 GMT+02:00 Jagadish Venkatraman <ja...@gmail.com>:
> >
> > > Thanks for reviewing and the comments. Please find my replies inline.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > *"The offset in the partition that the message was received from.
> Ifthis
> > is
> > > the last message in the SSP, this field is set to END_OF_STREAM.Such a
> > > message is not delivered to the actual StreamTask implementation."Does
> > this
> > > mean the last message is not delivered to the Task? Does the
> > sourceprovide
> > > that info? If that is the case, then it kind of creates contractbetween
> > any
> > > bounded system consumer and samza. Or did you mean to say thatwe assume
> > > end-of-stream has been reached when there is no message returnedon
> poll?
> > *
> > >
> > >
> > > >> It means that the message is not delivered to the "*StreamTask
> > > implementation*" ie. user code. Yes, we are creating a contract between
> > > SystemConsumer and Samza - " the SystemConsumer implementation will
> > > generate an IncomingMessageEnvelope by invoking buildEndOfStream()"
> > >
> > >
> > >
> > > *" Please do clarify. I think what is missing in this document ishow to
> > > "detect" an end-of-stream from the source."*
> > >
> > > >>"Detecting" end of stream from the source should be simply parsing
> the
> > > offset from the message that the consumer returns during poll() and
> > > checking if it is end-of-stream. I'm happy to add this to the document.
> > >
> > > 2. Does this design preclude the possibility of consuming bounded and
> > > unbounded stream partitions in the task ?
> > >
> > > >> Nope, It does not. However, the container would not terminate (if
> you
> > > have a source that has not reached end of stream yet)
> > >
> > > 3. During checkpoint, let's say some of the partitions have reached
> EOF.
> > Do
> > > we write a special offset in the checkpoint message that indicates that
> > it
> > > has reached end of stream and don't need to poll anymore?
> > >
> > > >> No, we do not. The dance in the AsyncRunLoop state's machine / flow
> > > control in https://reviews.apache.org/r/51346/ guarantees that
> > > end-of-stream SSPs are not polled any more.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Aug 31, 2016 at 2:01 PM, Navina Ramesh
> > > <nramesh@linkedin.com.invalid
> > > > wrote:
> > >
> > > > Hi Jagadish,
> > > > Thanks for sharing the design with the community. I have a couple of
> > > > questions that were not very clear from the design document.
> > > >
> > > > 1. Under mechanism for indicating the end-of-stream to Samza, you
> > mention
> > > > "The offset in the partition that the message was received from. If
> > > > this is the last message in the SSP, this field is set to
> > END_OF_STREAM.
> > > > Such a message is not delivered to the actual StreamTask
> > implementation."
> > > > Does this mean the last message is not delivered to the Task? How do
> > you
> > > > identify that it is indeed the last message in the SSP? Does the
> source
> > > > provide that info? If that is the case, then it kind of creates
> > contract
> > > > between any bounded system consumer and samza. Or did you mean to say
> > > that
> > > > we assume end-of-stream has been reached when there is no message
> > > returned
> > > > on poll?  Please do clarify. I think what is missing in this document
> > is
> > > > how to "detect" an end-of-stream from the source.
> > > >
> > > > 2. Does this design preclude the possibility of consuming bounded and
> > > > unbounded stream partitions in the task ?
> > > >
> > > > 3. During checkpoint, let's say some of the partitions have reached
> > EOF.
> > > Do
> > > > we write a special offset in the checkpoint message that indicates
> that
> > > it
> > > > has reached end of stream and don't need to poll anymore?
> > > >
> > > > Thanks!
> > > > Navina
> > > >
> > > > On Tue, Aug 30, 2016 at 4:50 PM, Julian Hyde <jh...@apache.org>
> wrote:
> > > >
> > > > >
> > > > > > On Aug 30, 2016, at 4:44 PM, Xinyu Liu <xi...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > It's very exciting that Samza is adding support of bounded input
> > > > streams.
> > > > >
> > > > > +1!
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Navina R.
> > > >
> > >
> > >
> > >
> > > --
> > > Jagadish V,
> > > Graduate Student,
> > > Department of Computer Science,
> > > Stanford University
> > >
> >
>
>
>
> --
> Navina R.
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: [DISCUSS] End of Stream Feature in Samza

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.
It means that the message is not delivered to the "*StreamTask
implementation*" ie. user code. Yes, we are creating a contract between
SystemConsumer and Samza - " the SystemConsumer implementation will
generate an IncomingMessageEnvelope by invoking buildEndOfStream()"
>> Ok. Perhaps I didn't frame my question right. I understand that the
sentinel message will not be delivered to the StreamTask. It is not clear
why the "last message" in the source is not delivered to the task. I think
you meant to say that the contract with the SystemConsumer is -  after it
has exhausted all messages from the source, it will generate a special
SENTINEL envelope that will not have any key or message and only contain a
special offset. Is that a valid understanding?

"Detecting" end of stream from the source should be simply parsing the
offset from the message that the consumer returns during poll() and
checking if it is end-of-stream. I'm happy to add this to the document.
>> That sounds good. I think it will be useful to clearly describe the
contract expected from a SystemConsumer that needs to provide a bounded
data stream.

Nope, It does not. However, the container would not terminate (if you
have a source that has not reached end of stream yet)
>> Awesome!

No, we do not. The dance in the AsyncRunLoop state's machine / flow
control in https://reviews.apache.org/r/51346/ guarantees that
end-of-stream SSPs are not polled any more.
>> I haven't reviewed the RB yet and I would really appreciate if you can
explain the behavior in the design doc. I will get to your RB soon
(hopefully). Having a comprehensive design doc will also help you update
the documentation later on.

Thanks for solving this problem so graciously. It is a big step for Samza!

Cheers!
Navina

On Thu, Sep 1, 2016 at 5:04 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Thanks Jagadish! This is great!
> Can you share some thoughts/opinions on how this feature relates on using
> punctuations (at some point) in Samza? I mean do you think that using
> punctuated streams could be seen as a generalization of this problem? And
> if so, could this feature be used later on as a building block for
> implementing punctuations into Samza?
>
>
> Best,
>
> Renato M.
>
> 2016-09-01 1:37 GMT+02:00 Jagadish Venkatraman <ja...@gmail.com>:
>
> > Thanks for reviewing and the comments. Please find my replies inline.
> >
> >
> >
> >
> >
> >
> >
> >
> > *"The offset in the partition that the message was received from. Ifthis
> is
> > the last message in the SSP, this field is set to END_OF_STREAM.Such a
> > message is not delivered to the actual StreamTask implementation."Does
> this
> > mean the last message is not delivered to the Task? Does the
> sourceprovide
> > that info? If that is the case, then it kind of creates contractbetween
> any
> > bounded system consumer and samza. Or did you mean to say thatwe assume
> > end-of-stream has been reached when there is no message returnedon poll?
> *
> >
> >
> > >> It means that the message is not delivered to the "*StreamTask
> > implementation*" ie. user code. Yes, we are creating a contract between
> > SystemConsumer and Samza - " the SystemConsumer implementation will
> > generate an IncomingMessageEnvelope by invoking buildEndOfStream()"
> >
> >
> >
> > *" Please do clarify. I think what is missing in this document ishow to
> > "detect" an end-of-stream from the source."*
> >
> > >>"Detecting" end of stream from the source should be simply parsing the
> > offset from the message that the consumer returns during poll() and
> > checking if it is end-of-stream. I'm happy to add this to the document.
> >
> > 2. Does this design preclude the possibility of consuming bounded and
> > unbounded stream partitions in the task ?
> >
> > >> Nope, It does not. However, the container would not terminate (if you
> > have a source that has not reached end of stream yet)
> >
> > 3. During checkpoint, let's say some of the partitions have reached EOF.
> Do
> > we write a special offset in the checkpoint message that indicates that
> it
> > has reached end of stream and don't need to poll anymore?
> >
> > >> No, we do not. The dance in the AsyncRunLoop state's machine / flow
> > control in https://reviews.apache.org/r/51346/ guarantees that
> > end-of-stream SSPs are not polled any more.
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Aug 31, 2016 at 2:01 PM, Navina Ramesh
> > <nramesh@linkedin.com.invalid
> > > wrote:
> >
> > > Hi Jagadish,
> > > Thanks for sharing the design with the community. I have a couple of
> > > questions that were not very clear from the design document.
> > >
> > > 1. Under mechanism for indicating the end-of-stream to Samza, you
> mention
> > > "The offset in the partition that the message was received from. If
> > > this is the last message in the SSP, this field is set to
> END_OF_STREAM.
> > > Such a message is not delivered to the actual StreamTask
> implementation."
> > > Does this mean the last message is not delivered to the Task? How do
> you
> > > identify that it is indeed the last message in the SSP? Does the source
> > > provide that info? If that is the case, then it kind of creates
> contract
> > > between any bounded system consumer and samza. Or did you mean to say
> > that
> > > we assume end-of-stream has been reached when there is no message
> > returned
> > > on poll?  Please do clarify. I think what is missing in this document
> is
> > > how to "detect" an end-of-stream from the source.
> > >
> > > 2. Does this design preclude the possibility of consuming bounded and
> > > unbounded stream partitions in the task ?
> > >
> > > 3. During checkpoint, let's say some of the partitions have reached
> EOF.
> > Do
> > > we write a special offset in the checkpoint message that indicates that
> > it
> > > has reached end of stream and don't need to poll anymore?
> > >
> > > Thanks!
> > > Navina
> > >
> > > On Tue, Aug 30, 2016 at 4:50 PM, Julian Hyde <jh...@apache.org> wrote:
> > >
> > > >
> > > > > On Aug 30, 2016, at 4:44 PM, Xinyu Liu <xi...@gmail.com>
> > wrote:
> > > > >
> > > > > It's very exciting that Samza is adding support of bounded input
> > > streams.
> > > >
> > > > +1!
> > > >
> > > >
> > >
> > >
> > > --
> > > Navina R.
> > >
> >
> >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University
> >
>



-- 
Navina R.

Re: [DISCUSS] End of Stream Feature in Samza

Posted by Jagadish Venkatraman <ja...@gmail.com>.
Thank you Renato for the insightful feedback! :-)

I don't think we have a complete scoping and design for the "punctuated
streams" feature as of yet. However, one way to implement it would be to
leverage something similar to what this feature does. We could have a
special offset for a "punctuation message". In that case, the end-of-stream
is just a special case of punctuation message. A punctuation could
conceivably be implemented with a special offset (or even as a sub-class of
IncomingMessageEnvelope).

On Thu, Sep 1, 2016 at 5:04 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Thanks Jagadish! This is great!
> Can you share some thoughts/opinions on how this feature relates on using
> punctuations (at some point) in Samza? I mean do you think that using
> punctuated streams could be seen as a generalization of this problem? And
> if so, could this feature be used later on as a building block for
> implementing punctuations into Samza?
>
>
> Best,
>
> Renato M.
>
> 2016-09-01 1:37 GMT+02:00 Jagadish Venkatraman <ja...@gmail.com>:
>
> > Thanks for reviewing and the comments. Please find my replies inline.
> >
> >
> >
> >
> >
> >
> >
> >
> > *"The offset in the partition that the message was received from. Ifthis
> is
> > the last message in the SSP, this field is set to END_OF_STREAM.Such a
> > message is not delivered to the actual StreamTask implementation."Does
> this
> > mean the last message is not delivered to the Task? Does the
> sourceprovide
> > that info? If that is the case, then it kind of creates contractbetween
> any
> > bounded system consumer and samza. Or did you mean to say thatwe assume
> > end-of-stream has been reached when there is no message returnedon poll?
> *
> >
> >
> > >> It means that the message is not delivered to the "*StreamTask
> > implementation*" ie. user code. Yes, we are creating a contract between
> > SystemConsumer and Samza - " the SystemConsumer implementation will
> > generate an IncomingMessageEnvelope by invoking buildEndOfStream()"
> >
> >
> >
> > *" Please do clarify. I think what is missing in this document ishow to
> > "detect" an end-of-stream from the source."*
> >
> > >>"Detecting" end of stream from the source should be simply parsing the
> > offset from the message that the consumer returns during poll() and
> > checking if it is end-of-stream. I'm happy to add this to the document.
> >
> > 2. Does this design preclude the possibility of consuming bounded and
> > unbounded stream partitions in the task ?
> >
> > >> Nope, It does not. However, the container would not terminate (if you
> > have a source that has not reached end of stream yet)
> >
> > 3. During checkpoint, let's say some of the partitions have reached EOF.
> Do
> > we write a special offset in the checkpoint message that indicates that
> it
> > has reached end of stream and don't need to poll anymore?
> >
> > >> No, we do not. The dance in the AsyncRunLoop state's machine / flow
> > control in https://reviews.apache.org/r/51346/ guarantees that
> > end-of-stream SSPs are not polled any more.
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Aug 31, 2016 at 2:01 PM, Navina Ramesh
> > <nramesh@linkedin.com.invalid
> > > wrote:
> >
> > > Hi Jagadish,
> > > Thanks for sharing the design with the community. I have a couple of
> > > questions that were not very clear from the design document.
> > >
> > > 1. Under mechanism for indicating the end-of-stream to Samza, you
> mention
> > > "The offset in the partition that the message was received from. If
> > > this is the last message in the SSP, this field is set to
> END_OF_STREAM.
> > > Such a message is not delivered to the actual StreamTask
> implementation."
> > > Does this mean the last message is not delivered to the Task? How do
> you
> > > identify that it is indeed the last message in the SSP? Does the source
> > > provide that info? If that is the case, then it kind of creates
> contract
> > > between any bounded system consumer and samza. Or did you mean to say
> > that
> > > we assume end-of-stream has been reached when there is no message
> > returned
> > > on poll?  Please do clarify. I think what is missing in this document
> is
> > > how to "detect" an end-of-stream from the source.
> > >
> > > 2. Does this design preclude the possibility of consuming bounded and
> > > unbounded stream partitions in the task ?
> > >
> > > 3. During checkpoint, let's say some of the partitions have reached
> EOF.
> > Do
> > > we write a special offset in the checkpoint message that indicates that
> > it
> > > has reached end of stream and don't need to poll anymore?
> > >
> > > Thanks!
> > > Navina
> > >
> > > On Tue, Aug 30, 2016 at 4:50 PM, Julian Hyde <jh...@apache.org> wrote:
> > >
> > > >
> > > > > On Aug 30, 2016, at 4:44 PM, Xinyu Liu <xi...@gmail.com>
> > wrote:
> > > > >
> > > > > It's very exciting that Samza is adding support of bounded input
> > > streams.
> > > >
> > > > +1!
> > > >
> > > >
> > >
> > >
> > > --
> > > Navina R.
> > >
> >
> >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University
> >
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: [DISCUSS] End of Stream Feature in Samza

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks Jagadish! This is great!
Can you share some thoughts/opinions on how this feature relates on using
punctuations (at some point) in Samza? I mean do you think that using
punctuated streams could be seen as a generalization of this problem? And
if so, could this feature be used later on as a building block for
implementing punctuations into Samza?


Best,

Renato M.

2016-09-01 1:37 GMT+02:00 Jagadish Venkatraman <ja...@gmail.com>:

> Thanks for reviewing and the comments. Please find my replies inline.
>
>
>
>
>
>
>
>
> *"The offset in the partition that the message was received from. Ifthis is
> the last message in the SSP, this field is set to END_OF_STREAM.Such a
> message is not delivered to the actual StreamTask implementation."Does this
> mean the last message is not delivered to the Task? Does the sourceprovide
> that info? If that is the case, then it kind of creates contractbetween any
> bounded system consumer and samza. Or did you mean to say thatwe assume
> end-of-stream has been reached when there is no message returnedon poll? *
>
>
> >> It means that the message is not delivered to the "*StreamTask
> implementation*" ie. user code. Yes, we are creating a contract between
> SystemConsumer and Samza - " the SystemConsumer implementation will
> generate an IncomingMessageEnvelope by invoking buildEndOfStream()"
>
>
>
> *" Please do clarify. I think what is missing in this document ishow to
> "detect" an end-of-stream from the source."*
>
> >>"Detecting" end of stream from the source should be simply parsing the
> offset from the message that the consumer returns during poll() and
> checking if it is end-of-stream. I'm happy to add this to the document.
>
> 2. Does this design preclude the possibility of consuming bounded and
> unbounded stream partitions in the task ?
>
> >> Nope, It does not. However, the container would not terminate (if you
> have a source that has not reached end of stream yet)
>
> 3. During checkpoint, let's say some of the partitions have reached EOF. Do
> we write a special offset in the checkpoint message that indicates that it
> has reached end of stream and don't need to poll anymore?
>
> >> No, we do not. The dance in the AsyncRunLoop state's machine / flow
> control in https://reviews.apache.org/r/51346/ guarantees that
> end-of-stream SSPs are not polled any more.
>
>
>
>
>
>
>
> On Wed, Aug 31, 2016 at 2:01 PM, Navina Ramesh
> <nramesh@linkedin.com.invalid
> > wrote:
>
> > Hi Jagadish,
> > Thanks for sharing the design with the community. I have a couple of
> > questions that were not very clear from the design document.
> >
> > 1. Under mechanism for indicating the end-of-stream to Samza, you mention
> > "The offset in the partition that the message was received from. If
> > this is the last message in the SSP, this field is set to END_OF_STREAM.
> > Such a message is not delivered to the actual StreamTask implementation."
> > Does this mean the last message is not delivered to the Task? How do you
> > identify that it is indeed the last message in the SSP? Does the source
> > provide that info? If that is the case, then it kind of creates contract
> > between any bounded system consumer and samza. Or did you mean to say
> that
> > we assume end-of-stream has been reached when there is no message
> returned
> > on poll?  Please do clarify. I think what is missing in this document is
> > how to "detect" an end-of-stream from the source.
> >
> > 2. Does this design preclude the possibility of consuming bounded and
> > unbounded stream partitions in the task ?
> >
> > 3. During checkpoint, let's say some of the partitions have reached EOF.
> Do
> > we write a special offset in the checkpoint message that indicates that
> it
> > has reached end of stream and don't need to poll anymore?
> >
> > Thanks!
> > Navina
> >
> > On Tue, Aug 30, 2016 at 4:50 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> > >
> > > > On Aug 30, 2016, at 4:44 PM, Xinyu Liu <xi...@gmail.com>
> wrote:
> > > >
> > > > It's very exciting that Samza is adding support of bounded input
> > streams.
> > >
> > > +1!
> > >
> > >
> >
> >
> > --
> > Navina R.
> >
>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>

Re: [DISCUSS] End of Stream Feature in Samza

Posted by Jagadish Venkatraman <ja...@gmail.com>.
Thanks for reviewing and the comments. Please find my replies inline.








*"The offset in the partition that the message was received from. Ifthis is
the last message in the SSP, this field is set to END_OF_STREAM.Such a
message is not delivered to the actual StreamTask implementation."Does this
mean the last message is not delivered to the Task? Does the sourceprovide
that info? If that is the case, then it kind of creates contractbetween any
bounded system consumer and samza. Or did you mean to say thatwe assume
end-of-stream has been reached when there is no message returnedon poll? *


>> It means that the message is not delivered to the "*StreamTask
implementation*" ie. user code. Yes, we are creating a contract between
SystemConsumer and Samza - " the SystemConsumer implementation will
generate an IncomingMessageEnvelope by invoking buildEndOfStream()"



*" Please do clarify. I think what is missing in this document ishow to
"detect" an end-of-stream from the source."*

>>"Detecting" end of stream from the source should be simply parsing the
offset from the message that the consumer returns during poll() and
checking if it is end-of-stream. I'm happy to add this to the document.

2. Does this design preclude the possibility of consuming bounded and
unbounded stream partitions in the task ?

>> Nope, It does not. However, the container would not terminate (if you
have a source that has not reached end of stream yet)

3. During checkpoint, let's say some of the partitions have reached EOF. Do
we write a special offset in the checkpoint message that indicates that it
has reached end of stream and don't need to poll anymore?

>> No, we do not. The dance in the AsyncRunLoop state's machine / flow
control in https://reviews.apache.org/r/51346/ guarantees that
end-of-stream SSPs are not polled any more.







On Wed, Aug 31, 2016 at 2:01 PM, Navina Ramesh <nramesh@linkedin.com.invalid
> wrote:

> Hi Jagadish,
> Thanks for sharing the design with the community. I have a couple of
> questions that were not very clear from the design document.
>
> 1. Under mechanism for indicating the end-of-stream to Samza, you mention
> "The offset in the partition that the message was received from. If
> this is the last message in the SSP, this field is set to END_OF_STREAM.
> Such a message is not delivered to the actual StreamTask implementation."
> Does this mean the last message is not delivered to the Task? How do you
> identify that it is indeed the last message in the SSP? Does the source
> provide that info? If that is the case, then it kind of creates contract
> between any bounded system consumer and samza. Or did you mean to say that
> we assume end-of-stream has been reached when there is no message returned
> on poll?  Please do clarify. I think what is missing in this document is
> how to "detect" an end-of-stream from the source.
>
> 2. Does this design preclude the possibility of consuming bounded and
> unbounded stream partitions in the task ?
>
> 3. During checkpoint, let's say some of the partitions have reached EOF. Do
> we write a special offset in the checkpoint message that indicates that it
> has reached end of stream and don't need to poll anymore?
>
> Thanks!
> Navina
>
> On Tue, Aug 30, 2016 at 4:50 PM, Julian Hyde <jh...@apache.org> wrote:
>
> >
> > > On Aug 30, 2016, at 4:44 PM, Xinyu Liu <xi...@gmail.com> wrote:
> > >
> > > It's very exciting that Samza is adding support of bounded input
> streams.
> >
> > +1!
> >
> >
>
>
> --
> Navina R.
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: [DISCUSS] End of Stream Feature in Samza

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.
Hi Jagadish,
Thanks for sharing the design with the community. I have a couple of
questions that were not very clear from the design document.

1. Under mechanism for indicating the end-of-stream to Samza, you mention
"The offset in the partition that the message was received from. If
this is the last message in the SSP, this field is set to END_OF_STREAM.
Such a message is not delivered to the actual StreamTask implementation."
Does this mean the last message is not delivered to the Task? How do you
identify that it is indeed the last message in the SSP? Does the source
provide that info? If that is the case, then it kind of creates contract
between any bounded system consumer and samza. Or did you mean to say that
we assume end-of-stream has been reached when there is no message returned
on poll?  Please do clarify. I think what is missing in this document is
how to "detect" an end-of-stream from the source.

2. Does this design preclude the possibility of consuming bounded and
unbounded stream partitions in the task ?

3. During checkpoint, let's say some of the partitions have reached EOF. Do
we write a special offset in the checkpoint message that indicates that it
has reached end of stream and don't need to poll anymore?

Thanks!
Navina

On Tue, Aug 30, 2016 at 4:50 PM, Julian Hyde <jh...@apache.org> wrote:

>
> > On Aug 30, 2016, at 4:44 PM, Xinyu Liu <xi...@gmail.com> wrote:
> >
> > It's very exciting that Samza is adding support of bounded input streams.
>
> +1!
>
>


-- 
Navina R.

Re: [DISCUSS] End of Stream Feature in Samza

Posted by Julian Hyde <jh...@apache.org>.
> On Aug 30, 2016, at 4:44 PM, Xinyu Liu <xi...@gmail.com> wrote:
> 
> It's very exciting that Samza is adding support of bounded input streams.

+1!


Re: [DISCUSS] End of Stream Feature in Samza

Posted by xinyu liu <xi...@gmail.com>.
It's very exciting that Samza is adding support of bounded input streams.
Nice write-up of different scenarios and options. Look forward to having
this feature work with the upcoming HDFS consumer!

Thanks,
Xinyu

On Tue, Aug 30, 2016 at 12:09 PM, Jagadish Venkatraman <
jagadish1989@gmail.com> wrote:

> Currently, Samza works with streaming input sources like Kafka topics. This
> proposal will build an idea of 'end-of-stream' into Samza to support data
> sources that are bounded - like HDFS files, snapshots on disk etc.
>
> Proposal:
>
> https://issues.apache.org/jira/secure/attachment/12825119/
> ProposalforEndofStreaminSamza.pdf
>
> This is tracked in SAMZA-974.
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>