You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@apex.apache.org by Ananth Gundabattula <ag...@gmail.com> on 2016/06/14 19:41:48 UTC

A few questions about the Kafka 0.9 operator

Hello Siyuan/All,

I have a couple of questions regarding the Kafka 0.9 operator. Could you
please help me in understanding this operator a bit better?


   - As stated in
   http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator ,
   kafka 0.9 operator stores it "check-pointed offsets" in Kafka itself using
   the App name ? It sounds like -originalAppID is not used by this operator
   at all - In other words, I cant force an app to process starting from the
   beginning until I change the App name if the App is based on the Kafka 0.9
   operator as the input operator
   - How does the kafka 0.9 operator handle downstream operators failure ?
   By this I mean, an Apex downstream operator fails, and is brought back up
   by STRAM. However this operator was significantly lagging behind the
   current window of the kafka 0.9 operator window. Does the buffer server
   within the Kafka 0.9 operator buffer many windows to handle this situation
   ? ( and hence replays accordingly ? ) . I ask this to fine tune the buffer
   memory property.
   - Is EXACTLY_ONCE processing supported in this operator ? if yes, is it
   fair to assume that HDFS would be used to manage this type of configuration
   ?
   - Is EXACTLY_ONCE based off the streaming window or the Application
   Window in Apex ?


Regards,

Ananth

Re: A few questions about the Kafka 0.9 operator

Posted by Ananth Gundabattula <ag...@gmail.com>.
Thanks a lot for the answers Thomas. The blog entry really helps.

Regarding the comment about offset management initialization, we are using
APPLICATION_OR_EARLIEST.  I was interpreting APPLICATION in this
configuration as the "ApplicationID" ( and hence my question regarding
originalAppId parameter ) and the behavior for the Kafka 0.9 operator seems
to be interpreting it as the application name. I shall wait for Siyuan's
thoughts on this.

Regards,
Ananth

On Wed, Jun 15, 2016 at 7:57 AM, Thomas Weise <th...@gmail.com>
wrote:

> You can also have a look at this blog and linked example that specifically
> covers exactly-once with input from Kafka:
>
> https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
>
>
> On Tue, Jun 14, 2016 at 2:47 PM, Thomas Weise <th...@gmail.com>
> wrote:
>
>> See response below:
>>
>> On Tue, Jun 14, 2016 at 12:41 PM, Ananth Gundabattula <
>> agundabattula@gmail.com> wrote:
>>
>>> Hello Siyuan/All,
>>>
>>> I have a couple of questions regarding the Kafka 0.9 operator. Could you
>>> please help me in understanding this operator a bit better?
>>>
>>>
>>>    - As stated in
>>>    http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>>>    , kafka 0.9 operator stores it "check-pointed offsets" in Kafka itself
>>>    using the App name ? It sounds like -originalAppID is not used by this
>>>    operator at all - In other words, I cant force an app to process starting
>>>    from the beginning until I change the App name if the App is based on the
>>>    Kafka 0.9 operator as the input operator
>>>
>>> The start offset configuration option should determine where the
>> operator starts consuming on cold start (earliest, latest, last consumed).
>> If that's not the case then it would be a bug. Siyuan, please comment.
>>
>>>
>>>    -
>>>    - How does the kafka 0.9 operator handle downstream operators
>>>    failure ? By this I mean, an Apex downstream operator fails, and is brought
>>>    back up by STRAM. However this operator was significantly lagging behind
>>>    the current window of the kafka 0.9 operator window. Does the buffer server
>>>    within the Kafka 0.9 operator buffer many windows to handle this situation
>>>    ? ( and hence replays accordingly ? ) . I ask this to fine tune the buffer
>>>    memory property.
>>>
>>> The upstream buffer server will hold the data until processed by the
>> downstream operator. The buffer server, by default, will start to spool to
>> disk when the allocated memory is used up. Back pressure will cause the
>> consumer to slow down accordingly.
>>
>>>
>>>    - Is EXACTLY_ONCE processing supported in this operator ? if yes, is
>>>    it fair to assume that HDFS would be used to manage this type of
>>>    configuration ?
>>>
>>> Yes, when you enable idempotency on the operator, exactly once
>> processing semantics in downstream operators are supported (affects those
>> that write to external systems). To enable this you can configure to use
>> the window data manager that writes to HDFS, essentially it will keep track
>> of the consumer offsets for each window.
>>
>>>
>>>    -
>>>    - Is EXACTLY_ONCE based off the streaming window or the Application
>>>    Window in Apex ?
>>>
>>> The operator only sees the "application window". Make sure to align the
>> checkpoint window interval.
>>
>> For more information about the Kafka input operator, please see:
>> http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>>
>>
>>
>>
>

Re: A few questions about the Kafka 0.9 operator

Posted by Thomas Weise <th...@gmail.com>.
You can also have a look at this blog and linked example that specifically
covers exactly-once with input from Kafka:

https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/


On Tue, Jun 14, 2016 at 2:47 PM, Thomas Weise <th...@gmail.com>
wrote:

> See response below:
>
> On Tue, Jun 14, 2016 at 12:41 PM, Ananth Gundabattula <
> agundabattula@gmail.com> wrote:
>
>> Hello Siyuan/All,
>>
>> I have a couple of questions regarding the Kafka 0.9 operator. Could you
>> please help me in understanding this operator a bit better?
>>
>>
>>    - As stated in
>>    http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>>    , kafka 0.9 operator stores it "check-pointed offsets" in Kafka itself
>>    using the App name ? It sounds like -originalAppID is not used by this
>>    operator at all - In other words, I cant force an app to process starting
>>    from the beginning until I change the App name if the App is based on the
>>    Kafka 0.9 operator as the input operator
>>
>> The start offset configuration option should determine where the operator
> starts consuming on cold start (earliest, latest, last consumed). If that's
> not the case then it would be a bug. Siyuan, please comment.
>
>>
>>    -
>>    - How does the kafka 0.9 operator handle downstream operators failure
>>    ? By this I mean, an Apex downstream operator fails, and is brought back up
>>    by STRAM. However this operator was significantly lagging behind the
>>    current window of the kafka 0.9 operator window. Does the buffer server
>>    within the Kafka 0.9 operator buffer many windows to handle this situation
>>    ? ( and hence replays accordingly ? ) . I ask this to fine tune the buffer
>>    memory property.
>>
>> The upstream buffer server will hold the data until processed by the
> downstream operator. The buffer server, by default, will start to spool to
> disk when the allocated memory is used up. Back pressure will cause the
> consumer to slow down accordingly.
>
>>
>>    - Is EXACTLY_ONCE processing supported in this operator ? if yes, is
>>    it fair to assume that HDFS would be used to manage this type of
>>    configuration ?
>>
>> Yes, when you enable idempotency on the operator, exactly once processing
> semantics in downstream operators are supported (affects those that write
> to external systems). To enable this you can configure to use the window
> data manager that writes to HDFS, essentially it will keep track of the
> consumer offsets for each window.
>
>>
>>    -
>>    - Is EXACTLY_ONCE based off the streaming window or the Application
>>    Window in Apex ?
>>
>> The operator only sees the "application window". Make sure to align the
> checkpoint window interval.
>
> For more information about the Kafka input operator, please see:
> http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>
>
>
>

Re: A few questions about the Kafka 0.9 operator

Posted by Thomas Weise <th...@gmail.com>.
See response below:

On Tue, Jun 14, 2016 at 12:41 PM, Ananth Gundabattula <
agundabattula@gmail.com> wrote:

> Hello Siyuan/All,
>
> I have a couple of questions regarding the Kafka 0.9 operator. Could you
> please help me in understanding this operator a bit better?
>
>
>    - As stated in
>    http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator
>    , kafka 0.9 operator stores it "check-pointed offsets" in Kafka itself
>    using the App name ? It sounds like -originalAppID is not used by this
>    operator at all - In other words, I cant force an app to process starting
>    from the beginning until I change the App name if the App is based on the
>    Kafka 0.9 operator as the input operator
>
> The start offset configuration option should determine where the operator
starts consuming on cold start (earliest, latest, last consumed). If that's
not the case then it would be a bug. Siyuan, please comment.

>
>    -
>    - How does the kafka 0.9 operator handle downstream operators failure
>    ? By this I mean, an Apex downstream operator fails, and is brought back up
>    by STRAM. However this operator was significantly lagging behind the
>    current window of the kafka 0.9 operator window. Does the buffer server
>    within the Kafka 0.9 operator buffer many windows to handle this situation
>    ? ( and hence replays accordingly ? ) . I ask this to fine tune the buffer
>    memory property.
>
> The upstream buffer server will hold the data until processed by the
downstream operator. The buffer server, by default, will start to spool to
disk when the allocated memory is used up. Back pressure will cause the
consumer to slow down accordingly.

>
>    - Is EXACTLY_ONCE processing supported in this operator ? if yes, is
>    it fair to assume that HDFS would be used to manage this type of
>    configuration ?
>
> Yes, when you enable idempotency on the operator, exactly once processing
semantics in downstream operators are supported (affects those that write
to external systems). To enable this you can configure to use the window
data manager that writes to HDFS, essentially it will keep track of the
consumer offsets for each window.

>
>    -
>    - Is EXACTLY_ONCE based off the streaming window or the Application
>    Window in Apex ?
>
> The operator only sees the "application window". Make sure to align the
checkpoint window interval.

For more information about the Kafka input operator, please see:
http://www.slideshare.net/ApacheApex/apache-apex-kafka-input-operator