Posted to users@kafka.apache.org by Abhishek Agarwal <ab...@gmail.com> on 2016/08/25 15:50:12 UTC

Kafka streams

Hi,
I was reading up on Kafka Streams for a project and came across this blog:
https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/
I wanted to validate some assertions made in the blog with the Kafka community.

- Kafka Streams is a Kafka-in, Kafka-out application. Does the user need
Kafka Connect to transfer data from Kafka to an external store?
- No support for asynchronous processing: can I use more threads than the
number of partitions for processors without sacrificing at-least-once
guarantees?



-- 
Regards,
Abhishek Agarwal

Re: Kafka streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Hi,

If you need multiple polls to receive more data before you start
processing, you should disable auto commit (via
`enable.auto.commit=false` on the new Java consumer). Then no commit
happens on poll() -- of course, you need to commit manually. Kafka
Streams also uses this strategy internally.
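A minimal sketch of that consumer-side setup (property names are from the
0.10 Java consumer; the poll/commit loop is shown only in comments because
it needs a running broker, and the topic, group, and broker address are
made up for illustration):

```java
import java.util.Properties;

public class ManualCommitConfig {

    // Consumer configuration with auto commit disabled, as described above.
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "my-batching-group");       // hypothetical group
        props.put("enable.auto.commit", "false");         // no commit on poll()
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps();
        System.out.println(props.getProperty("enable.auto.commit"));
        // Against a live broker you would then do roughly:
        //   KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        //   consumer.subscribe(Collections.singletonList("input-topic"));
        //   while (true) {
        //     ConsumerRecords<String, String> records = consumer.poll(100);
        //     buffer(records);          // accumulate across multiple polls
        //     if (enoughBuffered()) {
        //       processBuffer();
        //       consumer.commitSync();  // commit only after processing
        //     }
        //   }
    }
}
```

Because the commit happens only after the buffered records are processed,
a crash before commitSync() causes a re-poll of the same records, which
preserves at-least-once semantics.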

About KTable: if a task fails and gets restarted on the same host, and
if the local state store is not corrupted, it will just re-use the local
state store.
Thus, the changelog will only be read (from the beginning) if the state
store is corrupted or the task was moved to a different host (i.e.,
application instance). In both cases, the local state store is rebuilt
on start-up before actual processing begins.
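Conceptually, that rebuild is just a replay of the changelog into the local
store, with the latest value per key winning. A stdlib-only sketch of the
idea (this is not the actual Kafka Streams restore code; entries are
modeled as key/value pairs, and all names are made up):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangelogRestoreSketch {

    // Replaying a changelog (latest value per key wins) rebuilds KTable state.
    // entry[0] is the key, entry[1] the value; a null value is a tombstone.
    static Map<String, String> restore(List<String[]> changelog) {
        Map<String, String> store = new HashMap<>();
        for (String[] entry : changelog) {
            if (entry[1] == null) store.remove(entry[0]); // delete on tombstone
            else store.put(entry[0], entry[1]);
        }
        return store;
    }

    public static void main(String[] args) {
        List<String[]> changelog = Arrays.asList(
            new String[]{"a", "1"},
            new String[]{"b", "2"},
            new String[]{"a", "3"}   // later update overwrites earlier one
        );
        System.out.println(restore(changelog)); // one latest value per key
    }
}
```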


-Matthias



On 08/26/2016 10:52 AM, Abhishek Agarwal wrote:
> [...]


Re: Kafka streams

Posted by Abhishek Agarwal <ab...@gmail.com>.
Thanks Matthias and Eno. For at-least-once guarantees to be effective in
any system, the source (Kafka receiver) needs to know that the message has
been successfully processed. What I understood is that, since processing
happens in a single thread, if another record is polled, it is implicitly
assumed that the previous record has been successfully processed. Given that:

- When I am batching records in memory for the Kafka producer, is there a
possibility of data loss, since records have not been committed yet and the
source has already polled for more records?

- Similarly, for operations such as aggregations, records have to be
buffered and they may be lost if the source has already committed the offset.
The documentation suggests that the buffered memory is actually a KTable
backed by a local state store, and each single update to the KTable is
separately written to the changelog topic (doesn't that add latency?).

- When the task fails and re-spawns, does it read the changelog topic
from the beginning?

Thank you for your patience.

On Aug 26, 2016 3:35 AM, "Matthias J. Sax" <ma...@confluent.io> wrote:

> [...]

Re: Kafka streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Just want to add something:

If you use the Kafka Streams DSL, the library is Kafka-centric.

However, you could use the low-level Processor API to get data into your
topology from other systems. The problem is that fault tolerance will be
missing, and you would need to code it yourself. When reading from Kafka,
fault tolerance comes "for free" because Kafka Streams uses the Java
KafkaConsumer client internally.

So, as Eno said, it is recommended to use Kafka Connect to get data into
and out of the Kafka cluster. Of course, you can also use a tool other
than Connect for data import and export to/from Kafka.

From Eno's answer:

> Multiple threads would make sense to run separate DAGs.

If I understand this correctly, he means that you can write multiple
applications and connect them via a topic to get multiple threads for
different processors (i.e., the downstream application consumes the
output of the upstream application).


-Matthias

On 08/25/2016 07:45 PM, Eno Thereska wrote:
> [...]


Re: Kafka streams

Posted by Eno Thereska <en...@gmail.com>.
Good question. All of them would run in a single thread. That is the model. Multiple threads would make sense to run separate DAGs. 

Eno


> On 25 Aug 2016, at 18:32, Abhishek Agarwal <ab...@gmail.com> wrote:
> [...]


Re: Kafka streams

Posted by Abhishek Agarwal <ab...@gmail.com>.
Hi Eno,

Thanks for your reply. If my application DAG has three stream processors,
the first of which is the source, would all of them run in a single thread?
There may be scenarios wherein I want to have a different number of threads
for different processors, since some may be CPU-bound and some may be I/O-bound.

On Thu, Aug 25, 2016 at 10:49 PM, Eno Thereska <en...@gmail.com>
wrote:

> [...]


-- 
Regards,
Abhishek Agarwal

Re: Kafka streams

Posted by Eno Thereska <en...@gmail.com>.
Hi Abhishek,

- Correct on connecting to external stores. You can use Kafka Connect to get things in or out. (Note that in the 0.10.1 release, KIP-67 allows you to directly query Kafka Streams' stores, so for some kinds of data you don't need to move them to an external store. This is pushed in trunk.)

- You can definitely use more threads than partitions, but that will not buy you much since some threads will be idle. No two threads will work on the same partition, so you don't have to worry about them repeating work.
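For completeness, the thread count this refers to is controlled by a single
Streams setting, "num.stream.threads". A stdlib-only sketch using plain
string keys (the application id and broker address are made up):

```java
import java.util.Properties;

public class StreamsThreadingConfig {

    // Streams configuration asking for 4 processing threads in this instance.
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put("application.id", "my-streams-app");    // hypothetical app id
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // Threads beyond the number of input partitions will simply sit idle,
        // because each partition is processed by exactly one task at a time.
        props.put("num.stream.threads", "4");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps().getProperty("num.stream.threads"));
    }
}
```

With, say, eight input partitions, up to eight such threads can do useful
work; a ninth would be idle, which is the point made above.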

Hope this helps.
Eno

> On 25 Aug 2016, at 16:50, Abhishek Agarwal <ab...@gmail.com> wrote:
> [...]