Posted to users@kafka.apache.org by Rohit Valsakumar <rv...@tivo.com> on 2016/06/28 21:49:19 UTC

Question about bootstrap processing in KafkaStreams.

Hi all,

Is there a way to consume all the contents of a Kafka topic into a KTable before doing a left join with another KStream?

I am looking at something that simulates a bootstrap topic in a Samza job.

Thanks,
Rohit Valsakumar

________________________________

This email and any attachments may contain confidential and privileged material for the sole use of the intended recipient. Any review, copying, or distribution of this email (or any attachments) by others is prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete this email and any attachments. No employee or agent of TiVo Inc. is authorized to conclude any binding agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo Inc. may only be made by a signed written agreement.

Re: Question about bootstrap processing in KafkaStreams.

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Hi,

There was already a similar discussion on this list, "Kafka stream join
scenarios":

http://search-hadoop.com/m/uyzND1WsAGW1vB5O91&subj=Kafka+stream+join+scenarios

Long story short: there is no explicit support or guarantee. As Jay
mentioned, the alignment is best effort only. However, the main issue is
the question of what it means to load a KTable *completely*. Because a
KTable consumes a changelog stream, there is no defined end: a KTable is
a continuously updating "dynamic" table.

You might be able to build a custom solution for it, though (see the
email thread I linked above). Hope this helps.
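
One possible reading of "completely" is "everything that was in the
changelog topic when the application started". The sketch below models
that idea in plain Java, without any Kafka dependency: it replays a
changelog up to an end offset captured at startup and only then runs the
left join. The class BootstrapThenJoin, its methods, and the
String[]{key, value} record shape are hypothetical names chosen for
illustration. This is not necessarily the approach from the linked
thread, and it is not a Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Kafka Streams code.
final class BootstrapThenJoin {

    // Replays changelog entries [0, endOffsetAtStartup) into a map.
    // Each entry is {key, value}; a null value is treated as a delete.
    static Map<String, String> bootstrap(List<String[]> changelog, int endOffsetAtStartup) {
        Map<String, String> table = new HashMap<>();
        int end = Math.min(endOffsetAtStartup, changelog.size());
        for (int i = 0; i < end; i++) {
            String[] kv = changelog.get(i);
            if (kv[1] == null) {
                table.remove(kv[0]);
            } else {
                table.put(kv[0], kv[1]); // a later update overwrites an earlier one
            }
        }
        return table;
    }

    // Left join: every event is emitted, with "null" when the key has no table entry.
    static List<String> leftJoin(List<String[]> events, Map<String, String> table) {
        List<String> out = new ArrayList<>();
        for (String[] e : events) {
            out.add(e[0] + ":" + e[1] + "+" + table.get(e[0]));
        }
        return out;
    }
}
```

Anything written to the changelog after the captured offset is not part
of the bootstrap in this model; it would have to be applied as a normal
update while the join is already running.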


-Matthias

On 06/29/2016 04:26 AM, Gwen Shapira wrote:
> Upgrade :)


Re: Question about bootstrap processing in KafkaStreams.

Posted by Gwen Shapira <gw...@confluent.io>.
Upgrade :)

On Tue, Jun 28, 2016 at 6:49 PM, Rohit Valsakumar <rv...@tivo.com> wrote:
> Hi Jay,
>
> Thanks for the reply.
>
> Unfortunately in our case due to legacy reasons we are using
> WallclockTimestampExtractor in the application for all the streams and the
> existing messages in the stream probably won't have timestamps as they are
> being produced by legacy clients. So the events are being ingested with
> processing times and it may not be able to synchronize based on the
> message timestamps. What do you recommend for this scenario?
>
> Rohit

Re: Question about bootstrap processing in KafkaStreams.

Posted by Rohit Valsakumar <rv...@tivo.com>.
Hi Jay,

Thanks for the reply.

Unfortunately in our case due to legacy reasons we are using
WallclockTimestampExtractor in the application for all the streams and the
existing messages in the stream probably won't have timestamps as they are
being produced by legacy clients. So the events are being ingested with
processing times and it may not be able to synchronize based on the
message timestamps. What do you recommend for this scenario?

Rohit

On 6/28/16, 5:18 PM, "Jay Kreps" <ja...@confluent.io> wrote:

>I think you may get this for free as Kafka Streams attempts to align
>consumption across different topics/partitions by the timestamp in the
>messages. So in a case where you are starting a job fresh and it has a
>database changelog to consume and an event stream to consume, it will
>attempt to keep the KTable at the "time" the event stream is at. This is
>only a heuristic, of course, since messages are not necessarily strongly
>ordered by time. I think this is likely mostly the same but slightly
>better than the bootstrap usage in Samza but also covers other cases of
>alignment.
>
>If you want more control you can override the timestamp extractor that
>associates time and hence priority for the streams:
>https://kafka.apache.org/0100/javadoc/org/apache/kafka/streams/processor/TimestampExtractor.html
>
>-Jay

Re: Question about bootstrap processing in KafkaStreams.

Posted by Jay Kreps <ja...@confluent.io>.
I think you may get this for free as Kafka Streams attempts to align
consumption across different topics/partitions by the timestamp in the
messages. So in a case where you are starting a job fresh and it has a
database changelog to consume and an event stream to consume, it will
attempt to keep the KTable at the "time" the event stream is at. This is
only a heuristic, of course, since messages are not necessarily strongly
ordered by time. I think this is likely mostly the same but slightly better
than the bootstrap usage in Samza but also covers other cases of alignment.
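
To make the alignment idea concrete, here is a toy model in plain Java.
It is only a sketch of the principle "process the buffered record with
the lowest timestamp next" and not the library's implementation, which
works on per-partition buffers and is best effort. The class
TimestampAlignedMerge and its Rec type are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model, not Kafka Streams code.
final class TimestampAlignedMerge {

    // A buffered record: source topic, assigned timestamp, payload.
    static final class Rec {
        final String topic;
        final long timestamp;
        final String value;

        Rec(String topic, long timestamp, String value) {
            this.topic = topic;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Returns the records in processing order: always take the head with the
    // lower timestamp. On a tie the table record goes first, so the table is
    // never behind the event stream.
    static List<Rec> merge(List<Rec> tableTopic, List<Rec> eventTopic) {
        Deque<Rec> table = new ArrayDeque<>(tableTopic);
        Deque<Rec> events = new ArrayDeque<>(eventTopic);
        List<Rec> out = new ArrayList<>();
        while (!table.isEmpty() || !events.isEmpty()) {
            boolean takeTable;
            if (events.isEmpty()) {
                takeTable = true;
            } else if (table.isEmpty()) {
                takeTable = false;
            } else {
                takeTable = table.peek().timestamp <= events.peek().timestamp;
            }
            out.add(takeTable ? table.poll() : events.poll());
        }
        return out;
    }
}
```

In this model, a changelog whose records all carry older timestamps than
the events is drained first, which is the bootstrap-like effect described
above. If the timestamps do not reflect that order, the effect is lost.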

If you want more control you can override the timestamp extractor that
associates time and hence priority for the streams:
https://kafka.apache.org/0100/javadoc/org/apache/kafka/streams/processor/TimestampExtractor.html
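
As a rough illustration of what such an override could decide, the
sketch below assigns timestamp 0 to every record from the table topic and
a clock reading to everything else, so table records always sort first.
It models only the decision logic in plain Java. In a real application
this logic would live in an implementation of the TimestampExtractor
interface linked above; the class BootstrapPriorityExtractor, the topic
names, and the injected clock are assumptions made for the example.

```java
import java.util.function.LongSupplier;

// Hypothetical model of the decision a custom timestamp extractor could make.
final class BootstrapPriorityExtractor {
    private final String tableTopic;
    private final LongSupplier clock;

    BootstrapPriorityExtractor(String tableTopic, LongSupplier clock) {
        this.tableTopic = tableTopic;
        this.clock = clock;
    }

    // Table-topic records get timestamp 0 (highest priority); all other
    // records get the current clock reading, as a wall-clock extractor would.
    long extract(String topic) {
        return topic.equals(tableTopic) ? 0L : clock.getAsLong();
    }
}
```

For example, new BootstrapPriorityExtractor("profiles",
System::currentTimeMillis) would rank every "profiles" record ahead of
any event. Note that a constant timestamp would also affect any
time-based operation applied to the table topic.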

-Jay

On Tue, Jun 28, 2016 at 2:49 PM, Rohit Valsakumar <rv...@tivo.com>
wrote:

> Hi all,
>
> Is there a way to consume all the contents of a Kafka topic into a KTable
> before doing a left join with another KStream?
>
> I am looking at something that simulates a bootstrap topic in a Samza job.
>
> Thanks,
> Rohit Valsakumar