Posted to users@kafka.apache.org by Ali Akhtar <al...@gmail.com> on 2017/04/27 21:22:14 UTC

Caching in Kafka Streams to ignore garbage message

I have a Kafka topic which will receive a large amount of data.

This data has an 'id' field. I need to look up the id in an external db to
see if we are tracking that id; if so, we process that message, and if
not, we ignore it.

99% of the data will be for ids which are not being tracked - 1% or so will
be for ids which are tracked.

My concern is that there would be a lot of round trips to the db just to
check the id, and whether it would be better to cache the tracked ids
somewhere, so that other ids can be ignored.

I was considering sending a message to another topic (or the same one)
whenever a new id is added to the track list, so that the id then gets
processed on the node which will process the messages.

Should I just cache all ids on all nodes (which may be a large amount), or
is there a way to cache an id only on the same Kafka Streams node which
will receive the data for that id?

Re: Caching in Kafka Streams to ignore garbage message

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Ah. Sorry.

You are right. Nevertheless, you can set a non-null dummy value such as
`new byte[0]` instead of the actual "tuple", so as not to blow up your
storage requirements.


-Matthias
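The changelog semantics behind this suggestion can be sketched outside Kafka in a few lines of plain Python (illustrative only; `materialize` is a made-up helper standing in for the state store):

```python
# A changelog topic materialized into a table: a null (None) value is a
# tombstone and deletes the key; any non-null value, even a zero-length
# one, keeps the key present in the table.

def materialize(changelog):
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # tombstone: delete the key
        else:
            table[key] = value    # upsert
    return table

# Writing None would delete "id-1"; a zero-length dummy keeps it around
# without storing any real payload.
bad = materialize([("id-1", None)])
good = materialize([("id-1", b""), ("id-2", b"")])
print(bad, good)
```

This is why the dummy value must be non-null: with `None` the id would never make it into the table at all.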




Re: Caching in Kafka Streams to ignore garbage message

Posted by Michal Borowiecki <mi...@openbet.com>.
Apologies, I must not have made myself clear.

I meant the values in the records coming from the input topic (which in
turn are coming from Kafka Connect in the example at hand), and not the
records coming out of the join.

My intention was to warn against sending null values from Kafka Connect to
the topic that is then meant to be read in as a KTable to filter against.


Am I clearer now?


Cheers,

Michał



-- 
Signature
<http://www.openbet.com/> 	Michal Borowiecki
Senior Software Engineer L4
	T: 	+44 208 742 1600

	
	+44 203 249 8448

	
	
	E: 	michal.borowiecki@openbet.com
	W: 	www.openbet.com <http://www.openbet.com/>

	
	OpenBet Ltd

	Chiswick Park Building 9

	566 Chiswick High Rd

	London

	W4 5XT

	UK

	
<https://www.openbet.com/email_promo>

This message is confidential and intended only for the addressee. If you 
have received this message in error, please immediately notify the 
postmaster@openbet.com <ma...@openbet.com> and delete it 
from your system as well as any copies. The content of e-mails as well 
as traffic data may be monitored by OpenBet for employment and security 
purposes. To protect the environment please do not print this e-mail 
unless necessary. OpenBet Ltd. Registered Office: Chiswick Park Building 
9, 566 Chiswick High Road, London, W4 5XT, United Kingdom. A company 
registered in England and Wales. Registered no. 3134634. VAT no. 
GB927523612


Re: Caching in Kafka Streams to ignore garbage message

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Your observation is correct.

If you use an inner KStream-KTable join, the join will implement the
filter automatically, as the join will not return any result for
unmatched records.


-Matthias



On 4/30/17 7:23 AM, Michal Borowiecki wrote:
> I have something working on the same principle (except not using
> connect), that is, I put ids to filter on into a ktable and then (inner)
> join a kstream with that ktable.
> 
> I don't believe the value can be null though. In a changlog null value
> is interpreted as a delete so won't be put into a ktable.
> 
> The RocksDB store, for one, does this:
> 
> private void putInternal(byte[] rawKey, byte[] rawValue) {
>     if (rawValue == null) {
>         try {
>             db.delete(wOptions, rawKey);
> 
> But any non-null value would do.
> Please correct me if miss-understood.
> 
> Cheers,
> Michał
> 
> On 27/04/17 22:44, Matthias J. Sax wrote:
>>>> I'd like to avoid repeated trips to the db, and caching a large amount of
>>>> data in memory.
>> Lookups to the DB would be hard to get done anyway. Ie, it would not
>> perform well, as all your calls would need to be synchronous...
>>
>>
>>>> Is it possible to send a message w/ the id as the partition key to a topic,
>>>> and then use the same id as the key, so the same node which will receive
>>>> the data for an id is the one which will process it?
>> That is what I did propose (maybe it was not clear). If you use Connect,
>> you can just import the ID into Kafka and leave the value empty (ie,
>> null). This reduced you cache data to a minimum. And the KStream-KTable
>> join work as you described it :)
>>
>>
>> -Matthias
>>
>> On 4/27/17 2:37 PM, Ali Akhtar wrote:
>>> I'd like to avoid repeated trips to the db, and caching a large amount of
>>> data in memory.
>>>
>>> Is it possible to send a message w/ the id as the partition key to a topic,
>>> and then use the same id as the key, so the same node which will receive
>>> the data for an id is the one which will process it?
>>>
>>>
>>> On Fri, Apr 28, 2017 at 2:32 AM, Matthias J. Sax <ma...@confluent.io>
>>> wrote:
>>>
>>>> The recommended solution would be to use Kafka Connect to load you DB
>>>> data into a Kafka topic.
>>>>
>>>> With Kafka Streams you read your db-topic as KTable and do a (inne)
>>>> KStream-KTable join to lookup the IDs.
>>>>
>>>>
>>>> -Matthias
>>>>
>>>> On 4/27/17 2:22 PM, Ali Akhtar wrote:
>>>>> I have a Kafka topic which will receive a large amount of data.
>>>>>
>>>>> This data has an 'id' field. I need to look up the id in an external db,
>>>>> see if we are tracking that id, and if yes, we process that message, if
>>>>> not, we ignore it.
>>>>>
>>>>> 99% of the data will be for ids which are not being tracked - 1% or so
>>>> will
>>>>> be for ids which are tracked.
>>>>>
>>>>> My concern is, that there'd be a lot of round trips to the db made just
>>>> to
>>>>> check the id, and if it'd be better to cache the ids being tracked
>>>>> somewhere, so other ids are ignored.
>>>>>
>>>>> I was considering sending a message to another (or the same topic)
>>>> whenever
>>>>> a new id is added to the track list, and that id should then get
>>>> processed
>>>>> on the node which will process the messages.
>>>>>
>>>>> Should I just cache all ids on all nodes (which may be a large amount),
>>>> or
>>>>> is there a way to only cache the id on the same kafka streams node which
>>>>> will receive data for that id?
>>>>>
>>>>
> 
> -- 
> Signature
> <http://www.openbet.com/> 	Michal Borowiecki
> Senior Software Engineer L4
> 	T: 	+44 208 742 1600
> 
> 	
> 	+44 203 249 8448
> 
> 	
> 	 
> 	E: 	michal.borowiecki@openbet.com
> 	W: 	www.openbet.com <http://www.openbet.com/>
> 
> 	
> 	OpenBet Ltd
> 
> 	Chiswick Park Building 9
> 
> 	566 Chiswick High Rd
> 
> 	London
> 
> 	W4 5XT
> 
> 	UK
> 
> 	
> <https://www.openbet.com/email_promo>
> 
> This message is confidential and intended only for the addressee. If you
> have received this message in error, please immediately notify the
> postmaster@openbet.com <ma...@openbet.com> and delete it
> from your system as well as any copies. The content of e-mails as well
> as traffic data may be monitored by OpenBet for employment and security
> purposes. To protect the environment please do not print this e-mail
> unless necessary. OpenBet Ltd. Registered Office: Chiswick Park Building
> 9, 566 Chiswick High Road, London, W4 5XT, United Kingdom. A company
> registered in England and Wales. Registered no. 3134634. VAT no.
> GB927523612
> 


Re: Caching in Kafka Streams to ignore garbage message

Posted by Michal Borowiecki <mi...@openbet.com>.
I have something working on the same principle (except not using
Connect); that is, I put the ids to filter on into a KTable and then
(inner-)join a KStream with that KTable.

I don't believe the value can be null, though. In a changelog, a null
value is interpreted as a delete, so it won't be put into the KTable.

The RocksDB store, for one, does this:

private void putInternal(byte[] rawKey, byte[] rawValue) {
    if (rawValue == null) {
        try {
            db.delete(wOptions, rawKey);

But any non-null value would do.
Please correct me if I've misunderstood.

Cheers,
Michał




Re: Caching in Kafka Streams to ignore garbage message

Posted by "Matthias J. Sax" <ma...@confluent.io>.
>> I'd like to avoid repeated trips to the db, and caching a large amount of
>> data in memory.

Lookups to the DB would be hard to make work well anyway, i.e., they would
not perform well, as all your calls would need to be synchronous...


>> Is it possible to send a message w/ the id as the partition key to a topic,
>> and then use the same id as the key, so the same node which will receive
>> the data for an id is the one which will process it?

That is what I proposed (maybe it was not clear). If you use Connect, you
can just import the ID into Kafka and leave the value empty (i.e., null).
This reduces your cached data to a minimum, and the KStream-KTable join
works as you described :)


-Matthias



Re: Caching in Kafka Streams to ignore garbage message

Posted by Ali Akhtar <al...@gmail.com>.
I'd like to avoid repeated trips to the db, and caching a large amount of
data in memory.

Is it possible to send a message w/ the id as the partition key to a topic,
and then use the same id as the key, so the same node which will receive
the data for an id is the one which will process it?
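The idea in this question can be sketched in plain Python (illustrative only; Kafka's default partitioner actually uses murmur2 over the serialized key, but any stable hash shows the principle that equal keys map to equal partitions):

```python
# Records with the same key always map to the same partition, so the
# id-topic record and the data-topic record for one id are handled by
# the same stream task (assuming both topics have equal partition counts).

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Stand-in for Kafka's hash-based partitioner; sum-of-bytes is a
    # deliberately simple, deterministic stand-in for murmur2.
    return sum(key.encode()) % NUM_PARTITIONS

id_record = ("id-42", None)        # from the tracked-ids topic
data_record = ("id-42", "payload")  # from the data topic

p1 = partition_for(id_record[0])
p2 = partition_for(data_record[0])
assert p1 == p2  # same key -> same partition -> same node
print(p1, p2)
```

So a node only ever needs to cache the ids that hash to the partitions it owns, not the full id set.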



Re: Caching in Kafka Streams to ignore garbage message

Posted by "Matthias J. Sax" <ma...@confluent.io>.
The recommended solution would be to use Kafka Connect to load your DB
data into a Kafka topic.

With Kafka Streams, you then read your db-topic as a KTable and do an
(inner) KStream-KTable join to look up the IDs.


-Matthias
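The join-as-filter behaviour recommended here can be sketched outside Kafka in plain Python (the dict and functions are illustrative stand-ins for the KTable state store and the join, not Kafka APIs):

```python
# Simulate an inner KStream-KTable join used as a filter: only stream
# records whose key exists in the table pass through; everything else
# is silently dropped.

tracked_ids = {}  # stands in for the KTable state store (id -> dummy value)

def on_table_record(key, value):
    """Apply a changelog record to the table (None acts as a tombstone)."""
    if value is None:
        tracked_ids.pop(key, None)
    else:
        tracked_ids[key] = value

def on_stream_record(key, value):
    """Inner join: drop the record unless its id is in the table."""
    if key in tracked_ids:
        return (key, value, tracked_ids[key])  # joined result
    return None  # no match -> record is ignored

# Load tracked ids (as if imported from the DB via Connect).
on_table_record("id-1", b"")
on_table_record("id-2", b"")

results = [on_stream_record(k, v)
           for k, v in [("id-1", "a"), ("id-9", "b"), ("id-2", "c")]]
print([r for r in results if r is not None])
```

The 99% of records for untracked ids fall out of the join with no DB round trip at all.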
