Posted to users@kafka.apache.org by Xuyen On <xo...@ancestry.com> on 2014/01/09 02:21:21 UTC

Duplicate records in Kafka 0.7

Hi,

I would like to check whether other people are seeing duplicate records with Kafka 0.7. I've read the JIRAs, and I believe duplicates are still possible when using message compression on Kafka 0.7. I'm seeing duplicate records in the range of 6-13%. Is this normal?

If you're using Kafka 0.7 with message compression enabled, can you please let me know whether you're seeing duplicate records and, if so, what percentage?

Also, please let me know what sort of deduplication strategy you're using.

Thanks!



Re: Duplicate records in Kafka 0.7

Posted by Jun Rao <ju...@gmail.com>.
It depends on how you process a batch of compressed messages. In 0.7, the
message offset only advances at the compressed message set boundary. So, if
you always finish processing all messages in a compressed set, there
shouldn't be any duplicates. If, say, you stop after consuming only 3
messages in a compressed set of 10, then when you refetch, you will get the
first 3 messages again.
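
In code terms, the safe pattern is roughly the following. This is only a
sketch -- Msg and CompressedSet are simplified stand-ins for illustration,
not the actual 0.7 consumer API:

import java.util.List;

// Simplified stand-in types for illustration -- not the Kafka 0.7 API.
record Msg(long offset, String payload) {}
record CompressedSet(long nextSetOffset, List<Msg> messages) {}

public class SetBoundaryConsumer {
    private long checkpointedOffset = 0L;

    // Process every message in the compressed set before checkpointing.
    // The fetch offset only advances at the set boundary, so stopping
    // mid-set means a refetch replays the whole set, and the messages
    // already handled come back as duplicates.
    void consume(CompressedSet set) {
        for (Msg m : set.messages()) {
            process(m);
        }
        // Safe to advance only once the whole set is done.
        checkpointedOffset = set.nextSetOffset();
    }

    void process(Msg m) {
        System.out.println("processed offset " + m.offset());
    }

    public static void main(String[] args) {
        CompressedSet set = new CompressedSet(10,
                List.of(new Msg(7, "a"), new Msg(8, "b"), new Msg(9, "c")));
        new SetBoundaryConsumer().consume(set);
    }
}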

Thanks,

Jun


On Fri, Jan 10, 2014 at 11:17 PM, Xuyen On <xo...@ancestry.com> wrote:

> Actually, most of the duplicates I was seeing were due to a bug in the old
> Hive version I'm using (0.9).
> But I am still seeing some duplicates, although fewer. Instead of 3-13%,
> I'm now seeing less than 1%. This appears to hold for each consumer batch,
> which is currently set to 1,000,000 messages. Does that seem more
> reasonable?
>
> -----Original Message-----
> From: Joel Koshy [mailto:jjkoshy.w@gmail.com]
> Sent: Thursday, January 09, 2014 7:07 AM
> To: users@kafka.apache.org
> Subject: Re: Duplicate records in Kafka 0.7
>
> You mean duplicate records on the consumer side? Duplicates are possible
> if there are consumer failures and another consumer instance resumes from
> an earlier offset. They are also possible if there are producer retries
> due to exceptions while producing. Do you see any of these errors in your
> logs? Besides these scenarios, though, you shouldn't be seeing duplicates.
>
> Thanks,
>
> Joel
>
>
> On Wed, Jan 8, 2014 at 5:21 PM, Xuyen On <xo...@ancestry.com> wrote:
> > Hi,
> >
> > I would like to check whether other people are seeing duplicate
> records with Kafka 0.7. I've read the JIRAs, and I believe duplicates are
> still possible when using message compression on Kafka 0.7. I'm seeing
> duplicate records in the range of 6-13%. Is this normal?
> >
> > If you're using Kafka 0.7 with message compression enabled, can you
> please let me know whether you're seeing duplicate records and, if so,
> what percentage?
> >
> > Also, please let me know what sort of deduplication strategy you're
> using.
> >
> > Thanks!
> >
> >
>
>
>

RE: Duplicate records in Kafka 0.7

Posted by Xuyen On <xo...@ancestry.com>.
Actually, most of the duplicates I was seeing were due to a bug in the old Hive version I'm using (0.9).
But I am still seeing some duplicates, although fewer. Instead of 3-13%, I'm now seeing less than 1%. This appears to hold for each consumer batch, which is currently set to 1,000,000 messages. Does that seem more reasonable?
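
For reference, here is a quick sketch of one way to measure that rate per
batch. The record "key" here is an assumption -- an ID field or payload
hash that identifies a record, not something Kafka 0.7 provides:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateRateEstimator {
    // Fraction of records in a batch whose key was already seen in that
    // batch. A "key" is whatever uniquely identifies a record -- an ID
    // field or a payload hash; it is chosen by you, not given by Kafka.
    static double duplicateRate(List<String> recordKeys) {
        Set<String> seen = new HashSet<>();
        long dupes = 0;
        for (String key : recordKeys) {
            if (!seen.add(key)) {
                dupes++;
            }
        }
        return recordKeys.isEmpty() ? 0.0 : (double) dupes / recordKeys.size();
    }

    public static void main(String[] args) {
        // With 1,000,000-message batches, a rate below 0.01 means fewer
        // than 10,000 duplicate records per batch.
        System.out.println(duplicateRate(List.of("a", "b", "a", "c"))); // 0.25
    }
}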

-----Original Message-----
From: Joel Koshy [mailto:jjkoshy.w@gmail.com] 
Sent: Thursday, January 09, 2014 7:07 AM
To: users@kafka.apache.org
Subject: Re: Duplicate records in Kafka 0.7

You mean duplicate records on the consumer side? Duplicates are possible if there are consumer failures and another consumer instance resumes from an earlier offset. They are also possible if there are producer retries due to exceptions while producing. Do you see any of these errors in your logs? Besides these scenarios, though, you shouldn't be seeing duplicates.

Thanks,

Joel


On Wed, Jan 8, 2014 at 5:21 PM, Xuyen On <xo...@ancestry.com> wrote:
> Hi,
>
> I would like to check whether other people are seeing duplicate records with Kafka 0.7. I've read the JIRAs, and I believe duplicates are still possible when using message compression on Kafka 0.7. I'm seeing duplicate records in the range of 6-13%. Is this normal?
>
> If you're using Kafka 0.7 with message compression enabled, can you please let me know whether you're seeing duplicate records and, if so, what percentage?
>
> Also, please let me know what sort of deduplication strategy you're using.
>
> Thanks!
>
>



Re: Duplicate records in Kafka 0.7

Posted by Joel Koshy <jj...@gmail.com>.
You mean duplicate records on the consumer side? Duplicates are
possible if there are consumer failures and another consumer
instance resumes from an earlier offset. They are also possible if there
are producer retries due to exceptions while producing. Do you see any
of these errors in your logs? Besides these scenarios, though, you
shouldn't be seeing duplicates.
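
As for dedup strategy: one common consumer-side approach is to drop any
record whose unique ID was already seen within a bounded window. A minimal
sketch, assuming each message carries a producer-assigned unique ID (Kafka
0.7 does not provide one itself):

import java.util.LinkedHashMap;
import java.util.Map;

// Consumer-side dedup sketch: skip any record whose ID was seen within a
// bounded window. Assumes producers embed a unique ID in each message.
public class Deduplicator {
    private final Map<String, Boolean> seen;

    Deduplicator(final int windowSize) {
        // Access-ordered LinkedHashMap acting as an LRU cache: once the
        // window fills up, the least recently seen ID is evicted.
        this.seen = new LinkedHashMap<String, Boolean>(windowSize, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > windowSize;
            }
        };
    }

    // Returns true if the ID has not been seen before (process the record),
    // false if it is a duplicate within the window (skip it).
    boolean firstTime(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        Deduplicator dedup = new Deduplicator(1_000_000);
        System.out.println(dedup.firstTime("id-1")); // true
        System.out.println(dedup.firstTime("id-1")); // false -> duplicate
    }
}

Sizing the window to cover at least one fetch batch keeps memory bounded
while still catching the refetch replays described above.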

Thanks,

Joel


On Wed, Jan 8, 2014 at 5:21 PM, Xuyen On <xo...@ancestry.com> wrote:
> Hi,
>
> I would like to check whether other people are seeing duplicate records with Kafka 0.7. I've read the JIRAs, and I believe duplicates are still possible when using message compression on Kafka 0.7. I'm seeing duplicate records in the range of 6-13%. Is this normal?
>
> If you're using Kafka 0.7 with message compression enabled, can you please let me know whether you're seeing duplicate records and, if so, what percentage?
>
> Also, please let me know what sort of deduplication strategy you're using.
>
> Thanks!
>
>