Posted to user@chukwa.apache.org by Corbin Hoenes <co...@tynt.com> on 2010/03/18 16:59:50 UTC

duplicate data

Does anyone have more information about how Chukwa removes duplicates during demux? How does it decide what is a duplicate? There are two cases I am thinking of:

1 - we send the same log file to Chukwa twice
2 - we have the exact same line in a log file twice

Re: duplicate data

Posted by Ariel Rabkin <as...@gmail.com>.

The sequence ID of a chunk is, by default, the offset in the file of
its first byte.  We do some fairly complex hacks for file rotation, to
make sure that the IDs continue growing monotonically in that case.
If you start a tailer on a file and leave it running, each line will
get numbered uniquely. If you stop it and then start a new one at the
beginning of the file, you'll get duplicate data.

If you start a tailer, stop it, modify or overwrite the file, and then
start a new tailer, you'll get spurious duplicates.
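
To make the numbering concrete, here is a minimal sketch in Python. It is not Chukwa's code: number_lines is a hypothetical helper, it treats every line as its own chunk, and it ignores the file-rotation handling mentioned above. It only shows why restarting at the beginning of a file reproduces the same IDs, and why an overwritten file produces the same IDs for different content.

```python
from typing import List, Tuple


def number_lines(content: bytes, start_offset: int = 0) -> List[Tuple[int, bytes]]:
    """Pair each line with the file offset of its first byte (hypothetical helper)."""
    numbered = []
    offset = start_offset
    for line in content[start_offset:].splitlines(keepends=True):
        numbered.append((offset, line))
        offset += len(line)  # the next line starts right after this one
    return numbered


content = b"a\nbb\nccc\n"

# A tailer left running numbers each line once.
first_run = number_lines(content)

# A new tailer started at the beginning of the same file repeats the IDs.
second_run = number_lines(content)

# After the file is overwritten, the IDs repeat but the content differs.
overwritten = number_lines(b"x\nyy\nzzz\n")

print(first_run)
print([seq_id for seq_id, _ in overwritten])
```

In this sketch a tailer that resumes from a saved offset, for example number_lines(content, 2), continues the numbering instead of repeating it.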

--Ari

-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

Re: duplicate data

Posted by Corbin Hoenes <co...@tynt.com>.

So in scenario 1 the stream name should be the same, but how do sequence IDs get generated? If I tried to tail the same log file 24 hours after doing it the first time, would the chunks have the same seq ID?

Re: duplicate data

Posted by Ariel Rabkin <as...@gmail.com>.

Howdy,

Chukwa does duplicate detection as follows: Each Chunk of data comes
with a stream name (such as the name of a log file) and a sequence ID.
If two chunks have the same name and ID, they're duplicates.  The
content isn't inspected.

So in your example, the former will be treated as a duplicate, not the latter.
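
As a rough illustration of that rule, here is a sketch in Python. It is not the actual demux code: the Chunk type, its field names, and drop_duplicates are all hypothetical, and the sequence IDs are assumed to be byte offsets. The point is only that the key is the (stream name, sequence ID) pair and the payload is never compared.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a Chukwa chunk; field names are illustrative."""
    stream_name: str  # e.g. the name of the log file
    seq_id: int       # sequence ID assigned by the sender
    data: bytes       # payload; never consulted for duplicate detection


def drop_duplicates(chunks: Iterable[Chunk]) -> List[Chunk]:
    """Keep the first chunk seen for each (stream_name, seq_id) pair."""
    seen: Set[Tuple[str, int]] = set()
    unique: List[Chunk] = []
    for chunk in chunks:
        key = (chunk.stream_name, chunk.seq_id)
        if key in seen:
            continue  # same name and ID: a duplicate, whatever the content
        seen.add(key)
        unique.append(chunk)
    return unique


# Case 1: the same file is sent twice, so names and IDs repeat.
first_send = [Chunk("app.log", 0, b"started\n"), Chunk("app.log", 8, b"ready\n")]
resent = drop_duplicates(first_send + first_send)

# Case 2: the same line appears twice in one file, at different offsets.
repeated_line = [Chunk("app.log", 0, b"retry\n"), Chunk("app.log", 6, b"retry\n")]
kept = drop_duplicates(repeated_line)

print(len(resent), len(kept))
```

In the sketch, the second copy of the file is dropped in case 1, while both identical lines survive in case 2 because their IDs differ.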

--Ari
