
Posted to user@chukwa.apache.org by IvyTang <iv...@gmail.com> on 2012/03/28 09:56:03 UTC

More about the removing of duplicate chunks

Thanks to the simple archiver, we have removed almost all of the duplicate
chunks.

But we found that a few, very few, duplicate chunks are still left.

Strangely, the keys of these chunks are not the same. The DataType, StreamName,
and SeqId are the same, but the TimePartition values differ. The log data in
these chunks is identical.

Could we identify duplicate chunks using just the DataType, StreamName, and
SeqId? What does the TimePartition mean?

Thanks!


-- 
Best regards,

Ivy Tang

Re: More about the removing of duplicate chunks

Posted by Ariel Rabkin <as...@gmail.com>.

TimePartition tells you when the data showed up.

I think SeqID + StreamName is the right thing to match on -- if the
data is re-collected later, but it's the same data, yeah, you want to
treat it as a duplicate.
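
A minimal sketch of that matching rule, for illustration only. It assumes the
chunk keys have already been read into simple records carrying the four fields
named in this thread. The ChunkKey type, the dedupe function, and the sample
values are hypothetical and are not Chukwa APIs or real data. DataType is
included in the identity as the original question proposes; dropping it gives
the narrower SeqID + StreamName match.

```python
from typing import Iterable, List, NamedTuple


class ChunkKey(NamedTuple):
    # Hypothetical record with the four key fields discussed in this thread;
    # this is not the actual Chukwa archive key class.
    time_partition: int
    data_type: str
    stream_name: str
    seq_id: int


def dedupe(keys: Iterable[ChunkKey]) -> List[ChunkKey]:
    """Keep the first chunk seen for each (data_type, stream_name, seq_id)."""
    seen = set()
    unique = []
    for key in keys:
        # TimePartition is deliberately left out of the identity, so the same
        # data collected again at a later time counts as a duplicate.
        identity = (key.data_type, key.stream_name, key.seq_id)
        if identity not in seen:
            seen.add(identity)
            unique.append(key)
    return unique


# Illustrative, made-up keys: the second is the first re-collected later.
sample = [
    ChunkKey(1332892800000, "SysLog", "host1/var/log/messages", 4096),
    ChunkKey(1332896400000, "SysLog", "host1/var/log/messages", 4096),
    ChunkKey(1332896400000, "SysLog", "host1/var/log/messages", 8192),
]
unique_keys = dedupe(sample)
```

With these sample keys, the second entry is dropped because it differs from the
first only in its TimePartition.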

On Wed, Mar 28, 2012 at 12:56 AM, IvyTang <iv...@gmail.com> wrote:
> Thanks to the simple archiver, we have removed almost all of the duplicate
> chunks.
>
> But we found that a few, very few, duplicate chunks are still left.
>
> Strangely, the keys of these chunks are not the same. The DataType, StreamName,
> and SeqId are the same, but the TimePartition values differ. The log data in
> these chunks is identical.
>
> Could we identify duplicate chunks using just the DataType, StreamName, and
> SeqId? What does the TimePartition mean?
>
> Thanks!
>
>
> --
> Best regards,
>
> Ivy Tang
>
>
>



-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department