Posted to users@kafka.apache.org by Rajasekar Elango <re...@salesforce.com> on 2013/08/22 23:14:43 UTC

Differences in size of data replicated by mirror maker

Hi,

We are using MirrorMaker to replicate data between two Kafka clusters. I am
seeing a huge difference in the size of the logs in the data dir between a
broker in the source cluster and a broker in the destination cluster.

For example, the size of ~/data/Topic-0/ is about 910 G on the source broker,
but only 25 G on the destination broker. Segmented log files (~500 M) are
created about every 2 or 3 minutes on the source brokers, but only about every
25 minutes on the destination broker.

I verified that MirrorMaker is doing fine using the consumer offset checker:
not much lag, and offsets are incrementing. I also verified that
topics/partitions are not under-replicated in either the source or the target
cluster. What is the reason for this difference in disk usage?


-- 
Thanks,
Raja.

Re: Differences in size of data replicated by mirror maker

Posted by Jun Rao <ju...@gmail.com>.
We have JMX beans that report #messages per topic. Does the total count
match between the two clusters?

Thanks,

Jun



Re: Differences in size of data replicated by mirror maker

Posted by Guozhang Wang <wa...@gmail.com>.
We are currently working on the following JIRA to avoid decompressing and
re-compressing at MirrorMaker. When this is done, the size of the logs on the
source and target clusters should be the same, as long as the batch size of
the MirrorMaker producer is the same as the batch size of the source
producer:

https://issues.apache.org/jira/browse/KAFKA-1011


On Fri, Aug 23, 2013 at 7:22 AM, Jay Kreps <ja...@gmail.com> wrote:

> Ah, one thing to be aware of is that the effectiveness of compression is
> directly related to the producer batch size--more batching, more
> compression. So even if you use compression on both clusters the mirror may
> be much smaller.
>
> -jay
>



-- 
-- Guozhang

Re: Differences in size of data replicated by mirror maker

Posted by Jay Kreps <ja...@gmail.com>.
Ah, one thing to be aware of is that the effectiveness of compression is
directly related to the producer batch size--more batching, more
compression. So even if you use compression on both clusters the mirror may
be much smaller.

-jay
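
A minimal sketch of the batching effect described above. It is not Kafka code
and does not reproduce Kafka's message format; it only gzips the same synthetic
messages in batches of different sizes and totals the compressed bytes. The
message shape and the helper names (make_messages, compressed_size) are
assumptions made up for this illustration.

```python
import gzip
import json
import random


def make_messages(n, seed=0):
    """Build n small, repetitive JSON messages (hypothetical payloads)."""
    rng = random.Random(seed)
    return [
        json.dumps({
            "user_id": rng.randrange(1000),
            "event": "page_view",
            "path": "/products/%d" % rng.randrange(50),
        }).encode("utf-8")
        for _ in range(n)
    ]


def compressed_size(messages, batch_size):
    """Total bytes after gzipping the messages batch_size at a time."""
    total = 0
    for i in range(0, len(messages), batch_size):
        batch = b"\n".join(messages[i:i + batch_size])
        total += len(gzip.compress(batch))
    return total


messages = make_messages(2000)
raw_bytes = sum(len(m) for m in messages)
print("raw bytes:", raw_bytes)
for batch_size in (1, 10, 200):
    print("batch size", batch_size, "->", compressed_size(messages, batch_size), "bytes")
```

Larger batches give gzip more repeated content to work with and spread its
fixed per-stream overhead over more messages, so the total shrinks as the
batch size grows. The exact ratios depend entirely on the payloads.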

On Friday, August 23, 2013, Rajasekar Elango wrote:

> Thanks Guozhang, Jun,
>
> Yes, we are doing gzip compression, and that should be the reason for the
> difference in disk usage. I had a typo: the size is actually 91G in the
> source cluster. So a 25G/91G ratio makes sense for compression.
>
> Thanks,
> Raja.
>
>

Re: Differences in size of data replicated by mirror maker

Posted by Rajasekar Elango <re...@salesforce.com>.
Thanks Guozhang, Jun,

Yes, we are doing gzip compression, and that should be the reason for the
difference in disk usage. I had a typo: the size is actually 91G in the
source cluster. So a 25G/91G ratio makes sense for compression.
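
As a quick check of the corrected figures, the arithmetic can be written out
as below. The variable names are only for illustration; the two sizes are the
ones quoted in this thread.

```python
source_gb = 91.0   # corrected size of Topic-0 on the source broker
mirror_gb = 25.0   # size of Topic-0 on the destination broker

ratio = mirror_gb / source_gb       # fraction of the source size
reduction = source_gb / mirror_gb   # how many times smaller the mirror is

print(round(ratio, 2))      # 0.27
print(round(reduction, 2))  # 3.64
```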

Thanks,
Raja.


On Thu, Aug 22, 2013 at 7:00 PM, Guozhang Wang <wa...@gmail.com> wrote:

> When you state the numbers, are they the same across instances in the
> cluster, meaning that Topic-0 would have 910*5 GB in source cluster and
> 25*5 GB in target cluster?
>
> Another possibility is that MirrorMaker uses compression on the producer
> side, but I would be surprised if the compression rate could be 25/910.
>
> Guozhang
>
>



-- 
Thanks,
Raja.

Re: Differences in size of data replicated by mirror maker

Posted by Guozhang Wang <wa...@gmail.com>.
When you state the numbers, are they the same across instances in the
cluster, meaning that Topic-0 would have 910*5 GB in source cluster and
25*5 GB in target cluster?

Another possibility is that MirrorMaker uses compression on the producer
side, but I would be surprised if the compression rate could be 25/910.

Guozhang


On Thu, Aug 22, 2013 at 3:48 PM, Rajasekar Elango <re...@salesforce.com> wrote:

> Yes, both the source and target clusters have 5 brokers.
>
> Sent from my iPhone
>



-- 
-- Guozhang

Re: Differences in size of data replicated by mirror maker

Posted by Rajasekar Elango <re...@salesforce.com>.
Yes, both the source and target clusters have 5 brokers.

Sent from my iPhone

On Aug 22, 2013, at 6:11 PM, Guozhang Wang <wa...@gmail.com> wrote:

> Hello Rajasekar,
>
> Are the sizes of the source and target clusters the same?
>
> Guozhang
>
>

Re: Differences in size of data replicated by mirror maker

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Rajasekar,

Are the sizes of the source and target clusters the same?

Guozhang





-- 
-- Guozhang