Posted to users@kafka.apache.org by Craig Swift <cr...@returnpath.com.INVALID> on 2015/09/09 17:29:00 UTC

MirrorMaker - Not consuming from all partitions

Hello,

Hope everyone is doing well. I was hoping to get some assistance with a
strange issue we're experiencing while using MirrorMaker to pull data
down from an 8-node Kafka cluster in AWS into our data center. Both Kafka
clusters and the mirror are running version 0.8.1.1, each with its own
dedicated ZooKeeper cluster (running 3.4.5).

The problem we're seeing is that the mirror starts up and begins consuming
a specific topic from the cluster. It correctly attaches to all 24
partitions for that topic, but inevitably some partitions either don't get
read at all or are read at a very slow rate. Those partitions are always
associated with the same brokers: for example, all partitions on broker 2
won't be read, or all partitions on brokers 2 and 4 won't be read. On
restarting the mirror, these 'stuck' partitions may stay the same or move.
If they move, the backlog is drained very quickly. If we add more mirrors
for additional capacity, the same thing happens, except that each mirror
has its own set of stuck partitions. I've included the mirror's
configurations below along with samples from the logs.

1) The partition issue seems to happen when the mirror first starts up.
Once in a blue moon it reads from everything normally, but on restart it
can easily get back into this state.

2) We're fairly sure it isn't a processing/throughput issue. We can turn
the mirror off for a while, build up a large backlog of data, and when it
is re-enabled it chews through the backlog very quickly, apart from the
handful of stuck partitions.

3) We've looked at both the ZooKeeper and broker logs and there doesn't
seem to be anything out of the ordinary. We see the mirror connecting, and
there are a few info messages about ZooKeeper nodes already existing, etc.
No specific errors.

4) We've enabled debug logging on the mirror and noticed that during the
ZooKeeper heartbeat/offset updates, messages like the following never
appear for the 'stuck' partitions:

[2015-09-08 18:38:12,157] DEBUG Reading reply sessionid:0x14f956bd57d21ee,
packet:: clientPath:null serverPath:null finished:false header:: 357,5
 replyHeader:: 357,8597251893,0  request::
'/consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5,#34303537353838,-1
 response::
s{4295371756,8597251893,1439969185754,1441759092134,19500,0,0,0,7,0,4295371756}
 (org.apache.zookeeper.ClientCnxn)

In other words, we see this message for every partition that is being
processed, but never for the stuck ones. There are no errors in the log
prior to this, though, and once in a great while we might see a log entry
for one of the stuck partitions. (A couple of commands for spot-checking
individual partitions are sketched after point 5 below.)

5) We've checked latency/response times to ZooKeeper from both the brokers
and the mirror, and everything appears fine.
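
For reference, something like the following can be used to spot-check a
single partition from the command line. The partition number and the
source-broker hostname are placeholders, and the exact tool flags may vary
a bit between 0.8.x releases, so treat this as a sketch:

# Committed offset for partition 5, read straight from ZooKeeper:
bin/zkCli.sh -server zk.strange.dev.net:2181 \
  get /consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5

# Which consumer thread currently owns that partition:
bin/zkCli.sh -server zk.strange.dev.net:2181 \
  get /consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/owners/MessageHeadersBody/5

# Log-end offset for the same partition on the source cluster, for comparison:
bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list source-broker00:9092 --topic MessageHeadersBody \
  --partitions 5 --time -1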

Mirror consumer config:
group.id=mirror-kafkablk-kafka-gold-east-to-kafkablk-den
consumer.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
zookeeper.connect=zk.strange.dev.net:2181
fetch.message.max.bytes=15728640
socket.receive.buffer.bytes=64000000
socket.timeout.ms=60000
zookeeper.connection.timeout.ms=60000
zookeeper.session.timeout.ms=30000
zookeeper.sync.time.ms=4000
auto.offset.reset=smallest
auto.commit.interval.ms=20000

Mirror producer config:
client.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
metadata.broker.list=kafka00.lan.strange.dev.net:9092,
kafka01.lan.strange.dev.net:9092,kafka02.lan.strange.dev.net:9092,
kafka03.lan.strange.dev.net:9092,kafka04.lan.strange.dev.net:9092
request.required.acks=1
producer.type=async
request.timeout.ms=20000
retry.backoff.ms=1000
message.send.max.retries=6
serializer.class=kafka.serializer.DefaultEncoder
send.buffer.bytes=134217728
compression.codec=gzip

Mirror startup settings:
--num.streams 2 --num.producers 4
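
In case it helps, this is roughly how we launch it. The properties file
names and the topic whitelist below are illustrative, so don't read too
much into them; consumer.config points at the source cluster and
producer.config at the destination:

# MirrorMaker as shipped with 0.8.1.1:
bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config mirror-consumer.properties \
  --producer.config mirror-producer.properties \
  --num.streams 2 \
  --num.producers 4 \
  --whitelist 'MessageHeadersBody'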

Any thoughts/suggestions would be very helpful. At this point we're running
out of things to try.


Craig J. Swift
Software Engineer

Re: MirrorMaker - Not consuming from all partitions

Posted by Craig Swift <cr...@returnpath.com.INVALID>.
Scary but thanks! :) We'll start digging into the network and see if we can
find a smoking gun. Appreciate the response, thanks again.

Craig J. Swift
Software Engineer - Data Pipeline
ReturnPath Inc.
Work: 303-999-3220 Cell: 720-560-7038


Re: MirrorMaker - Not consuming from all partitions

Posted by Steve Miller <st...@idrathernotsay.com>.
   I have a vague feeling that I've seen stuff like this when the network on the broker that's disappearing is actually unreachable from time to time -- though I'd like to believe that's not such an issue when talking to AWS (though there could be a lot of screwed-up Internet between you and it, depending on exactly what you're doing).

   One thing you could consider doing would be to look at some of the traffic.  wireshark/tshark knows how to decode Kafka transactions, though digging through the output is... exciting, because there can be so very much of it.  I'd written up some notes on how to do that, which you can see at:

	http://mail-archives.apache.org/mod_mbox/kafka-users/201408.mbox/%3C20140812180358.GA24108@idrathernotsay.com%3E

(though I expect in your case, you'd need to be doing a capture all the time, and then figure out when MirrorMaker stops fetching from that broker, and then stare at a ton of data to find whatever protocol thing has happened -- all of which is ugly enough that I'd been reluctant to mention it until now).
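
   If you do go that route, a ring-buffer capture is probably the least
painful way to keep it running all the time without filling a disk.
Something along these lines, say -- the interface name and file sizes are
guesses, and whether your tshark build has the Kafka dissector depends on
its version:

# Rolling capture of broker traffic: 50 files of ~100 MB each, oldest overwritten.
tcpdump -i eth0 -s 0 -C 100 -W 50 -w /var/tmp/kafka-mm.pcap 'tcp port 9092'

# Later, decode one of the rotated files as Kafka (tcpdump appends a file
# number) and page through it:
tshark -r /var/tmp/kafka-mm.pcap00 -d tcp.port==9092,kafka -V | less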

	-Steve


Re: MirrorMaker - Not consuming from all partitions

Posted by Craig Swift <cr...@returnpath.com.INVALID>.
Just wanted to bump this again and see if the community had any thoughts, or
if we're just missing something stupid. For added context, the topic we're
reading from has 24 partitions and we see roughly 15k messages per minute.
As I mentioned before, the throughput seems fine, but I'm not entirely sure
how MirrorMaker cycles through its topics/partitions or why it would
read very slowly or bypass reading from certain partitions.

Craig J. Swift
Software Engineer - Data Pipeline
ReturnPath Inc.
Work: 303-999-3220 Cell: 720-560-7038
