Posted to users@kafka.apache.org by Shlomi Hazan <hz...@gmail.com> on 2014/09/03 21:01:25 UTC

Error in acceptor (kafka.network.Acceptor)

Hi,

I am trying to load a cluster with more than 10K connections, and ran
into the error in the subject.
Is there any limit on Kafka's side? If so, is it configurable, and how?
At first glance, it looks like the selector accepting the connections is
overflowing...

Thanks.
-- 
Shlomi

Re: Error in acceptor (kafka.network.Acceptor)

Posted by Shlomi Hazan <sh...@viber.com>.

No, just a bare CentOS 6.5 on an EC2 instance.
On Sep 11, 2014 1:39 AM, "Jun Rao" <ju...@gmail.com> wrote:

> I meant whether you start the broker inside a service container such as
> Jetty or Tomcat.
>
> Thanks,
>
> Jun

Re: Error in acceptor (kafka.network.Acceptor)

Posted by Jun Rao <ju...@gmail.com>.

I meant whether you start the broker inside a service container such as
Jetty or Tomcat.

Thanks,

Jun

On Wed, Sep 10, 2014 at 12:28 AM, Shlomi Hazan <sh...@viber.com> wrote:

> Hi, sorry, what do you mean by 'container'? I use bare EC2 instances...
> Shlomi

Re: Error in acceptor (kafka.network.Acceptor)

Posted by Shlomi Hazan <sh...@viber.com>.

Hi, sorry, what do you mean by 'container'? I use bare EC2 instances...
Shlomi

On Wed, Sep 10, 2014 at 1:41 AM, Jun Rao <ju...@gmail.com> wrote:

> Are you starting the broker in some container? You want to make sure that
> the container doesn't override the open file handle limit.
>
> Thanks,
>
> Jun

Re: Error in acceptor (kafka.network.Acceptor)

Posted by Jun Rao <ju...@gmail.com>.

Are you starting the broker in some container? You want to make sure that
the container doesn't override the open file handle limit.
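
One way to check the limit that actually applies to a running broker, rather than to the shell that launched it, is to read the process's own limits file. This is only a minimal sketch: it assumes Linux (/proc/<pid>/limits), and the function name max_open_files is made up for illustration, not part of Kafka.

```python
import os


def max_open_files(pid):
    """Return (soft, hard) 'Max open files' limits for a running process.

    Reads /proc/<pid>/limits, so this is Linux-only. Values are returned as
    strings because a limit may be the word 'unlimited'.
    """
    with open("/proc/{}/limits".format(pid)) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Line layout: Max open files  <soft>  <hard>  files
                fields = line.split()
                return fields[3], fields[4]
    raise LookupError("no 'Max open files' line for pid {}".format(pid))


if __name__ == "__main__":
    # Hypothetical usage: substitute the broker's pid for os.getpid().
    soft, hard = max_open_files(os.getpid())
    print("soft={} hard={}".format(soft, hard))
```

If the soft value printed for the broker's pid is lower than what was configured for the user, something in the launch path (an init script, a wrapper, a container) has probably changed it.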

Thanks,

Jun

On Tue, Sep 9, 2014 at 12:05 AM, Shlomi Hazan <sh...@viber.com> wrote:

> Hi,
> It's probably beyond that. It may be an issue with the number of files
> Kafka can have open concurrently.

Re: Error in acceptor (kafka.network.Acceptor)

Posted by Shlomi Hazan <sh...@viber.com>.

Hi,
It's probably beyond that. It may be an issue with the number of files
Kafka can have open concurrently.
A previous conversation with Joe, in the thread "build failes for latest stable
source tgz (kafka_2.9.2-0.8.1.1)", ended up discussing this (Q's by Joe,
A's by me):

1. What else is in the logs? [see below]
2. Any other broker failure reason? [same: see below]
3. Did another broker fail after taking leadership? [How can I be sure? Ask
another broker to describe the topic?]
4. How do I measure the number of connections? [ls -l /proc/<pid>/fd | grep
socket | wc -l; I also ran watch on that]
5. Does that number equal the number of {new Producer}? [yes]
6. How many topics? [1] How many partitions? [504]
7. Are you using a partition key? [Yes, I use the Python client with:]

    class ProducerIdPartitioner(Partitioner):
        """
        Implements a partitioner which selects the target partition
        based on the sending producer ID
        """
        def partition(self, key, partitions):
            size = len(partitions)
            # The key is the producer ID, sent as a numeric string
            prod_id = int(key)
            idx = prod_id % size
            return partitions[idx]

8. Maybe running into an over-partitioned topic? [The producer instances are 6
machines * 84 procs * 24 threads, but I never got to start them all, because of
errors]
9. Are you running anything else? [yes, ZooKeeper]
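
The shell pipeline in answer 4 can also be written as a small script, which is handy when sampling the count in a loop. This is just a sketch of the same idea: it assumes Linux, where each entry in /proc/<pid>/fd is a symlink and sockets point at a target like "socket:[<inode>]". The name count_socket_fds is made up here.

```python
import os
import socket


def count_socket_fds(pid):
    """Count the file descriptors of process `pid` that refer to sockets.

    Linux-only: each entry in /proc/<pid>/fd is a symlink, and socket
    descriptors link to a target of the form 'socket:[<inode>]'.
    """
    fd_dir = "/proc/{}/fd".format(pid)
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            count += 1
    return count


if __name__ == "__main__":
    # Demonstrate on this process: open one extra socket and print the delta.
    before = count_socket_fds(os.getpid())
    s = socket.socket()
    print(count_socket_fds(os.getpid()) - before)
    s.close()
```

Reading another user's process (such as the broker) this way needs the same permissions as the ls command does.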

Answer to 1 and 2:
The errors I see on the Python client are first timeouts and then message
send failures, using sync send.

on the controller log:

controller.log.2014-08-26-13:[2014-08-26 13:40:44,317] ERROR
[Controller-1-to-broker-3-send-thread], Controller 1 epoch 3 failed to send
StopReplica request with correlation id 519 to broker
id:3,host:shlomi-kafka-broker-3,port:9092. Reconnecting to broker.
(kafka.controller.RequestSendThread)
controller.log.2014-08-26-13:[2014-08-26 13:40:44,319] ERROR
[Controller-1-to-broker-3-send-thread], Controller 1's connection to broker
id:3,host:shlomi-kafka-broker-3,port:9092 was unsuccessful
(kafka.controller.RequestSendThread)

on the server log (selected greps):
...
server.log.2014-08-27-01:[2014-08-27 01:44:23,143] ERROR
[ReplicaFetcherThread-4-2], Error for partition [vpq_android_gcm_h,270] to
broker 2:class kafka.common.NotLeaderForPartitionException
(kafka.server.ReplicaFetcherThread)
...
server.log.2014-08-27-12:[2014-08-27 12:08:34,638] ERROR Closing socket for
/10.184.150.54 because of error (kafka.network.Processor)

...
server.log.2014-08-28-07:[2014-08-28 07:57:35,944] ERROR [KafkaApi-1] Error
when processing fetch request for partition [vpq_android_gcm_h,184] offset
8798 from follower with correlation id 0 (kafka.server.KafkaApis)
...
server.log.2014-09-03-15:[2014-09-03 15:46:18,220] ERROR
[ReplicaFetcherThread-2-3], Error in fetch Name: FetchRequest; Version: 0;
CorrelationId: 177593; ClientId: ReplicaFetcherThread-2-3; ReplicaId: 1;
MaxWait: 1000 ms; MinBytes: 1 bytes; RequestInfo: [vpq_android_gcm_h,196]
-> PartitionFetchInfo(65283,8388608),[vpq_android_gcm_h,76] ->
PartitionFetchInfo(262787,8388608),[vpq_android_gcm_h,460] ->
PartitionFetchInfo(285709,8388608),[vpq_android_gcm_h,100] ->
PartitionFetchInfo(199405,8388608),[vpq_android_gcm_h,148] ->
PartitionFetchInfo(339032,8388608),[vpq_android_gcm_h,436] ->
PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,124] ->
PartitionFetchInfo(484447,8388608),[vpq_android_gcm_h,484] ->
PartitionFetchInfo(105945,8388608),[vpq_android_gcm_h,340] ->
PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,388] ->
PartitionFetchInfo(9,8388608),[vpq_android_gcm_h,316] ->
PartitionFetchInfo(194766,8388608),[vpq_android_gcm_h,364] ->
PartitionFetchInfo(139897,8388608),[vpq_android_gcm_h,292] ->
PartitionFetchInfo(195408,8388608),[vpq_android_gcm_h,28] ->
PartitionFetchInfo(329961,8388608),[vpq_android_gcm_h,172] ->
PartitionFetchInfo(436959,8388608),[vpq_android_gcm_h,268] ->
PartitionFetchInfo(59827,8388608),[vpq_android_gcm_h,244] ->
PartitionFetchInfo(259731,8388608),[vpq_android_gcm_h,220] ->
PartitionFetchInfo(61669,8388608),[vpq_android_gcm_h,412] ->
PartitionFetchInfo(563609,8388608),[vpq_android_gcm_h,4] ->
PartitionFetchInfo(360336,8388608),[vpq_android_gcm_h,52] ->
PartitionFetchInfo(378533,8388608) (kafka.server.ReplicaFetcherThread)
...
server.log.2014-09-03-14:[2014-09-03 14:04:18,548] ERROR Error in acceptor
(kafka.network.Acceptor)
...

These may not be all of the errors (other logs may contain more).

Joe said to just lower the number of connections, but I still can't see the
exact problem.
Is there a Kafka limit on the number of concurrently open files? I ask because
the process itself was not limited...
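
For a sense of scale, the producer counts given in answer 8 can be multiplied out and compared with a descriptor limit. This is back-of-the-envelope arithmetic only. It assumes one broker connection per producer thread and, as a worst case, that all of them land on the same broker. The other_open_files and nofile_limit values below are made-up placeholders, not measurements from this cluster.

```python
def expected_producer_connections(machines, procs_per_machine, threads_per_proc):
    """One producer, and so at least one broker connection, per thread."""
    return machines * procs_per_machine * threads_per_proc


def exceeds_limit(connections, other_open_files, nofile_limit):
    """True if the broker would need more descriptors than the limit allows."""
    return connections + other_open_files > nofile_limit


if __name__ == "__main__":
    # 6 machines * 84 procs * 24 threads, as stated in answer 8.
    connections = expected_producer_connections(6, 84, 24)
    print(connections)  # 12096
    # Placeholder numbers: 2000 other descriptors (log segments, index files,
    # replication sockets) against a hypothetical limit of 4096.
    print(exceeds_limit(connections, other_open_files=2000, nofile_limit=4096))
```

The broker also holds descriptors for log segment and index files, so the socket count alone understates what it needs.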

Thanks,
Shlomi

On Tue, Sep 9, 2014 at 7:12 AM, Jun Rao <ju...@gmail.com> wrote:

> What type of error did you see? You may need to configure a larger open
> file handle limit.
>
> Thanks,
>
> Jun

Re: Error in acceptor (kafka.network.Acceptor)

Posted by Jun Rao <ju...@gmail.com>.

What type of error did you see? You may need to configure a larger open
file handle limit.

Thanks,

Jun

On Wed, Sep 3, 2014 at 12:01 PM, Shlomi Hazan <hz...@gmail.com> wrote:

> Hi,
>
> I am trying to load a cluster with more than 10K connections, and ran
> into the error in the subject.
> Is there any limit on Kafka's side? If so, is it configurable, and how?
> At first glance, it looks like the selector accepting the connections is
> overflowing...
>
> Thanks.
> --
> Shlomi
>