Posted to dev@samza.apache.org by Louis Calisi <lc...@tripadvisor.com> on 2015/12/18 13:15:26 UTC

Replication Issue with Samza / Kafka

Hi:

I’ve been a quiet fan of Kafka / Samza for about a year now. The project I’ve been working on for TripAdvisor is about to soft launch soon. Lately we’ve noticed replication errors coming from Kafka (0.8.2.0). Samza is the exclusive user of this Kafka cluster, aside from a small application that feeds it data. The version of Samza is 0.9.1.

It feels like we’re stumbling on the same issue described here: https://issues.apache.org/jira/browse/KAFKA-2143. I attempted to apply the patch, but it was written for a version of Kafka greater than the one we’re using (and the one recommended by Samza).

Here’s an example:

2015-12-18 02:31:35,573 ERROR kafka.server.ReplicaManager: [Replica Manager on Broker 230]: Error when processing fetch request for partition [Stream_<label>_Prod,9] offset 78214693 from consumer with correlation id 114263. Possible cause: Request for offset 78214693 but we only have log segments in the range 79642852 to 128218996.

We don’t see any errors in the Samza logs till much later.

2015-12-18 04:56:35 ERROR consumer.ConsumerIterator:97 - consumed offset: 76451107 doesn't match fetch offset: 76741109 for Stream_<label>_Prod:6: fetched offset = 76741629: consumed offset = 76451107; Consumer may lose data
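For reference, the earliest/latest offsets the broker still has for a partition can be queried directly and compared against the offsets in these errors. Below is a minimal sketch against the 0.8.2 SimpleConsumer API; the broker host/port, client id, and topic/partition are placeholders, and the offset request has to be sent to the partition’s leader.

    import java.util.Collections;
    import java.util.Map;

    import kafka.api.PartitionOffsetRequestInfo;
    import kafka.common.TopicAndPartition;
    import kafka.javaapi.OffsetResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class OffsetRangeCheck {

        public static void main(String[] args) {
            // Placeholders: broker host/port must be the leader for the partition being checked.
            String topic = "Stream_<label>_Prod";   // redacted topic name from the errors above
            int partition = 9;
            SimpleConsumer consumer =
                    new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "offset-range-check");
            try {
                long earliest = fetchOffset(consumer, topic, partition, kafka.api.OffsetRequest.EarliestTime());
                long latest = fetchOffset(consumer, topic, partition, kafka.api.OffsetRequest.LatestTime());
                // Compare this range with the offset named in the fetch error.
                System.out.println(topic + "-" + partition + " available offsets: " + earliest + " .. " + latest);
            } finally {
                consumer.close();
            }
        }

        // Issues an OffsetRequest for the given time (EarliestTime or LatestTime) and returns the offset.
        private static long fetchOffset(SimpleConsumer consumer, String topic, int partition, long time) {
            TopicAndPartition tp = new TopicAndPartition(topic, partition);
            Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo =
                    Collections.singletonMap(tp, new PartitionOffsetRequestInfo(time, 1));
            kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
                    requestInfo, kafka.api.OffsetRequest.CurrentVersion(), "offset-range-check");
            OffsetResponse response = consumer.getOffsetsBefore(request);
            if (response.hasError()) {
                throw new RuntimeException("Offset request failed, error code: " + response.errorCode(topic, partition));
            }
            return response.offsets(topic, partition)[0];
        }
    }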

We have 3 Kafka nodes and 5 Samza nodes. Topics have a replication factor of 2. There are 60 topics in total; 10 of them carry the bulk of the data.

I’m charting Kafka’s under-replicated partitions metric, and it shows at most 4 partitions under replicated. This happens briefly when we’re flushing out stale data.
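That number is exposed per broker as the kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions JMX gauge; a minimal spot-check over JMX (broker host and JMX port below are placeholders) looks roughly like this:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class UnderReplicatedCheck {

        public static void main(String[] args) throws Exception {
            // Placeholders: point this at one broker's JMX host/port.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Gauge of partitions led by this broker that are missing in-sync replicas.
                ObjectName gauge = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                System.out.println("UnderReplicatedPartitions = " + conn.getAttribute(gauge, "Value"));
            } finally {
                connector.close();
            }
        }
    }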

Ironically, the replication errors seem to happen when the system is under low activity. Our application is very bursty; our front-end servers roll logs every hour. Yesterday, for example, we replayed old logs at full speed for 8 hours without issues. Once we caught up and the application went into “burst mode”, we started to see the errors.

Has anyone seen issues like this? I checked the Samza mailing list for the past 3 months and didn’t see anything similar. I’m hoping it’s just the way we configured Kafka. Our settings are below. Thanks in advance for any help.

Regards, Louis Calisi


auto.create.topics.enable=true
auto.leader.rebalance.enable=true
broker.id=230
controlled.shutdown.enable=true
default.replication.factor=2
delete.topic.enable=true
jmx_port=XXXX
kafka.http.metrics.host=0.0.0.0
kafka.http.metrics.port=XXXXX
kafka.log4j.dir=/var/log/kafka
log.dirs=/data/2/kafka/data,/data/3/kafka/data,/data/4/kafka/data,/data/5/kafka/data,/data/6/kafka/data,/data/7/kafka/data,/data/8/kafka/data,/data/9/kafka/data,/data/10/kafka/data,/data/11/kafka/data,/data/12/kafka/data
log.retention.bytes=-1
log.retention.hours=12
log.roll.hours=24
log.segment.bytes=1073741824
max.connections.per.ip=400
message.max.bytes=1000000
min.insync.replicas=2
num.partitions=1
port=XXXX
replica.fetch.max.bytes=1048576
replica.lag.max.messages=4000
unclean.leader.election.enable=false
zookeeper.session.timeout.ms=30000
num.io.threads=12
num.network.threads=8
log.cleaner.enable=true
log.cleaner.threads=3
offsets.storage=kafka
dual.commit.enabled=false
zookeeper.connect=…..
kafka.metrics.reporters=nl.techop.kafka.KafkaHttpMetricsReporter