Posted to users@kafka.apache.org by Ashutosh Singh <as...@gmail.com> on 2012/04/27 22:22:00 UTC

Failure Guarantee and Expectations

Folks,
I read Jun's paper, but I could not find enough detail on all the possible
failure scenarios. I am trying to use Kafka (or something else) for
persistent queues. A big requirement for me is not to lose any persisted
messages. I am ready to contribute code to add some sort of replication,
but first I want to know where failures can happen.
1- What happens if zookeeper goes down and comes back up? What messages,
if any, do we lose, and what compensation do we need to do on the consumer
side, if any?
2- What happens when the broker goes down?
    a- when the hard drive has a failure.
    b- data is correctly written to disk but the process goes down and is
restarted.
    c- What happens to consumers in the meantime?
    d- If we replicate the data, what reliability guarantees can we have?
    e- If CRC errors happen, can we pick up the record from another copy
saved somewhere?

There is deep interest in this project within my group, and if it fits our
needs we would like to run with it, both as users and as contributors.

Another question: why was the queue not built on Cassandra? Would that have
met sub-second latency SLAs?

Ashutosh Singh

Re: Failure Guarantee and Expectations

Posted by Neha Narkhede <ne...@gmail.com>.
Ashutosh,

Good to hear about your interest in using and contributing to Kafka !

Please find some of the answers to your questions inline -

>> 1- What happens if zookeeper goes down and comes back up? What messages,
if any, do we lose, and what compensation do we need to do on the consumer
side, if any?

The zookeeper clients in the broker and consumer will get disconnected from
zookeeper. If the zookeeper cluster comes back up within the session
timeout, all sessions will be restored. If it comes back after that, the
sessions will expire and new sessions will be established. In either case,
existing data on the brokers will not be lost; the consumers will just
receive it late.
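
To make the two cases concrete, here is a minimal sketch of how a
ZooKeeper client tells a transient disconnect apart from a session
expiration, using the plain ZooKeeper Java API. The class and the
handling are illustrative, not Kafka's actual code:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionWatcher implements Watcher {
        public void process(WatchedEvent event) {
            switch (event.getState()) {
                case Disconnected:
                    // Transient: the session is still alive on the server.
                    // If ZooKeeper returns within the session timeout, the
                    // client library reconnects and nothing is lost.
                    break;
                case Expired:
                    // ZooKeeper was down longer than the session timeout.
                    // Ephemeral nodes are gone; the client must create a new
                    // ZooKeeper handle and re-register its state.
                    break;
                case SyncConnected:
                    // (Re)connected within the session timeout; nothing to redo.
                    break;
                default:
                    break;
            }
        }

        public static void main(String[] args) throws Exception {
            // 30s session timeout: outages shorter than this are invisible
            // apart from a delay in message delivery.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new SessionWatcher());
            Thread.sleep(Long.MAX_VALUE); // keep the session open for observation
        }
    }

Either way, the broker's log on disk is untouched; only the coordination
state has to be re-established.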

>> 2- What happens when the broker goes down?
>>    a- when the hard drive has a failure.

Without KAFKA-50, data will be lost. With KAFKA-50, the probability of data
loss will be significantly reduced, unless there are multiple correlated
failures.

>>    b- data is correctly written to disk but the process goes down and is
restarted.

No data will be lost in this case.
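
The reason a clean restart loses nothing is that the broker can
re-validate the tail of its log when it comes back up and truncate any
partially written entry. Here is a simplified sketch of that kind of
recovery scan, assuming an illustrative [length][CRC32][payload] entry
layout; this shows the idea, not Kafka's actual on-disk format:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.zip.CRC32;

    public class LogRecovery {
        // Scan entries from the start and truncate the log after the last
        // entry whose checksum verifies; a torn write at the tail is dropped.
        public static void recover(String path) throws IOException {
            RandomAccessFile log = new RandomAccessFile(path, "rw");
            long validEnd = 0;
            while (validEnd + 8 <= log.length()) {
                log.seek(validEnd);
                int length = log.readInt();
                long storedCrc = log.readInt() & 0xFFFFFFFFL;
                if (length < 0 || validEnd + 8 + length > log.length()) {
                    break; // truncated entry at the tail
                }
                byte[] payload = new byte[length];
                log.readFully(payload);
                CRC32 crc = new CRC32();
                crc.update(payload);
                if (crc.getValue() != storedCrc) {
                    break; // corrupt entry; everything before it is intact
                }
                validEnd += 8 + length;
            }
            log.setLength(validEnd); // drop the invalid tail, keep the rest
            log.close();
        }
    }

The broker's actual recovery works per log segment, but the principle is
the same: whatever reached the disk intact survives a process restart.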

>>    c- What happens to consumers in the meantime?

If the brokers or zookeeper are restarted, the consumers will simply
receive the data once the cluster is back up. No data will be lost.

>>  d- If we replicate the data, what reliability guarantees can we have?

KAFKA-50 will add both sync as well as async replication support in Kafka.
For more details, see this -
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication
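
As a rough illustration of what that sync/async choice looks like to a
producer, here is a hedged sketch using the acknowledgement setting
described in the replication proposal. The property names come from the
0.8-era wiki and the broker list is hypothetical, so verify both against
the release you run:

    import java.util.Properties;

    public class ProducerDurability {
        public static Properties durableProducerConfig() {
            Properties props = new Properties();
            // Hypothetical broker list; property names follow the
            // replication proposal, so check them against your version.
            props.put("broker.list", "broker1:9092,broker2:9092");
            //  0 -> no acknowledgement (lowest latency, weakest guarantee)
            //  1 -> ack after the leader's local write (async replication)
            // -1 -> ack after all in-sync replicas have the write (sync)
            props.put("request.required.acks", "-1");
            return props;
        }
    }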

>>   e- If CRC errors happen, can we pick up the record from another copy
saved somewhere?

Without KAFKA-50, data might be lost. With KAFKA-50, the record will be
served from the remaining replicas.
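
The detection itself is straightforward, since each Kafka message carries
a CRC32 checksum. A minimal sketch of the validation step follows;
recovering from another replica on failure is exactly what KAFKA-50 adds,
and without it a failed check just means the message is gone:

    import java.util.zip.CRC32;

    public class MessageCheck {
        // Recompute the payload checksum and compare it with the CRC32
        // stored in the message header to detect corruption.
        public static boolean isValid(long storedCrc, byte[] payload) {
            CRC32 crc = new CRC32();
            crc.update(payload);
            return crc.getValue() == storedCrc;
        }
    }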

>> why was queue not built on Cassandra?

I think if you read the design document (
http://incubator.apache.org/kafka/design.html), most design choices will be
easier to understand. Let us know if you have more questions after that.

Thanks,
Neha



On Fri, Apr 27, 2012 at 1:22 PM, Ashutosh Singh <as...@gmail.com> wrote:

> Folks,
> I read Jun's paper, but I could not find enough detail on all the possible
> failure scenarios. I am trying to use Kafka (or something else) for
> persistent queues. A big requirement for me is not to lose any persisted
> messages. I am ready to contribute code to add some sort of replication,
> but first I want to know where failures can happen.
> 1- What happens if zookeeper goes down and comes back up? What messages,
> if any, do we lose, and what compensation do we need to do on the consumer
> side, if any?
> 2- What happens when the broker goes down?
>    a- when the hard drive has a failure.
>    b- data is correctly written to disk but the process goes down and is
> restarted.
>    c- What happens to consumers in the meantime?
>    d- If we replicate the data, what reliability guarantees can we have?
>    e- If CRC errors happen, can we pick up the record from another copy
> saved somewhere?
>
> There is deep interest in this project within my group, and if it fits our
> needs we would like to run with it, both as users and as contributors.
>
> Another question: why was the queue not built on Cassandra? Would that have
> met sub-second latency SLAs?
>
> Ashutosh Singh
>
