Posted to dev@kafka.apache.org by Arvind Kandhare <sw...@gmail.com> on 2016/06/18 23:46:26 UTC

[DISCUSS] KIP-64 -Allow underlying distributed filesystem to take over replication depending on configuration

Hi,
Let's use this thread to discuss the above mentioned KIP.

Here is the motivation for it:
"Distributed data stores can be vastly improved by integrating with Kafka.
Some of these improvements are:

   1. They can participate easily in the whole Kafka ecosystem
   2. Data ingesting speeds can be improved

Distributed data stores come with their own replication, so Kafka
replication is a duplication of functionality for them. Kafka should defer
replication to the underlying file system if the configuration mandates it.

With the newly added configuration, a flush to the filesystem should be
considered a signal that the message is replicated."
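
To make the proposal a bit more concrete, here is a minimal sketch of what the
broker configuration could look like. The replication.delegate.to.filesystem
property is a hypothetical placeholder name, not an agreed-upon design;
log.dirs and log.flush.interval.messages are existing broker settings:

    # log.dirs points at a mount backed by a replicated distributed filesystem
    log.dirs=/mnt/dfs/kafka-logs

    # hypothetical switch: when true, the broker skips its own replication and
    # treats a successful flush to log.dirs as "this message is replicated"
    replication.delegate.to.filesystem=true

    # flush eagerly, since the flush is now the durability/replication signal
    log.flush.interval.messages=1

In such a deployment, topics would effectively run with a replication factor
of 1 from Kafka's point of view, and durability would rest entirely on the
mounted filesystem.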

Do let me know your views on this.


Thanks and regards,

Arvind

Re: [DISCUSS] KIP-64 -Allow underlying distributed filesystem to take over replication depending on configuration

Posted by Arvind Kandhare <sw...@gmail.com>.
Hi Gwen,
Thanks for the response. I understand all the concerns you have raised, but I
disagree with some of your comments/conclusions. Let me try to answer them in
the order you raised them (I definitely get that the issues you raised later
are more important).

   1. Extent of changes required: I have spent some time looking at the
   Kafka core storage and replication code. I am confident that the changes
   will be minimal. It is definitely not a rip-out of the whole Kafka
   implementation. My strong intuition is that the diff will be in the
   ballpark of 100 lines, including everything :)
      1. No changes in the core log/protocol behavior. The logdirs will be
      mounted over NFS or SMB. All the Kafka code works as it works now
      (except replication), with all the goodness of moving data from the
      filesystem cache to sockets etc.
      2. Replication: if the deployer chooses so, the replication code is a
      NOP (see the sketch after this list). Here again one of the interesting
      tricks used for replication comes in handy: replicas are nothing but
      special consumers. Seen from that point of view, replication is a
      critical and important, but not strictly necessary, part of the
      deployment.
   2. Benefit to the Kafka community/users:
      1. Dependence on local storage of nodes: right now some of the
      limitations of Kafka deployments exist because of limitations of local
      storage (e.g. the size of a partition is limited by the size of a disk).
      2. In this configuration a Kafka deployment is less sticky to the
      nodes, so a node failure can be handled easily.
      3. Distributed storage solutions can participate easily in higher-level
      analytics frameworks working on top of Kafka, which is a win-win in
      my opinion.
   3. "Claiming that speeds can be improved is pretty easy :) Are
   you talking about ingest to Kafka?" - I am claiming that the ingestion
   speed into the underlying stores can be improved, and that they also get
   to participate in the Kafka ecosystem. I am not claiming that my changes
   will improve Kafka speeds/latency/throughput.
   4. Why should the Kafka community do it: I do not have a strong
   justification, other than being a good FOSS player. I think the individual
   stores can/should maintain their own version of Kafka which does this. I do
   not officially represent any such distributed store solution here. This is
   strictly scratching my personal itch. I was studying Kafka and I thought
   this may help the community and new entrants. If the community does not
   think so, I can make my own changes, make them public on GitHub, and stop
   at that.
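
To illustrate the replication point above, here is a toy sketch (plain Java,
not actual Kafka code) of where the proposed switch would sit conceptually.
The class name and the delegateToFilesystem flag are made up; the only point
is that, with delegation enabled, a durable local flush stands in for replica
acknowledgments:

    import java.io.IOException;
    import java.nio.channels.FileChannel;

    // Conceptual sketch only -- not Kafka's real classes or call paths.
    public class CommitCheckSketch {

        private final boolean delegateToFilesystem; // hypothetical config flag

        public CommitCheckSketch(boolean delegateToFilesystem) {
            this.delegateToFilesystem = delegateToFilesystem;
        }

        // Returns true once the message at the given offset may be exposed to
        // consumers and acknowledged to the producer.
        public boolean isCommitted(FileChannel segment, long offset,
                                   long isrHighWatermark) throws IOException {
            if (delegateToFilesystem) {
                // The distributed filesystem replicates on write, so a
                // successful fsync of the segment counts as "replicated".
                segment.force(true);
                return true;
            }
            // Normal Kafka behaviour: only offsets strictly below the ISR
            // high watermark are committed.
            return offset < isrHighWatermark;
        }
    }

Everything else (log format, protocol, consumers moving data straight from the
filesystem cache to sockets) stays exactly as it is today, which is why I
expect the diff to be small.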

Hope I have answered all your questions. Do let me know whether you have
further feedback/questions.

Thanks and regards,
Arvind

On Sun, Jun 19, 2016 at 9:30 PM Gwen Shapira <gw...@confluent.io> wrote:

> Hi Arvind,
>
> Thank you for proposing this KIP.
>
> I am not sure how much experience you have in modifying Kafka's core
> module, so I don't know if you are aware of how deeply the storage and
> replication layers are integrated within Kafka. There is no clean API
> to rip out; this KIP will essentially require a re-write of most of
> Kafka, which is obviously something that carries huge risk (and a huge
> amount of work).
>
> For something that has so much risk and so much effort involved, I
> feel that the justification in the KIP is lacking.
>
> For instance, you say:
> "Distributed data stores can be vastly improved by integrating with
> Kafka. Some of these improvements are:
> * They can participate easily in the whole Kafka ecosystem
> * Data ingesting speeds can be improved"
>
> Things that are not clear to me are:
>
> 1) Why should the Kafka community rewrite Kafka in order to improve
> distributed data stores? Shouldn't the community for each data store
> make the effort to improve their application? Where is the benefit to
> Kafka users?
>
> 2) Can you detail in which ways distributed data stores are unable to
> participate in the Kafka ecosystem now? In which ways do they want to
> participate?
>
> 3) Claiming that speeds can be improved is pretty easy :) Are you
> talking about ingest to Kafka? or from Kafka to another store? What is
> the current ingest rate? What is the current bottleneck? Where do you
> expect the speed improvement to come from? Are you talking about
> latency or throughput?
>
> Once we all agree that there is indeed a problem, we can discuss your
> proposed solution :)
>
> Personally, I feel that Kafka is a distributed data store (with
> log/queue semantics) and therefore cannot and should not delegate core
> data store responsibilities to an external system. Kafka users have come
> to expect very strong reliability, consistency, and durability guarantees
> from Kafka, and very clear replication semantics, and we must be very,
> very careful not to compromise or put those at risk, especially
> without very clear benefits to Kafka users.
>
> Thanks,
>
> Gwen Shapira
>
>
>
>
> On Sat, Jun 18, 2016 at 4:46 PM, Arvind Kandhare <sw...@gmail.com>
> wrote:
> > Hi,
> > Let's use this thread to discuss the above mentioned KIP.
> >
> > Here is the motivation for it:
> > "Distributed data stores can be vastly improved by integrating with
> Kafka.
> > Some of these improvements are:
> >
> >    1. They can participate easily in the whole Kafka ecosystem
> >    2. Data ingesting speeds can be improved
> >
> > Distributed data stores come with their own replication, so Kafka
> > replication is a duplication of functionality for them. Kafka should defer
> > replication to the underlying file system if the configuration mandates it.
> >
> > With the newly added configuration, a flush to the filesystem should be
> > considered a signal that the message is replicated."
> >
> > Do let me know your views on this.
> >
> >
> > Thanks and regards,
> >
> > Arvind
>

Re: [DISCUSS] KIP-64 -Allow underlying distributed filesystem to take over replication depending on configuration

Posted by Gwen Shapira <gw...@confluent.io>.
Hi Arvind,

Thank you for proposing this KIP.

I am not sure how much experience you have in modifying Kafka's core
module, so I don't know if you are aware of how deeply the storage and
replication layers are integrated within Kafka. There is no clean API
to rip out; this KIP will essentially require a re-write of most of
Kafka, which is obviously something that carries huge risk (and a huge
amount of work).

For something that has so much risk and so much effort involved, I
feel that the justification in the KIP is lacking.

For instance, you say:
"Distributed data stores can be vastly improved by integrating with
Kafka. Some of these improvements are:
* They can participate easily in the whole Kafka ecosystem
* Data ingesting speeds can be improved"

Things that are not clear to me are:

1) Why should the Kafka community rewrite Kafka in order to improve
distributed data stores? Shouldn't the community for each data store
make the effort to improve their application? Where is the benefit to
Kafka users?

2) Can you detail in which ways distributed data stores are unable to
participate in the Kafka ecosystem now? In which ways do they want to
participate?

3) Claiming that speeds can be improved is pretty easy :) Are you
talking about ingest to Kafka? or from Kafka to another store? What is
the current ingest rate? What is the current bottleneck? Where do you
expect the speed improvement to come from? Are you talking about
latency or throughput?

Once we all agree that there is indeed a problem, we can discuss your
proposed solution :)

Personally, I feel that Kafka is a distributed data store (with
log/queue semantics) and therefore cannot and should not delegate core
data store responsibilities to an external system. Kafka users have come to
expect very strong reliability, consistency, and durability guarantees
from Kafka, and very clear replication semantics, and we must be very,
very careful not to compromise or put those at risk, especially
without very clear benefits to Kafka users.

Thanks,

Gwen Shapira




On Sat, Jun 18, 2016 at 4:46 PM, Arvind Kandhare <sw...@gmail.com> wrote:
> Hi,
> Let's use this thread to discuss the above mentioned KIP.
>
> Here is the motivation for it:
> "Distributed data stores can be vastly improved by integrating with Kafka.
> Some of these improvements are:
>
>    1. They can participate easily in the whole Kafka ecosystem
>    2. Data ingesting speeds can be improved
>
> Distributed data stores come with their own replication, so Kafka
> replication is a duplication of functionality for them. Kafka should defer
> replication to the underlying file system if the configuration mandates it.
>
> With the newly added configuration, a flush to the filesystem should be
> considered a signal that the message is replicated."
>
> Do let me know your views on this.
>
>
> Thanks and regards,
>
> Arvind