Posted to dev@kafka.apache.org by Jay Kreps <ja...@gmail.com> on 2015/02/08 16:26:46 UTC

org.apache.common migration

Hey all,

Someone asked about why there is code duplication between org.apache.common
and core. The answer seemed like it might be useful to others, so including
it here:

Originally Kafka was more of a proof of concept and we didn't separate the
clients from the server. LinkedIn was much smaller and it wasn't open
source, and keeping those separate always adds a lot of overhead. So we
ended up with just one big jar.

Next thing we know the kafka jar is embedded everywhere. Lots of fallout
from that:
- It has to be really sensitive to dependencies
- Scala causes all kinds of pain for users. Ironically it causes the most
pain for people using scala because of compatibility. I think the single
biggest Kafka complaint was the scala clients and resulting scary
exceptions, lack of javadoc, etc.
- Many of the client interfaces weren't well thought out as permanent
long-term commitments.
- We knew we had to rewrite both clients due to technical deficiencies
anyway. The clients really needed to move to non-blocking I/O, which is
basically a rewrite on its own.

So how to go about that?

Well we felt we needed to maintain the old client interfaces for a good
period of time. Any kind of breaking cut-over was kind of a non-starter.
But a major refactoring in place was really hard since so many classes were
public and so little attention had been paid to the difference between
public and private classes.

Naturally since the client and server do the inverse of each other there is
a ton of shared logic. So we thought we needed to break it up into three
independent chunks:
1. common - shared helper code used by both clients and server
2. clients - the producer, consumer, and eventually admin java interfaces.
This depends on common.
3. server - the server (and legacy clients). This is currently called core.
This will depend on common and clients (because sometimes the server needs
to make client requests)

Common and clients were left as a single jar and just logically separate so
that people wouldn't have to deal with two jars (and hence the possibility
of getting different versions of each).
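
A minimal sketch of that layering in Gradle terms (illustrative only — the module and configuration names here are assumed, not the actual build):

```groovy
// Illustrative only: how the dependency arrows point.
// common + clients ship as a single jar; core (the server) depends on them.
project(':clients') {
    // producer, consumer, eventually admin interfaces,
    // plus the shared common helper code
}

project(':core') {
    dependencies {
        compile project(':clients')  // server -> clients, never the reverse
    }
}
```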

The dependency is actually a little counter-intuitive to people--they
usually think of the client as depending on the server since the client
calls the server. But in terms of code dependencies it is the other way--if
you depend on the client you obviously don't want to drag in the server.

So to get all this done we decided to just go big and do a rewrite of the
clients in Java. A result of this is that any shared code would have to
move to Java (so the clients don't pull in Scala). We felt this was
probably a good thing in its own right as it gave a chance to improve a few
of these utility libraries like config parsing, etc.

So the plan was and is:
1. Rewrite producer, release and roll out
2a. Rewrite consumer, release and roll out
2b. Migrate server from scala code to org.apache.common classes
3. Deprecate scala clients

(2a) is in flight now, and that means (2b) is totally up for grabs. Of
these the request conversion is definitely the most pressing since having
those defined twice duplicates a ton of work. We will have to be
hyper-conscientious during the conversion about making the shared code in
common really solve the problem well and conveniently on the server as well
(so we don't end up just shoe-horning it in). My hope is that we can treat
this common code really well--it isn't as permanent as the public classes
but ends up heavily used so we should take good care of it. Most of the shared
code is private so we can refactor the stuff in common to meet the needs of
the server if we find mismatches or missing functionality. I tried to keep
in mind the eventual server usage while writing it, but I doubt it will be
as trivial as just deleting the old and adding the new.

In terms of the simplicity:
- Converting exceptions should be trivial
- Converting utils is straightforward, but we should evaluate the
individual utilities and see if they actually make sense, have tests, are
used, etc.
- Converting the requests may not be too complex but touches a huge hunk of
code and may require some effort to decouple the network layer.
- Converting the network code will be delicate and may require some changes
in org.apache.common.network to meet the server's needs

This is all a lot of work, but if we stick to it at the end we will have
really nice clients and a nice modular code base. :-)

Cheers,

-Jay

Re: org.apache.common migration

Posted by Jay Kreps <ja...@gmail.com>.
Hey Jun,

I think the existing scala clients should just remain as they are. There is
no point updating them, and as you say it would be quite fragile. The
conversion to the new requests would just be for the server usage.

-Jay


Re: org.apache.common migration

Posted by Jun Rao <ju...@confluent.io>.
We need to be a bit careful when doing 2b. Currently, our public APIs
include SimpleConsumer, which unfortunately exposes our RPC
requests/responses. Doing 2b would mean API changes to SimpleConsumer. So,
if we want to do 2b before 3, we would need to agree on making such API
changes. Otherwise, 2b will need to be done after 3.

Thanks,

Jun


Re: org.apache.common migration

Posted by Joe Stein <jo...@stealth.ly>.
Argh, I just realized that the new producer and consumer have already mostly
removed that dependency, so it wouldn't be in common but just something for
the broker. Maybe a 0.9/1.0 item to crack into later this year.


Re: org.apache.common migration

Posted by Joe Stein <jo...@stealth.ly>.
Jay,

Can we add another package (or two) to org.apache.kafka.common for metadata
and consensus? We can call them something else, but the idea would be to
have one common layer for metadata (right now we put the JSON into
ZooKeeper) and one common layer for asynchronous watches (where we wait for
ZooKeeper to call us). It would be great to have code we can wrap around
zkclient (or Curator) that insulates us from the different options growing
in both of those areas.

For both the metadata code and the async watches, we would be able to run
any class we load in that supports the expected interface. The async watch
interface can take a callback as input, and when the watcher fires
(regardless of whether it comes from etcd or ZooKeeper) the code gets the
response it expected and needed. We should also expose a function that
returns a future from the watcher.
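
A rough Java sketch of the callback-plus-future shape described above. To be clear, every name here is hypothetical and invented for illustration — this is not a proposed Kafka API, just a minimal in-memory stand-in for what a pluggable ZooKeeper/etcd binding might implement:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical store-agnostic watch interface (illustration only). */
interface AsyncWatcher {
    /** Register a callback fired whenever the value at `path` changes. */
    void watch(String path, Consumer<byte[]> callback);

    /** One-shot variant: a future completed on the next change at `path`. */
    CompletableFuture<byte[]> watchOnce(String path);
}

/** Minimal in-memory implementation, standing in for a real binding. */
class InMemoryWatcher implements AsyncWatcher {
    private final Map<String, List<Consumer<byte[]>>> watches = new ConcurrentHashMap<>();

    @Override
    public void watch(String path, Consumer<byte[]> callback) {
        watches.computeIfAbsent(path, p -> new CopyOnWriteArrayList<>()).add(callback);
    }

    @Override
    public CompletableFuture<byte[]> watchOnce(String path) {
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        watch(path, future::complete);  // completing twice is a harmless no-op
        return future;
    }

    /** Called by the backend binding when the store reports a change. */
    void fire(String path, byte[] value) {
        watches.getOrDefault(path, List.of()).forEach(cb -> cb.accept(value));
    }
}
```

The point of the sketch is only the shape: callers code against the interface, and whether `fire` is driven by a ZooKeeper watch or an etcd watch is the binding's business.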

This may mean a little more work if we wanted to take the JSON and turn it
into a byte structure ... or do we just keep the JSON and keep making it
describable and self-documenting?

For the metadata information, I think that is separate because that data
(outside of Kafka) already resides in other systems like databases and/or
caches. Folks may opt to switch just the metadata out, reducing the burden
on ZooKeeper to only doing the asynchronous watches. Some folks may want to
swap both out.

These two layers could also just be 2-3 more files in utils.

- Joestein


Re: org.apache.common migration

Posted by Gwen Shapira <gs...@cloudera.com>.
Thanks for the background.

I picked the Network classes portion of it, since I was already looking at
how to refactor send/receive and friends to support extending with TLS and
SASL. Having to do this in just one place will be really nice :)

Gwen
