Posted to dev@cassandra.apache.org by Jake Luciani <ja...@gmail.com> on 2011/01/17 18:12:54 UTC

Avro in the Cassandra core

Hi,

I'd like to discuss if/when we should be using Avro or any serialization tool in
the Cassandra core.

Some context: We have begun the process of removing Avro from the service
layer (CASSANDRA-926). We currently use Avro for schema migrations internally,
and we have two open items that use Avro for our internal file
storage: CASSANDRA-1472 and CASSANDRA-674.

My opinion is we need to control the lowest layers of the code and not rely
on a third-party library.  A third-party library like Avro becomes a black
box that we need to deeply understand and work around.  Also, since Avro is
developed separately, we have another core dependency that could disrupt
releases (say, a bug in Avro).

The limitation of a generic serialization tool is that it takes the most
general approach, which may not be the best when you can optimize
based on the specifics of your data. Examples: block-based
compression, auto-boxing of primitives, code generation.

Now, there may in fact be ways of doing everything we want in Avro.  And I'm
sure this mail will cause a lot of opinions to be voiced, but the thing I
want everyone to keep in mind is we *ALL* would need to be willing to become
experts in Avro to allow us to hack in and around it.  If we don't, we end up
with a disjointed codebase.

Thanks,
-Jake

Re: Avro in the Cassandra core

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Jan 17, 2011 at 11:12 AM, Jake Luciani <ja...@gmail.com> wrote:
> My opinion is we need to control the lowest layers of the code and not rely
> on a third party library.

+1.  I think that a large part of what we want to do at the bits-on-disk
level is a poor fit for the kind of serialized objects-with-fields
that Avro is best at.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Avro in the Cassandra core

Posted by Stephen Connolly <st...@gmail.com>.
On 18 January 2011 03:01, Eric Evans <ee...@rackspace.com> wrote:
> On Mon, 2011-01-17 at 12:12 -0500, Jake Luciani wrote:
>> Some context: We have begun the process of removing Avro from the
>> service layer CASSANDRA-926. We currently use Avro for schema
>> migrations internally, and we have two open items that are using Avro
>> for our internal file storage. CASSANDRA-1472 and CASSANDRA-674.
>
> FWIW, this should be done (removing the RPC interface).  Anything missed
> is deserving of a bug report
>
>> My opinion is we need to control the lowest layers of the code and not
>> rely on a third party library.  By using a third party library like
>> Avro, it becomes a black box that we need to deeply understand and
>> work around. Also, since Avro is developed separately we have another
>> core dependency that could disrupt releases (say a bug in Avro).
>
> +1
>
> The Avro RPC interface was an experiment, and it was always the case
> that if it didn't supplant the Thrift interface as status quo, that it'd
> be removed.  However, as I remember it, part of the justification for
> using it in migrations was that it was already there.  In hindsight that
> was probably a mistake.
>
> Anyway, we have too many dependencies as it is, I'd rather move toward

+1

I'd also be in favour of limiting the scope of dependencies so that
they are not _everywhere_

-Stephen

> eliminating it entirely unless there is a very compelling reason not to
> (I don't think there is).
>
> --
> Eric Evans
> eevans@rackspace.com
>
>

Re: Avro in the Cassandra core

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Jan 17, 2011 at 9:01 PM, Eric Evans <ee...@rackspace.com> wrote:

> > My opinion is we need to control the lowest layers of the code and not
> > rely on a third party library.  By using a third party library like
> > Avro, it becomes a black box that we need to deeply understand and
> > work around. Also, since Avro is developed separately we have another
> > core dependency that could disrupt releases (say a bug in Avro).
>
> +1
>

+1


> Anyway, we have too many dependencies as it is, I'd rather move toward
> eliminating it entirely unless there is a very compelling reason not to
> (I don't think there is).


+1.  It seems that once we introduced the now removed avro interface, it
started infecting everything else 'because it was there.'

-Brandon

Re: Avro in the Cassandra core

Posted by Stu Hood <st...@gmail.com>.
It's impossible to argue against this kind of consensus, so I won't.

Can we please turn this discussion to something productive? Namely: now that
we've eliminated the top (IMO) choice for a standard serialization framework
within Cassandra, which option _will_ we choose? Continuing down our current
path of custom serialization is completely untenable without a massive
overhaul:

1. It is not backwards compatible
2. It does not use compact encodings of integer data
3. It is verbose: serialization code and the java.io package are present in
every object
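
To make point 2 concrete, here is a minimal sketch of a compact integer
encoding: a ZigZag transform followed by a base-128 variable-length
encoding, which is the general scheme Avro uses for its int and long types.
The class and method names (VarLongSketch, writeVarLong, readVarLong) are
hypothetical and are not taken from the Cassandra or Avro code; this is an
illustration under those assumptions, not a proposed implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only: not Cassandra's or Avro's actual code.
public class VarLongSketch {

    // ZigZag maps signed values to unsigned ones so that small negative
    // numbers also become small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    public static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    public static long unZigZag(long n) {
        return (n >>> 1) ^ -(n & 1);
    }

    // Writes 7 bits per byte, lowest bits first; the high bit of each byte
    // signals that another byte follows.
    public static void writeVarLong(long value, ByteArrayOutputStream out) {
        long v = zigZag(value);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    public static long readVarLong(ByteArrayInputStream in) {
        long result = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                throw new IllegalArgumentException("truncated varint");
            }
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
            if (shift > 63) {
                throw new IllegalArgumentException("varint too long");
            }
        }
        return unZigZag(result);
    }

    public static void main(String[] args) {
        long[] samples = {0L, -1L, 300L, Long.MIN_VALUE};
        for (long s : samples) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeVarLong(s, out);
            long back = readVarLong(new ByteArrayInputStream(out.toByteArray()));
            System.out.println(s + " -> " + out.size() + " byte(s) instead of "
                    + Long.BYTES + ", decoded as " + back);
        }
    }
}
```

The trade-off is that small values shrink to one or two bytes while the
largest magnitudes grow to ten, so the scheme pays off only when most
stored integers are small.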

I would have preferred Avro, but I'd rather use Thrift than no framework at
all.


On Tue, Jan 18, 2011 at 7:49 AM, Jeremy Hanna <je...@gmail.com> wrote:

>
> On Jan 17, 2011, at 9:01 PM, Eric Evans wrote:
>
> > On Mon, 2011-01-17 at 12:12 -0500, Jake Luciani wrote:
> >> Some context: We have begun the process of removing Avro from the
> >> service layer CASSANDRA-926. We currently use Avro for schema
> >> migrations internally, and we have two open items that are using Avro
> >> for our internal file storage. CASSANDRA-1472 and CASSANDRA-674.
> >
> > FWIW, this should be done (removing the RPC interface).  Anything missed
> > is deserving of a bug report
> >
> >> My opinion is we need to control the lowest layers of the code and not
> >> rely on a third party library.  By using a third party library like
> >> Avro, it becomes a black box that we need to deeply understand and
> >> work around. Also, since Avro is developed separately we have another
> >> core dependency that could disrupt releases (say a bug in Avro).
> >
> > +1
> >
> > The Avro RPC interface was an experiment, and it was always the case
> > that if it didn't supplant the Thrift interface as status quo, that it'd
> > be removed.  However, as I remember it, part of the justification for
> > using it in migrations was that it was already there.  In hindsight that
> > was probably a mistake.
> >
> > Anyway, we have too many dependencies as it is, I'd rather move toward
> > eliminating it entirely unless there is a very compelling reason not to
> > (I don't think there is).
>
> Just saw Stu's comment on CASSANDRA-1472:
> "I think that implementing a compressible block based file format is a
> non-trivial task, and that before we commit to re-implementing Avro's (in a
> bounded timeframe especially), we should review our requirements. This
> decision needs to be made for technical reasons and not grounded in NIH."
>
> Is that a compelling reason to use avro for some of the internals or is
> that something that someone is willing to implement/maintain?  Just trying
> to bridge discussion here with comments on the ticket.
>
> >
> > --
> > Eric Evans
> > eevans@rackspace.com
> >
>
>

Re: Avro in the Cassandra core

Posted by Jeremy Hanna <je...@gmail.com>.
On Jan 17, 2011, at 9:01 PM, Eric Evans wrote:

> On Mon, 2011-01-17 at 12:12 -0500, Jake Luciani wrote:
>> Some context: We have begun the process of removing Avro from the
>> service layer CASSANDRA-926. We currently use Avro for schema
>> migrations internally, and we have two open items that are using Avro
>> for our internal file storage. CASSANDRA-1472 and CASSANDRA-674.
> 
> FWIW, this should be done (removing the RPC interface).  Anything missed
> is deserving of a bug report
> 
>> My opinion is we need to control the lowest layers of the code and not
>> rely on a third party library.  By using a third party library like
>> Avro, it becomes a black box that we need to deeply understand and
>> work around. Also, since Avro is developed separately we have another
>> core dependency that could disrupt releases (say a bug in Avro).
> 
> +1
> 
> The Avro RPC interface was an experiment, and it was always the case
> that if it didn't supplant the Thrift interface as status quo, that it'd
> be removed.  However, as I remember it, part of the justification for
> using it in migrations was that it was already there.  In hindsight that
> was probably a mistake.
> 
> Anyway, we have too many dependencies as it is, I'd rather move toward
> eliminating it entirely unless there is a very compelling reason not to
> (I don't think there is).

Just saw Stu's comment on CASSANDRA-1472:
"I think that implementing a compressible block based file format is a non-trivial task, and that before we commit to re-implementing Avro's (in a bounded timeframe especially), we should review our requirements. This decision needs to be made for technical reasons and not grounded in NIH."

Is that a compelling reason to use avro for some of the internals or is that something that someone is willing to implement/maintain?  Just trying to bridge discussion here with comments on the ticket.

> 
> -- 
> Eric Evans
> eevans@rackspace.com
> 


Re: Avro in the Cassandra core

Posted by Eric Evans <ee...@rackspace.com>.
On Mon, 2011-01-17 at 12:12 -0500, Jake Luciani wrote:
> Some context: We have begun the process of removing Avro from the
> service layer CASSANDRA-926. We currently use Avro for schema
> migrations internally, and we have two open items that are using Avro
> for our internal file storage. CASSANDRA-1472 and CASSANDRA-674.

FWIW, this should be done (removing the RPC interface).  Anything missed
is deserving of a bug report

> My opinion is we need to control the lowest layers of the code and not
> rely on a third party library.  By using a third party library like
> Avro, it becomes a black box that we need to deeply understand and
> work around. Also, since Avro is developed separately we have another
> core dependency that could disrupt releases (say a bug in Avro).

+1

The Avro RPC interface was an experiment, and it was always the case
that if it didn't supplant the Thrift interface as status quo, that it'd
be removed.  However, as I remember it, part of the justification for
using it in migrations was that it was already there.  In hindsight that
was probably a mistake.

Anyway, we have too many dependencies as it is, I'd rather move toward
eliminating it entirely unless there is a very compelling reason not to
(I don't think there is).

-- 
Eric Evans
eevans@rackspace.com


Re: Avro in the Cassandra core

Posted by Gary Dusbabek <gd...@gmail.com>.
On Mon, Jan 17, 2011 at 11:12, Jake Luciani <ja...@gmail.com> wrote:
> Hi,
>
> I'd like to discuss if/when we should be using Avro or any serialization tool in
> the Cassandra core.
>
> Some context: We have begun the process of removing Avro from the service
> layer CASSANDRA-926. We currently use Avro for schema migrations internally,
> and we have two open items that are using Avro for our internal file
> storage. CASSANDRA-1472 and CASSANDRA-674.
>
> My opinion is we need to control the lowest layers of the code and not rely
> on a third party library.  By using a third party library like Avro, it
> becomes a black box that we need to deeply understand and work around.

+1. We need to control serialization in many cases so that we can
provide interoperability in the face of radically changing the way we
store bytes.  It happens often enough that it is a valid concern.
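
As a rough illustration of what controlling serialization by hand can look
like, the sketch below writes an explicit format version ahead of the
payload and lets the reader branch on it, so that bytes written in an older
layout remain readable after the layout changes. Everything here
(VersionedRecordSketch, Record, the two version constants, the field
layout) is a hypothetical example and is not taken from the Cassandra
codebase.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only: not Cassandra's actual on-disk format.
public class VersionedRecordSketch {
    public static final int VERSION_1 = 1; // layout: name
    public static final int VERSION_2 = 2; // layout: name, timestamp

    public static final class Record {
        public final String name;
        public final long timestamp;

        public Record(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
        }
    }

    // Writes the record in the requested layout, prefixed by its version.
    public static byte[] write(Record r, int version) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(version);
            out.writeUTF(r.name);
            if (version >= VERSION_2) {
                out.writeLong(r.timestamp);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads either layout; fields missing from older data get a default.
    public static Record read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int version = in.readInt();
            if (version < VERSION_1 || version > VERSION_2) {
                throw new IOException("unsupported version " + version);
            }
            String name = in.readUTF();
            long timestamp = version >= VERSION_2 ? in.readLong() : 0L;
            return new Record(name, timestamp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Record original = new Record("Standard1", 42L);
        Record fromOld = read(write(original, VERSION_1));
        Record fromNew = read(write(original, VERSION_2));
        System.out.println(fromOld.name + " " + fromOld.timestamp);
        System.out.println(fromNew.name + " " + fromNew.timestamp);
    }
}
```

The cost of this approach is that every layout change needs an explicit
branch in the reader, which is the maintenance burden a serialization
library is meant to absorb.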

>
> Now, there may in fact be ways of doing everything we want in Avro.  And I'm
> sure this mail will cause a lot of opinions to be voiced, but the thing I
> want everyone to keep in mind is we *ALL* would need to be willing to become
> experts in Avro to allow us to hack in and around it.  If we don't we end up
> with a disjointed codebase.

I think serialization will have to be evaluated on a per-ticket basis.
In some cases, it might make sense to hand it off to a library.  As
for standardizing on a particular lib for serialization--I prefer the
promise of avro serialization to thrift (avro is a tad more flexible),
but we already use thrift--maybe we should just standardize on it.  My
experience so far with using avro for migrations serialization
indicates that we are either using avro inappropriately, or it just
doesn't deliver on the promise of deserializing data with slightly
different schemas.  If you want to see first-hand what I'm talking
about, copy system tables from an 0.7 cluster into a trunk config and
watch the breakage.

Gary.