Posted to dev@avro.apache.org by Douglas Creager <dc...@dcreager.net> on 2010/10/14 03:45:23 UTC

C library

Quick question about the C library.  It seems like a lot of the code
implements some scaffolding that we could get for free from a library
like glib.  Looking through the svn history, it looks like you've
already taken out some dependencies on apr and apr-util, so I'm guessing
you're trying to limit external dependencies, right?  Is that just a
licensing question?  Or is it to simplify the build process?

cheers
–doug


Re: C library

Posted by Douglas Creager <dc...@dcreager.net>.
> Please continue this discussion on the list since that's what it's
> for.

Will do

> I think it would be great if we could add support for generated code
> to avro-c. I've been itching lately to do some C programming.
> Cloudera is having a Hackathon in about a week so maybe I could
> dedicate some cycles then to help.

Generated code certainly sounds useful, but I don't know if it will help
my particular problem.  In my case, I'm adding Avro support to an
existing application, which already has quite a few custom C structs
that it's aggregating data into.  With the current implementation, I
have to copy this data into a tree of avro_datum_t instances before
writing the data out to an Avro file.  Codegen would probably make that
a bit easier, but there would still be a set of (now automatically
generated) Avro-specific structs that I'd have to copy into.  What I'm
looking for / working on is a different approach, where I provide a set
of callbacks that tell the Avro file writer how to extract the correct
values directly out of my pre-existing, non-Avro-specific struct.  My
hope is that this will be (a) just as easy to code, and (b) faster,
especially when multiplied by tens of millions of rows.
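
A minimal sketch of that callback idea, under stated assumptions: every
name here (flow_record, field_accessor, write_row) is hypothetical and none
of it is part of avro-c. Each callback pulls one value out of the
application's own struct, and the writer encodes it with Avro's binary
encoding for longs (zigzag, then a base-128 varint), so no avro_datum_t
tree is built along the way. Only long-valued fields are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-existing application struct; it knows nothing about Avro. */
struct flow_record {
    int64_t start_time;
    int64_t packets;
};

/* A callback that extracts one long-valued field from an application row. */
typedef int64_t (*get_long_fn)(const void *row);

/* Hypothetical field descriptor: one entry per field, in schema order. */
struct field_accessor {
    const char *name;
    get_long_fn get_long;
};

int64_t get_start_time(const void *row)
{
    return ((const struct flow_record *) row)->start_time;
}

int64_t get_packets(const void *row)
{
    return ((const struct flow_record *) row)->packets;
}

/* Avro binary encoding of a long: zigzag mapping, then a base-128 varint. */
static size_t put_long(uint8_t *buf, int64_t value)
{
    uint64_t n = ((uint64_t) value << 1) ^ (value < 0 ? UINT64_MAX : 0);
    size_t len = 0;
    while (n > 0x7f) {
        buf[len++] = (uint8_t) ((n & 0x7f) | 0x80);
        n >>= 7;
    }
    buf[len++] = (uint8_t) n;
    return len;
}

/* Walks the accessor table and encodes each field directly into buf.
 * Returns the number of bytes written, or 0 if buf might be too small. */
size_t write_row(uint8_t *buf, size_t cap,
                 const struct field_accessor *fields, size_t nfields,
                 const void *row)
{
    size_t used = 0;
    assert(buf != NULL && fields != NULL && row != NULL);
    for (size_t i = 0; i < nfields; i++) {
        if (cap - used < 10) {  /* an encoded long needs at most 10 bytes */
            return 0;
        }
        used += put_long(buf + used, fields[i].get_long(row));
    }
    return used;
}
```

The accessor table is built once per struct type; per row, the only work
is one indirect call and one varint encode per field.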

–doug


Re: C library

Posted by Matt Massie <ma...@cloudera.com>.
Please continue this discussion on the list since that's what it's for.  I
think it would be great if we could add support for generated code to avro-c.
I've been itching lately to do some C programming.  Cloudera is having a
Hackathon in about a week so maybe I could dedicate some cycles then to
help.

-- 
Matt



On Wed, Oct 13, 2010 at 8:58 PM, Bruce Mitchener
<br...@gmail.com> wrote:

> On Thu, Oct 14, 2010 at 10:49 AM, Douglas Creager <dcreager@dcreager.net>
> wrote:
>
> > > Not to me. :)  I'm assuming that you mean something that uses GValue
> > > and so on?
> >
> > Ah, whoops.  No, I'm not suggesting GValue.  *shudder*
> >
>
> *whew*
>
>
> > I was thinking more like using:
> >
> >  • GObject for the schema/datum subclassing
> >  • GHashTable or GTree to store a record schema's fields, etc.
> >  • GIO for the generic I/O interfaces
> >  • GQuark instead of the atom implementation that was checked in and
> >   then reverted
>
>
> Okay, I see ... but that can't happen within the Apache implementation due
> to licensing issues.  (It also doesn't work for my usages because it isn't
> clear that LGPL code can be shipped at all legally on some of my target
> platforms.)
>
>
> > > I don't want the overhead of that sort of thing at all in my C code.
> > > I'm supporting resource constrained platforms, so I just want to go from
> > > my C struct straight to a buffer without building an intermediate data
> > > structure.
> >
> > We're in violent agreement.  One thing I've started experimenting with
> > is a “streaming” API, so that instead of creating a tree of avro_datum_t
> > instances, the file reader calls a series of callback functions as each
> > bit of data is encountered.  We're generating Avro files from an
> > existing C network sensor application, and it's a bit of overhead (in
> > both code and speed) to have to move between our actual data types and
> > the avro_datum_t instances.
> >
>
> Okay, then we're talking about similar things.  But you can also just
> generate code and then you don't need schemas or anything else at runtime,
> no?
>
> What I'm doing is just a low level API that I can use from generated code.
> I don't need (or want) schemas or anything else in the way.
>
> Maybe we should talk more off-list.
>
>  - Bruce
>

Re: C library

Posted by Bruce Mitchener <br...@gmail.com>.
On Thu, Oct 14, 2010 at 10:49 AM, Douglas Creager <dc...@dcreager.net> wrote:

> > Not to me. :)  I'm assuming that you mean something that uses GValue
> > and so on?
>
> Ah, whoops.  No, I'm not suggesting GValue.  *shudder*
>

*whew*


> I was thinking more like using:
>
>  • GObject for the schema/datum subclassing
>  • GHashTable or GTree to store a record schema's fields, etc.
>  • GIO for the generic I/O interfaces
>  • GQuark instead of the atom implementation that was checked in and
>   then reverted


Okay, I see ... but that can't happen within the Apache implementation due
to licensing issues.  (It also doesn't work for my usages because it isn't
clear that LGPL code can be shipped at all legally on some of my target
platforms.)


> > I don't want the overhead of that sort of thing at all in my C code.
> > I'm supporting resource constrained platforms, so I just want to go from
> > my C struct straight to a buffer without building an intermediate data
> > structure.
>
> We're in violent agreement.  One thing I've started experimenting with
> is a “streaming” API, so that instead of creating a tree of avro_datum_t
> instances, the file reader calls a series of callback functions as each
> bit of data is encountered.  We're generating Avro files from an
> existing C network sensor application, and it's a bit of overhead (in
> both code and speed) to have to move between our actual data types and
> the avro_datum_t instances.
>

Okay, then we're talking about similar things.  But you can also just
generate code and then you don't need schemas or anything else at runtime,
no?

What I'm doing is just a low level API that I can use from generated code. I
don't need (or want) schemas or anything else in the way.

Maybe we should talk more off-list.

 - Bruce

Re: C library

Posted by Douglas Creager <dc...@dcreager.net>.
> Not to me. :)  I'm assuming that you mean something that uses GValue and so
> on?

Ah, whoops.  No, I'm not suggesting GValue.  *shudder*

I was thinking more like using:

 • GObject for the schema/datum subclassing
 • GHashTable or GTree to store a record schema's fields, etc.
 • GIO for the generic I/O interfaces
 • GQuark instead of the atom implementation that was checked in and
   then reverted

> I don't want the overhead of that sort of thing at all in my C code.  I'm
> supporting resource constrained platforms, so I just want to go from my C
> struct straight to a buffer without building an intermediate data structure.

We're in violent agreement.  One thing I've started experimenting with
is a “streaming” API, so that instead of creating a tree of avro_datum_t
instances, the file reader calls a series of callback functions as each
bit of data is encountered.  We're generating Avro files from an
existing C network sensor application, and it's a bit of overhead (in
both code and speed) to have to move between our actual data types and
the avro_datum_t instances.
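
A rough sketch of what such a streaming reader could look like, with the
caveat that all names (stream_callbacks, stream_record, fill_reading) are
made up for illustration and are not an existing avro-c API. The reader
decodes each Avro long (base-128 varint, then zigzag) and hands it to a
callback right away, and the callback stores it straight into the
application's struct. Only long-valued fields are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application struct that the decoded values end up in. */
struct sensor_reading {
    int64_t timestamp;
    int64_t port;
};

/* Hypothetical callback table: on_long is called once per decoded value. */
struct stream_callbacks {
    void (*on_long)(void *user, size_t field_index, int64_t value);
};

/* Example handler: stores each value straight into the application struct. */
void fill_reading(void *user, size_t field_index, int64_t value)
{
    struct sensor_reading *r = user;
    if (field_index == 0) {
        r->timestamp = value;
    } else if (field_index == 1) {
        r->port = value;
    }
}

/* Decodes one Avro long from buf.
 * Returns the number of bytes consumed, or 0 on truncated/overlong input. */
static size_t get_long(const uint8_t *buf, size_t len, int64_t *out)
{
    uint64_t n = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        n |= (uint64_t) (buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {
            /* Undo the zigzag mapping. */
            *out = (int64_t) ((n >> 1) ^ (0 - (n & 1)));
            return i + 1;
        }
    }
    return 0;
}

/* Reads nfields longs from buf, firing the callback as each one is decoded.
 * Returns 0 on success, -1 if the input is malformed. */
int stream_record(const uint8_t *buf, size_t len, size_t nfields,
                  const struct stream_callbacks *cb, void *user)
{
    size_t pos = 0;
    for (size_t field = 0; field < nfields; field++) {
        int64_t value;
        size_t used = get_long(buf + pos, len - pos, &value);
        if (used == 0) {
            return -1;
        }
        cb->on_long(user, field, value);
        pos += used;
    }
    return 0;
}
```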

–doug


Re: C library

Posted by Bruce Mitchener <br...@gmail.com>.
On Thu, Oct 14, 2010 at 9:43 AM, Douglas Creager <dc...@dcreager.net> wrote:

> > I didn't write the existing C library, but I've used it and done some
> > work on it.  I'm currently writing my own more minimal and more
> > streamlined implementation of Avro in C ...
> >
> > The issues with glib specifically would be:
> >
> >
> >    - The license is not acceptable for use here. (LGPL)
> >    - It is much bigger than what is needed here.
> >    - Many of the things that make it more general would also make it
> >    slower than necessary. The existing C code isn't a speed demon either,
> >    but the C implementation should aim for solid performance.
>
> Ha!  Well you're certainly right that glib's not small.  Are you sure
> about the speed claims, though?  Would it be worth banging out a LGPL,
> glib-based prototype to do some initial tests?
>

Not to me. :)  I'm assuming that you mean something that uses GValue and so
on?

I don't want the overhead of that sort of thing at all in my C code.  I'm
supporting resource constrained platforms, so I just want to go from my C
struct straight to a buffer without building an intermediate data structure.
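
To make "struct straight to a buffer" concrete, here is a small sketch. The
struct and function names (log_event, encode_log_event) are hypothetical
and not taken from any existing implementation. It relies only on the Avro
binary encoding rules: a long is a zigzag varint, a string is a long length
followed by that many UTF-8 bytes, and a record is just its fields encoded
in declaration order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application struct, standing in for real product data. */
struct log_event {
    int64_t timestamp;
    const char *host;   /* NUL-terminated UTF-8 */
};

/* Avro binary encoding of a long: zigzag mapping, then a base-128 varint. */
static size_t put_long(uint8_t *buf, int64_t value)
{
    uint64_t n = ((uint64_t) value << 1) ^ (value < 0 ? UINT64_MAX : 0);
    size_t len = 0;
    while (n > 0x7f) {
        buf[len++] = (uint8_t) ((n & 0x7f) | 0x80);
        n >>= 7;
    }
    buf[len++] = (uint8_t) n;
    return len;
}

/* Encodes the struct as a two-field Avro record (long, string) directly
 * into buf.  Returns bytes written, or 0 if buf might be too small. */
size_t encode_log_event(uint8_t *buf, size_t cap, const struct log_event *ev)
{
    size_t host_len = strlen(ev->host);
    size_t used = 0;

    /* Worst case: two 10-byte varints plus the string bytes. */
    if (cap < 20 || cap - 20 < host_len) {
        return 0;
    }
    used += put_long(buf + used, ev->timestamp);
    used += put_long(buf + used, (int64_t) host_len);
    memcpy(buf + used, ev->host, host_len);
    used += host_len;
    return used;
}
```

No schema object, allocation, or intermediate tree is involved; the only
state is the output buffer.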

> Along those lines, you mention a new C implementation you're working on.
> Is that something that you plan to fold back into the main libavro?  Or
> will it be separate?  The spec provides a good basis for defining how
> well different implementations interoperate, but so far it seems like
> everything has been folded into the single, Apache-sponsored project.
> Is there interest in having independent implementations?
>

I'm not sure what will happen with my implementation yet. I'm inclined to
say that it'll be opened up (Apache 2 licensed) but it will depend on the
quality of the code with respect to use outside of my product and other
factors.  More likely is that what I'm doing is going to serve as a test bed
for ideas and an implementation approach that can be merged into the Apache
Avro C implementation in the future.

As part of my own implementation of Avro in C, I'm also working on a binary
RPC protocol for talking with Cloudera Flume, so I have a bit more
motivation to get it opened up ...

 - Bruce

Re: C library

Posted by Douglas Creager <dc...@dcreager.net>.
> I didn't write the existing C library, but I've used it and done some work
> on it.  I'm currently writing my own more minimal and more streamlined
> implementation of Avro in C ...
> 
> The issues with glib specifically would be:
> 
> 
>    - The license is not acceptable for use here. (LGPL)
>    - It is much bigger than what is needed here.
>    - Many of the things that make it more general would also make it slower
>    than necessary. The existing C code isn't a speed demon either, but the C
>    implementation should aim for solid performance.

Ha!  Well you're certainly right that glib's not small.  Are you sure
about the speed claims, though?  Would it be worth banging out a LGPL,
glib-based prototype to do some initial tests?

Along those lines, you mention a new C implementation you're working on.
 Is that something that you plan to fold back into the main libavro?  Or
will it be separate?  The spec provides a good basis for defining how
well different implementations interoperate, but so far it seems like
everything has been folded into the single, Apache-sponsored project.
Is there interest in having independent implementations?

–doug


Re: C library

Posted by Bruce Mitchener <br...@gmail.com>.
Hi Doug,

I didn't write the existing C library, but I've used it and done some work
on it.  I'm currently writing my own more minimal and more streamlined
implementation of Avro in C ...

The issues with glib specifically would be:


   - The license is not acceptable for use here. (LGPL)
   - It is much bigger than what is needed here.
   - Many of the things that make it more general would also make it slower
   than necessary. The existing C code isn't a speed demon either, but the C
   implementation should aim for solid performance.


 - Bruce

On Thu, Oct 14, 2010 at 8:45 AM, Douglas Creager <dc...@dcreager.net> wrote:

> Quick question about the C library.  It seems like a lot of the code
> implements some scaffolding that we could get for free from a library
> like glib.  Looking through the svn history, it looks like you've
> already taken out some dependencies on apr and apr-util, so I'm guessing
> you're trying to limit external dependencies, right?  Is that just a
> licensing question?  Or is it to simplify the build process?
>
> cheers
> –doug
>
>