Posted to user@avro.apache.org by John Lilley <jo...@redpoint.net> on 2014/08/05 22:58:05 UTC

State of the C++ vs Java implementations

Greetings,

I would like to read and write Avro files (such as those manipulated by MapReduce applications) from a C++ program.  While there are higher-level wrappers (such as Hive), I am interested in reading/writing the files directly.  There are both C++ and Java library implementations; however, in the C++ API README I see "And the file and rpc containers are not yet implemented."  Does this mean that I can't read and write Avro files using the C++ library?

We have a very good C++/JNI wrapper generator, so using the Java library is not terribly difficult.  Given that, which interface would you recommend?  Does the C++ interface (assuming it works) have significant performance advantages?

Thanks
john

RE: State of the C++ vs Java implementations

Posted by Steve Roehrs <St...@rlmgroup.com.au>.
Fair enough John.  It all depends on your use case.


As for HDFS paths - I wouldn't know - we do all our work under Linux.

 

Regards,

 

Steve Roehrs

Senior Software Engineer | Lockheed Martin

 

| p: +61 8 7389 4525    | m: +61 4 3891 5622     | f: +61 8 7389 4551

| w: www.rlmgroup.com.au | e: Steve.Roehrs@rlmgroup.com.au

| Company address: 82-86 Woomera Ave, Edinburgh, SA 5111

This email and any attachment to it remains the property of Lockheed
Martin and is intended only to be read or used by the named addressee.
It may contain information that is confidential, commercially valuable
or subject to legal privilege.  If you receive this email in error,
please immediately delete it and notify the sender.  Opinions,
conclusions and other information in this message that do not relate to
the official business of Lockheed Martin or any companies within
Lockheed Martin shall be understood as neither given nor endorsed by
them.

________________________________

From: John Lilley [mailto:john.lilley@redpoint.net] 
Sent: Monday, August 18, 2014 10:23 PM
To: Steve Roehrs; user@avro.apache.org
Subject: RE: State of the C++ vs Java implementations

 



Re: State of the C++ vs Java implementations

Posted by svante karlsson <sa...@csi.se>.
> Is the Avro C++ community active?

Not very, but somewhat later in a project that I'm working on I'll need
the Kafka / Avro / HDFS path (in C++) as well.  I'm willing to contribute.

Earlier this spring I did some work on an HTTP server & client that does
REST calls with an Avro-encoded payload.  The code is on GitHub; it uses CMake
and supports Ubuntu, Raspberry Pi and Visual Studio.

https://github.com/bitbouncer/csi-http

> Is there currently any Windows (Visual C++ project) support?

Not officially - but you can use the CMake files from the GitHub project as
a base for that.

/svante

RE: State of the C++ vs Java implementations

Posted by John Lilley <jo...@redpoint.net>.
Thanks Steve!

As for this approach: "As for the compression - as Doug has already answered, C++ only supports null (no) codec and deflate. You can always use the Avro java tools 'recodec' command to convert from an unsupported codec to deflate if you need to. "

It's not an option for us.  We want to read Avro files natively, as quickly as possible.  Also, it doesn't look like these tools read/write HDFS paths, do they?

john


From: Steve Roehrs [mailto:Steve.Roehrs@rlmgroup.com.au]
Sent: Sunday, August 17, 2014 6:41 PM
To: John Lilley; user@avro.apache.org
Subject: RE: State of the C++ vs Java implementations


RE: State of the C++ vs Java implementations

Posted by Steve Roehrs <St...@rlmgroup.com.au>.
Hi John.

 

Sorry for the late reply; I was off work ill for a few days.

 

The idea of reading the schema from the file and then processing
it without knowing the structure beforehand is the main use case for
GenericDatum.  The inefficiencies relate to the way that GenericDatum
handles arrays.  In our use case, most of the data consists of large
floating-point arrays, or multi-dimensional (nested) arrays.
GenericDatum stores these using an STL vector.

 

So if you were using C++ structures, you may have a float[1000] - but if
you use GenericDatum you get a vector<GenericDatum> where each
GenericDatum contains a single float.  Of course this is the most
flexible way of implementing it, and it works - but it uses considerably
more memory.  Profiling a read of one of our data structures showed that
more than 50% of the time was spent in malloc()/free()!

 

If your data doesn't have lots of large numeric arrays then by all
means, GenericDatum should work reasonably well for you. 

 

As for the compression - as Doug has already answered, C++ only supports
null (no) codec and deflate. You can always use the Avro java tools
'recodec' command to convert from an unsupported codec to deflate if you
need to.   

 

The groundwork for codec support in C++ is already there - it should be
quite easy to add additional codecs now. Most of the work would be in
getting the makefile/library stuff right.
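On the write side the codec is just a constructor argument in the C++ API. A minimal sketch, assuming the Avro C++ library with a zlib-enabled build is installed; the filename and schema here are placeholders:

```cpp
#include <avro/Compiler.hh>
#include <avro/DataFile.hh>
#include <avro/Generic.hh>

// Sketch: write a container file compressed with deflate. The fourth
// DataFileWriter constructor argument selects the codec (NULL_CODEC or
// DEFLATE_CODEC in the C++ library at the time of this thread).
int main() {
    avro::ValidSchema schema = avro::compileJsonSchemaFromString(
        R"({"type":"record","name":"r","fields":[{"name":"x","type":"float"}]})");

    avro::DataFileWriter<avro::GenericDatum> writer(
        "out.avro", schema, /*syncInterval=*/16 * 1024, avro::DEFLATE_CODEC);

    avro::GenericDatum datum(schema);
    datum.value<avro::GenericRecord>().fieldAt(0).value<float>() = 1.5f;
    writer.write(datum);
    writer.close();
    return 0;
}
```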

 

Regards,

 

Steve Roehrs

Senior Software Engineer | Lockheed Martin

 


________________________________

From: John Lilley [mailto:john.lilley@redpoint.net] 
Sent: Friday, August 15, 2014 1:16 AM
To: user@avro.apache.org; Steve Roehrs
Subject: RE: State of the C++ vs Java implementations

 



RE: State of the C++ vs Java implementations

Posted by John Lilley <jo...@redpoint.net>.
Thanks Doug.

If we go the C++ route (as opposed to a JNI wrapper) we’ll need support for all of the compressors that could be in common use in the field, so that’s probably all of them.  It looks like snappy, bzip, and xz are all available as native C/C++ libraries.  If we go that route, I don’t have a problem working on the C++ integration, but we have no experience with the mechanics of contributing to Apache open-source projects and would need some help getting that done right.

This leads me to a few questions:

- Is the Avro C++ community active?

- Is there currently any Windows (Visual C++ project) support?

- With regards to these codec libraries, would we add compile-time switches in Avro C++ to control support for the additional codecs?

- Should it be assumed that the packages are installed on the system in a standard place?

Thanks,
John

From: Doug Cutting [mailto:cutting@apache.org]
Sent: Friday, August 15, 2014 11:29 AM
To: user@avro.apache.org
Subject: Re: State of the C++ vs Java implementations



Re: State of the C++ vs Java implementations

Posted by Doug Cutting <cu...@apache.org>.
On Thu, Aug 14, 2014 at 1:03 PM, John Lilley <jo...@redpoint.net>
wrote:

> Do you know where I can find a list of codecs supported in Java vs C++?


Grepping the Avro C++ headers, it seems to support just the null codec and
deflate.  These are the two codecs that every implementation is meant to
support.

http://avro.apache.org/docs/current/spec.html#Required+Codecs

It would be wonderful if someone contributed Snappy support to C++, but
that's not yet happened.

Java additionally supports snappy, bzip2 and xz.

http://avro.apache.org/docs/current/api/java/org/apache/avro/file/CodecFactory.html

Doug

RE: State of the C++ vs Java implementations

Posted by John Lilley <jo...@redpoint.net>.
Thanks!  Do you know where I can find a list of codecs supported in Java vs C++?
--john

From: Doug Cutting [mailto:cutting@apache.org]
Sent: Thursday, August 14, 2014 1:44 PM
To: user@avro.apache.org
Subject: Re: State of the C++ vs Java implementations



Re: State of the C++ vs Java implementations

Posted by Doug Cutting <cu...@apache.org>.
On Thu, Aug 14, 2014 at 11:56 AM, John Lilley <jo...@redpoint.net>
wrote:

> I’m seeing discussion of a new Decimal encoding in the mailing list, and
> it would be bad for us to commit to the C++ Avro, and then find that our
> customers have created Avro files (using Java, MapReduce, etc) that we
> can’t read.  We don’t have control over what files we encounter, and it is
> desirable for our product to read whatever a customer throws at it, within
> reason.
>
>
Except for compression codecs, all implementations should be able to read
all data files written by other implementations.  The Avro schema language
has not changed incompatibly since 1.0.  Additions such as Decimal are
back-compatible.  Implementations that have no knowledge of Decimal schemas
can still process data that contains Decimals, but will see them as
byte arrays with schema attributes.
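For example, a decimal field is declared in the schema as ordinary bytes plus logical-type attributes (per the Avro specification's logical-type convention), which is why a reader that ignores logicalType still sees well-formed bytes:

```json
{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 9,
  "scale": 2
}
```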

Doug

RE: State of the C++ vs Java implementations

Posted by John Lilley <jo...@redpoint.net>.
Does the C++ implementation track the Java development closely?  I'm seeing discussion of a new Decimal encoding in the mailing list, and it would be bad for us to commit to the C++ Avro, and then find that our customers have created Avro files (using Java, MapReduce, etc) that we can't read.  We don't have control over what files we encounter, and it is desirable for our product to read whatever a customer throws at it, within reason.
Thanks,
John


From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, August 14, 2014 9:46 AM
To: user@avro.apache.org; Steve.Roehrs@rlmgroup.com.au
Subject: RE: State of the C++ vs Java implementations


RE: State of the C++ vs Java implementations

Posted by John Lilley <jo...@redpoint.net>.
Steve,

Thanks so much for the reply.  I hope that I can inconvenience you for a little more guidance.  We want to read and write Avro data files whose schema is not known until run-time, when we read the file metadata and transform that into our own internal record structure.  So we are not mapping to a C++ struct/class with defined compile-time members.  We just want to loop over the records and columns in the data file, transforming them serially.  Can this be done without incurring the performance penalty of GenericDatum that you speak of?

Different question: do you know if the full complement of compression codecs is available in C++?  We don't need "everything possible", but we want to be able to read 99.9% of files that we are likely to encounter in practice.

Thanks
John


From: Steve Roehrs [mailto:Steve.Roehrs@rlmgroup.com.au]
Sent: Sunday, August 10, 2014 11:25 PM
To: user@avro.apache.org
Subject: RE: State of the C++ vs Java implementations


RE: State of the C++ vs Java implementations

Posted by Steve Roehrs <St...@rlmgroup.com.au>.
Hi John

 

You can definitely read and write Avro data files using C++.  The
DataFileWriter and DataFileReader classes are what you need.
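As a rough illustration of that generic read path - a minimal sketch, assuming the Avro C++ library is installed and using a placeholder filename; the class and method names follow the C++ API headers (DataFile.hh, Generic.hh), so check them against your library version:

```cpp
#include <avro/DataFile.hh>
#include <avro/Generic.hh>
#include <iostream>

// Sketch: iterate over the records of an Avro container file whose
// schema is only known at run time. "data.avro" is a placeholder path.
int main() {
    avro::DataFileReader<avro::GenericDatum> reader("data.avro");

    // The writer schema comes from the file's own metadata.
    avro::GenericDatum datum(reader.dataSchema());

    while (reader.read(datum)) {
        if (datum.type() == avro::AVRO_RECORD) {
            const avro::GenericRecord& rec = datum.value<avro::GenericRecord>();
            std::cout << rec.fieldCount() << " fields\n";
        }
    }
    return 0;
}
```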

 

The README is severely out of date.

 

I can't comment on the relative performance of the Java/C++ API's - we
used the C++ API for our application, but for performance reasons we
don't use the GenericDatum class, as it does have poor performance for
our particular mix of data.  I don't know if the Java API fares any
better in this regard.

 

Regards,

 

Steve Roehrs

Senior Software Engineer | Lockheed Martin

 


________________________________

From: John Lilley [mailto:john.lilley@redpoint.net] 
Sent: Wednesday, August 06, 2014 6:28 AM
To: user@avro.apache.org
Subject: State of the C++ vs Java implementations

 
