Posted to user@avro.apache.org by mark <ma...@googlemail.com> on 2015/08/10 10:57:08 UTC

Avro divides large JSON schema string into parts - is this intentional?

I am using Avro v1.7.7 in development, and Avro version 1.7.4 on my Hadoop
cluster.

I have a fairly large .avdl schema - a record with about 100 fields. When
running locally under test there were no problems with this schema;
everything serialized and deserialized without issue.

When running on Hadoop, however, I was getting this error:

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.avro.Schema$Parser.parse(Ljava/lang/String;[Ljava/lang/String;)

The reason was that the JSON schema embedded in the compiled Java class was
being broken into two parts:

public class SomeType extends org.apache.avro.specific.SpecificRecordBase
    implements org.apache.avro.specific.SpecificRecord {
  public static final org.apache.avro.Schema SCHEMA$ = new
      org.apache.avro.Schema.Parser().parse("long schema string part1",
      "long schema string part2");

Now, version 1.7.7 has this method signature:

public Schema parse(String s, String... more)

So the split schema string works fine locally, but version 1.7.4 does not
have this overload, hence the exception when running the compiled classes
on Hadoop.
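
To illustrate the difference, here is a minimal sketch of both calls; the
varargs form is roughly what the generated class now does (ParseCheck and
the tiny toy schema are just stand-ins for my real code):

import org.apache.avro.Schema;

public class ParseCheck {
  public static void main(String[] args) {
    // Single-argument parse: present in both 1.7.4 and 1.7.7.
    Schema one = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"SomeType\",\"fields\":[]}");

    // Varargs parse: the parts are concatenated before parsing, but the
    // method only exists from 1.7.5 on, so code compiled against 1.7.7
    // fails on a 1.7.4 cluster with NoSuchMethodError.
    Schema two = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"SomeType\",",
        "\"fields\":[]}");

    System.out.println(one.equals(two));
  }
}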

Is this intentional or a bug?
If intentional, what are the rules determining when Avro breaks up a schema
string?
Where is this behaviour documented?
Why does it do it at all?

Thanks

avro RAM usage

Posted by marius <m....@googlemail.com>.
Hey,

I am currently doing some performance tests for my BSc thesis and I
wondered how exactly Avro files are parsed when they are read. From my
understanding the data is read from the file block by block (rather than
datum by datum) and then the datums are deserialized. Is this correct
(which would mean that Avro's memory usage depends on the block size
rather than on the size of each datum), or does it depend on the
implementation used?
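
For context, a minimal sketch of the kind of read loop I mean (generic API,
reusing the record object; ReadLoop is a made-up name) - my assumption being
that the reader hands out one datum at a time while buffering a whole block
underneath:

import java.io.FileInputStream;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadLoop {
  public static void main(String[] args) throws Exception {
    try (FileInputStream in = new FileInputStream(args[0]);
         DataFileStream<GenericRecord> stream =
             new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
      GenericRecord reuse = null;
      while (stream.hasNext()) {
        // next(reuse) hands back one datum; the block it came from has
        // already been read and buffered by the stream.
        reuse = stream.next(reuse);
        // ... measure / process the datum here ...
      }
    }
  }
}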

My second question is whether there is a way to read the file datum by
datum. I want to create an index that stores byte offsets into the Avro
file so I can e.g. seek() to that position and deserialize the following
datum. Is this even possible, or can I only start at positions marked by
a sync marker?
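
What I have in mind is roughly the following two-pass sketch; I am assuming
that DataFileReader's previousSync()/seek() pair is the right tool for this
(SyncIndex and the file argument are made up):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SyncIndex {
  public static void main(String[] args) throws Exception {
    File file = new File(args[0]);
    List<Long> offsets = new ArrayList<>();

    // Pass 1: while reading, remember the last sync point before each datum.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        offsets.add(reader.previousSync());
        reader.next();
      }
    }

    // Pass 2: jump back to one of the remembered sync points and read on.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      if (!offsets.isEmpty()) {
        reader.seek(offsets.get(offsets.size() / 2)); // seek() expects a previousSync() value
        if (reader.hasNext()) {
          // Reading resumes at the block boundary, not at an arbitrary datum.
          System.out.println(reader.next());
        }
      }
    }
  }
}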

Greetings and thanks

Marius

Re: Avro divides large JSON schema string into parts - is this intentional?

Posted by Niels Basjes <Ni...@basjes.nl>.
Have a look at this
https://issues.apache.org/jira/browse/AVRO-1316

This is the bug that required the change: as far as I remember, the
generated schema literal was exceeding the 64KB limit Java puts on a single
string constant in a class file, so the code generator now splits it into
several parts and joins them again via the varargs parse().

Niels Basjes
