You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Cheolsoo Park <ch...@cloudera.com> on 2012/12/03 20:22:10 UTC

Re: Review Request: PIG-3015 Rewrite of AvroStorage

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review13962
-----------------------------------------------------------


Overall looks great! I haven't gone through the test cases yet, but here are my comments so far.


1) I noticed that I cannot load .avro files that are not record types. For example, I tried to load a .avro file whose schema is "int" as follows:

[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar getschema foo2/test_int.avro 
"int"

[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar tojson foo2/test_int.avro 
1

in = LOAD 'foo2/test_int.avro' USING AvroStorage('int');
DUMP in;

This gives me the following error:

Caused by: java.io.IOException: avroSchemaToResourceSchema only processes records

Can only Avro record type be loaded? Or am I doing something wrong?


2) TestAvroStorage needs to be more automated. To run it, I had to run the following commands:

ant clean compile-test
cd ./test/org/apache/pig/builtin/avro
python createests.py
cd -
ant clean test -Dtestcase=TestAvroStorage

Ideally, I should be able to run a single command: ant clean -Dtestcase=TestAvroStorage. Please let me know if you need help for this.


3) python createests.py fails with the following errors. I suppose that some files are missing:

creating data/avro/uncompressed/testDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException: data/json/testDirectoryCounts.json (No such file or directory)
...
creating evenFileNameTestDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException: data/json/evenFileNameTestDirectoryCounts.json (No such file or directory)
...


4) ant test -Dtestcase=TestAvroStorage fails with the following errors. I suppose that this is due to the missing files:

Testcase: testLoadDirectory took 0.005 sec
    FAILED
Testcase: testLoadGlob took 0.004 sec
    FAILED
Testcase: testPartialLoadGlob took 0.005 sec
    FAILED


5) Typo in the name of createests.py. It should be createtests.py.


6) Is createTests.bash needed at all? If not, can you remove it?


I have more comments inline:


ivy/libraries.properties
<https://reviews.apache.org/r/8104/#comment29844>

    Can you update .eclipse.templates/.classpath too?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29845>

    Can you remove this comment?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29850>

    Can you change Exception to SchemaParseException?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29907>

    Wouldn't it be better to make this static?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29847>

    FileSystem.get(new Configuration()).open() will look for a file on the default FS that is specified by Hadoop conf. But it won't work in the following case:
    
    1) The default FS is set by installed Hadoop conf as follows:
    <name>fs.default.name</name>
    <value>hdfs://cheolsoo-mr1-0.ent.cloudera.com:8020</value>
    
    2) Run Pig in *local* mode and specify a schema file that is on local FS. This gives me the following error:
    java.io.FileNotFoundException: File does not exist: /user/cheolsoo/test_record.avsc
    
    3) Even if I specify the uri schema as file:/, this still gives me the following error:
    java.lang.IllegalArgumentException: Wrong FS: file:/home/cheolsoo/pig-svn/test_record.avsc, expected: hdfs://cheolsoo-mr1-0.ent.cloudera.com:8020
    
    
    Can you instead do the following?
    
    Path p = new Path(...);
    (FileSystem.get(p.toUri(), new Configuration()).open(p);
    
    Then, local FS will be used if file:/ is specified in path, and the default FS is if no uri scheme is specified.



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29848>

    Same problem as above.



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29849>

    Can you split this into two catch blocks: ParseException and IOException?
    
    It seems incorrect that the help message is printed when FileSystem throws an IOExcetion.



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29851>

    Can you filter out files that start with "." as well?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29852>

    Same as above.



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29853>

    This won't work in the following case. Let's say p matches two dirs, and one dir is empty.
    
    p = foo*
    
    foo1
    foo2/bar.avro
    
    I would expect the schema of bar.avro is returned, but I get an IOException instead.



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29854>

    Typo: not => No.



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29857>

    Can you rename this class without Fast?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29856>

    Is this needed?
    
    In the constructor, schema is supposed to be set. If not, there must be an error. Shouldn't we throw an exception instead of re-trying to set schema?
    
    Please correct me if I am wrong.



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29855>

    Can you replace ".avro" with AvroOutputFormat.EXT?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29870>

    Can you move this comment to the checkSchema() method?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29858>

    Can you move this comment to the prepareToWrite() method?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29859>

    Can you instead use log.debug(..., e)?



src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29865>

    Can you instead use log.debug(..., e)?



src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29860>

    Can you filter out files that start with "." as well?



src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29861>

    AvroStorage accepts files that do not end ".avro". Shouldn't TrevniStorage do the same?



src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29862>

    Is this needed?
    
    In the constructor, schema is supposed to be set. If not, there must be an error. Shouldn't we throw an exception instead of re-trying to set schema?



src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29863>

    AvroStorage accepts files that do not end ".avro". Shouldn't TrevniStorage do the same?



src/org/apache/pig/impl/util/AvroRecordReader.java
<https://reviews.apache.org/r/8104/#comment29864>

    I can't find where -ignoreErrors is used. I guess that error handling for bad files is not implemented yet?



src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29866>

    Can you instead use log.debug(..., e)?



src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29867>

    Can you instead use log.debug(..., e)?



src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29871>

    How about a union type that contains a single data type (e.g. ["string"])? They're currently supported.



src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29868>

    Can you instead use log.debug(..., e)?



src/org/apache/pig/impl/util/AvroTupleWrapper.java
<https://reviews.apache.org/r/8104/#comment29869>

    Can you instead use log.debug(..., e)?


- Cheolsoo Park


On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8104/
> -----------------------------------------------------------
> 
> (Updated Nov. 17, 2012, 5:28 a.m.)
> 
> 
> Review request for pig and Cheolsoo Park.
> 
> 
> Description
> -------
> 
> The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)
> 
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.
> 
> This is the latest version of the patch, complete with test cases and TrevniStorage. (Test cases for TrevniStorage are still missing).
> 
> 
> This addresses bug PIG-3015.
>     https://issues.apache.org/jira/browse/PIG-3015
> 
> 
> Diffs
> -----
> 
>   build.xml 7d468a0 
>   ivy.xml 70e8d50 
>   ivy/libraries.properties 317564f 
>   src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
>   src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroTupleWrapper.java PRE-CREATION 
>   test/commit-tests 5081fbc 
>   test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/createTests.bash PRE-CREATION 
>   test/org/apache/pig/builtin/avro/createests.py PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/deflate/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/deflate/recordsAsOutputByPig.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/snappy/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordWithRepeatedSubRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsAsOutputByPig.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArrays.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArraysOfRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchema.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchemaNullable.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithEnums.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithFixed.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMaps.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMapsOfRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithNullableUnions.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recursiveRecord.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsAsOutputByPig.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsOfArrays.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsOfArraysOfRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsSubSchema.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsSubSchemaNullable.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithEnums.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithFixed.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithMaps.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithMapsOfRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithNullableUnions.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
>   test/unit-tests 0f18a0e 
> 
> Diff: https://reviews.apache.org/r/8104/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Joseph Adler
> 
>


Re: Review Request: PIG-3015 Rewrite of AvroStorage

Posted by Joseph Adler <ja...@linkedin.com>.

> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/builtin/AvroStorage.java, lines 171-172
> > <https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line171>
> >
> >     Same problem as above.

Fixing this one within getAvroSchema


> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/builtin/AvroStorage.java, lines 382-388
> > <https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line382>
> >
> >     Is this needed?
> >     
> >     In the constructor, schema is supposed to be set. If not, there must be an error. Shouldn't we throw an exception instead of re-trying to set schema?
> >     
> >     Please correct me if I am wrong.

Pretty sure you're right about this one (and that this code is redundant).


> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/builtin/TrevniStorage.java, line 160
> > <https://reviews.apache.org/r/8104/diff/1/?file=191565#file191565line160>
> >
> >     AvroStorage accepts files that do not end ".avro". Shouldn't TrevniStorage do the same?

Good point... though I realize that I've defined "visible avro files" and "visible trevni files" methods that are probably not useful. I should probably just drop the methods.


> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/impl/util/AvroRecordReader.java, lines 110-118
> > <https://reviews.apache.org/r/8104/diff/1/?file=191568#file191568line110>
> >
> >     I can't find where -ignoreErrors is used. I guess that error handling for bad files is not implemented yet?

No, I haven't implemented it yet. I suspect that the best way to implement the error ignoring functionality is from within Pig, and should apply to all file types (not just Avro)... I'll add that discussion to the right JIRA thread


> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java, lines 85-91
> > <https://reviews.apache.org/r/8104/diff/1/?file=191571#file191571line85>
> >
> >     How about a union type that contains a single data type (e.g. ["string"])? They're currently supported.

Good point; that's a trivial change. Adding that


> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/impl/util/AvroTupleWrapper.java, line 163
> > <https://reviews.apache.org/r/8104/diff/1/?file=191572#file191572line163>
> >
> >     Can you instead use log.debug(..., e)?

Just added the exception to the line logging line above


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review13962
-----------------------------------------------------------


On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8104/
> -----------------------------------------------------------
> 
> (Updated Nov. 17, 2012, 5:28 a.m.)
> 
> 
> Review request for pig and Cheolsoo Park.
> 
> 
> Description
> -------
> 
> The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)
> 
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.
> 
> This is the latest version of the patch, complete with test cases and TrevniStorage. (Test cases for TrevniStorage are still missing).
> 
> 
> This addresses bug PIG-3015.
>     https://issues.apache.org/jira/browse/PIG-3015
> 
> 
> Diffs
> -----
> 
>   build.xml 7d468a0 
>   ivy.xml 70e8d50 
>   ivy/libraries.properties 317564f 
>   src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
>   src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroTupleWrapper.java PRE-CREATION 
>   test/commit-tests 5081fbc 
>   test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/createTests.bash PRE-CREATION 
>   test/org/apache/pig/builtin/avro/createests.py PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/deflate/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/deflate/recordsAsOutputByPig.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/snappy/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordWithRepeatedSubRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsAsOutputByPig.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArrays.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArraysOfRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchema.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchemaNullable.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithEnums.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithFixed.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMaps.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMapsOfRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithNullableUnions.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recursiveRecord.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsAsOutputByPig.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsOfArrays.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsOfArraysOfRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsSubSchema.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsSubSchemaNullable.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithEnums.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithFixed.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithMaps.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithMapsOfRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithNullableUnions.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
>   test/unit-tests 0f18a0e 
> 
> Diff: https://reviews.apache.org/r/8104/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Joseph Adler
> 
>


Re: Review Request: PIG-3015 Rewrite of AvroStorage

Posted by Joseph Adler <ja...@linkedin.com>.

> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > Overall looks great! I haven't gone through the test cases yet, but here are my comments so far.
> > 
> > 
> > 1) I noticed that I cannot load .avro files that are not record types. For example, I tried to load a .avro file whose schema is "int" as follows:
> > 
> > [cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar getschema foo2/test_int.avro 
> > "int"
> > 
> > [cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar tojson foo2/test_int.avro 
> > 1
> > 
> > in = LOAD 'foo2/test_int.avro' USING AvroStorage('int');
> > DUMP in;
> > 
> > This gives me the following error:
> > 
> > Caused by: java.io.IOException: avroSchemaToResourceSchema only processes records
> > 
> > Can only Avro record type be loaded? Or am I doing something wrong?
> > 
> > 
> > 2) TestAvroStorage needs to be more automated. To run it, I had to run the following commands:
> > 
> > ant clean compile-test
> > cd ./test/org/apache/pig/builtin/avro
> > python createests.py
> > cd -
> > ant clean test -Dtestcase=TestAvroStorage
> > 
> > Ideally, I should be able to run a single command: ant clean -Dtestcase=TestAvroStorage. Please let me know if you need help for this.
> > 
> > 
> > 3) python createests.py fails with the following errors. I suppose that some files are missing:
> > 
> > creating data/avro/uncompressed/testDirectoryCounts.avro
> > Exception in thread "main" java.io.FileNotFoundException: data/json/testDirectoryCounts.json (No such file or directory)
> > ...
> > creating evenFileNameTestDirectoryCounts.avro
> > Exception in thread "main" java.io.FileNotFoundException: data/json/evenFileNameTestDirectoryCounts.json (No such file or directory)
> > ...
> > 
> > 
> > 4) ant test -Dtestcase=TestAvroStorage fails with the following errors. I suppose that this is due to the missing files:
> > 
> > Testcase: testLoadDirectory took 0.005 sec
> >     FAILED
> > Testcase: testLoadGlob took 0.004 sec
> >     FAILED
> > Testcase: testPartialLoadGlob took 0.005 sec
> >     FAILED
> > 
> > 
> > 5) Typo in the name of createests.py. It should be createtests.py.
> > 
> > 
> > 6) Is createTests.bash needed at all? If not, can you remove it?
> > 
> > 
> > I have more comments inline:

Sounds like the python script isn't working completely correctly. I'll debug that script and make sure it generates all the required files.

Can I take you up on your offer to help automate that build process? I'm not exactly sure what to modify to automatically run the python script to create the test files.


> On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
> > src/org/apache/pig/builtin/AvroStorage.java, lines 296-305
> > <https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line296>
> >
> >     This won't work in the following case. Let's say p matches two dirs, and one dir is empty.
> >     
> >     p = foo*
> >     
> >     foo1
> >     foo2/bar.avro
> >     
> >     I would expect the schema of bar.avro is returned, but I get an IOException instead.

Added proper depth first search to find the first file. (I decided to sort by modification date, most recent first.)


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review13962
-----------------------------------------------------------


On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8104/
> -----------------------------------------------------------
> 
> (Updated Nov. 17, 2012, 5:28 a.m.)
> 
> 
> Review request for pig and Cheolsoo Park.
> 
> 
> Description
> -------
> 
> The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)
> 
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.
> 
> This is the latest version of the patch, complete with test cases and TrevniStorage. (Test cases for TrevniStorage are still missing).
> 
> 
> This addresses bug PIG-3015.
>     https://issues.apache.org/jira/browse/PIG-3015
> 
> 
> Diffs
> -----
> 
>   build.xml 7d468a0 
>   ivy.xml 70e8d50 
>   ivy/libraries.properties 317564f 
>   src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
>   src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java PRE-CREATION 
>   src/org/apache/pig/impl/util/AvroTupleWrapper.java PRE-CREATION 
>   test/commit-tests 5081fbc 
>   test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION 
>   test/org/apache/pig/builtin/avro/createTests.bash PRE-CREATION 
>   test/org/apache/pig/builtin/avro/createests.py PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/deflate/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/deflate/recordsAsOutputByPig.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/snappy/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordWithRepeatedSubRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/records.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsAsOutputByPig.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArrays.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArraysOfRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchema.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchemaNullable.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithEnums.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithFixed.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMaps.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMapsOfRecords.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithNullableUnions.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/avro/uncompressed/recursiveRecord.avro PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsAsOutputByPig.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsOfArrays.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsOfArraysOfRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsSubSchema.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsSubSchemaNullable.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithEnums.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithFixed.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithMaps.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithMapsOfRecords.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recordsWithNullableUnions.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION 
>   test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
>   test/unit-tests 0f18a0e 
> 
> Diff: https://reviews.apache.org/r/8104/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Joseph Adler
> 
>