Posted to dev@pig.apache.org by "Joseph Adler (JIRA)" <ji...@apache.org> on 2012/10/29 22:16:12 UTC

[jira] [Created] (PIG-3015) Rewrite of AvroStorage

Joseph Adler created PIG-3015:
---------------------------------

             Summary: Rewrite of AvroStorage
                 Key: PIG-3015
                 URL: https://issues.apache.org/jira/browse/PIG-3015
             Project: Pig
          Issue Type: Improvement
          Components: piggybank
            Reporter: Joseph Adler


The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493550#comment-13493550 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

I hate breaking backwards compatibility. (One of the reasons for doing the rewrite is that Avro broke backwards compatibility.) But I think we have some good reasons to do so here:

- Options for AvroStorage are very different from the options for other storage functions in Pig. In moving AvroStorage to builtin, it makes sense for AvroStorage to behave as closely as possible to PigStorage, etc.
- The huge number of crazy options makes the code slow and complicated.
- There are good workarounds for many changes in the options. For example, all the weird stuff about selecting a schema using an index could be easily changed to explicit schema definitions.
- It gets harder to make changes with time. This is probably the best opportunity to make the options simpler and clearer.
                
> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>            Assignee: Joseph Adler
>
> The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.


[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Mike Naseef (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489794#comment-13489794 ] 

Mike Naseef commented on PIG-3015:
----------------------------------

We are very excited about this direction, as we were considering a private rewrite of AvroStorage to address some of the same issues. I want to +1 passing the schema into the LoadFunc. The old AvroStorage is very slow and a resource hog when we have a directory hierarchy to scan, even when we set the no_schema_check property. Furthermore, we occasionally have issues with Pig jobs picking the old schema when we have a schema update. Manually specifying the schema would fix this (option 1a should cover this as well) and give us more flexibility in defining the data we want Pig to pull from a file.
                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Attachment:     (was: PIG-3015.patch)
    

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Russell Jurney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493555#comment-13493555 ] 

Russell Jurney commented on PIG-3015:
-------------------------------------

Actually, I reverse my position. Get this in builtin as soon as possible. Give people one Pig version to get off the pipe, and then we kill the old one.

Ship it.
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Russell Jurney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493049#comment-13493049 ] 

Russell Jurney commented on PIG-3015:
-------------------------------------

The existing method of storing to multiple locations is so strange... let's call that part a bug fix? We can enable storing to more than one place without the weird argument workaround using the new outputSchema interface, can't we?
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486427#comment-13486427 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

Thank you very much for opening the jira. I have recently worked on AvroStorage by myself, and I totally agree with you. Since you already have code to contribute, this is even better. :-)

As part of the rewrite, I would also like to propose migrating AvroStorage from Piggybank to the core Pig. I have 2 reasons for this:
# AvroStorage is widely used, so it makes sense to include it in the core Pig rather than in Piggybank.
# Until migration is complete, we can maintain both versions (new one in core Pig and old one in Piggybank) to avoid breaking backward compatibility. For me, another motivation for the rewrite is to clean up the funny options that the current AvroStorage has, so I think that breaking backward compatibility is unavoidable.

I asked this question on the [user mailing list|http://mail-archives.apache.org/mod_mbox/pig-user/201208.mbox/%3C27EE5059-F811-4E19-B1A3-951B4BB3BDDF%40hortonworks.com%3E] a while ago, and nobody disagreed. But please let me know if anyone has objections.


To start with, I am wondering if you can post your code as a patch to this jira and the review board. Assuming that we're going to move AvroStorage to the core Pig, you can probably create a new package called "org.apache.pig.backend.hadoop.avro" and add your code there. If you could break your patch into smaller pieces and attach them to sub-tasks of this jira, that would be helpful too. 

Please let me know what you think.

Thanks!
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491073#comment-13491073 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Started working on this now. Two questions:

(1) I'm a new contributor. What's the best way to organize the code within Pig? I have a lot of helper classes and methods, and would like to put different classes in different files to maximize readability. Should I put the helper classes in an existing package (org.apache.pig.impl.builtin seems like the closest match, though still not quite right), create a new package for the helper classes, or do something else? I couldn't find documentation on the best way to do this.

(2) Here's what I came up with for options: the first argument is either an explicit schema or specifies the record names if a schema is automatically generated. The second argument is a list of options (like in PigStorage):

- {{-namespace}}: Namespace for an automatically generated output schema.
- {{-ignoreerrors}}: Tells the function to ignore errors in input files.
- {{-schemafile}}: Specifies the URL of an Avro schema file from which to read the input schema (can be a local file, HDFS, URL, etc.).
- {{-examplefile}}: Specifies the URL of an Avro data file from which to copy the input schema (can be a local file, HDFS, URL, etc.).

I considered adding an explicit "-schema" flag for providing a schema, but I would have had to do something much more complicated to parse the options correctly if an option could include a JSON schema. (Plus, I don't think the meaning of the first argument will be ambiguous: it will either be a valid JSON object describing a schema or a valid name.)
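The argument handling described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the PIG-3015 patch: the class name `AvroStorageArgsSketch` and both method names are hypothetical, the schema check simply tests for a leading `{`, and the option string is split on whitespace. The whitespace split also shows why a JSON schema embedded in the option string would be awkward to parse, since a schema normally contains spaces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the class from the PIG-3015 patch. */
final class AvroStorageArgsSketch {

    /** Assumption: an explicit schema is a JSON object, so it starts with '{' once trimmed. */
    static boolean looksLikeJsonSchema(String firstArg) {
        return firstArg != null && firstArg.trim().startsWith("{");
    }

    /**
     * Splits an option string such as "-namespace com.example -ignoreerrors"
     * into a map. Flags that take no value map to an empty string.
     */
    static Map<String, String> parseOptions(String optionString) {
        Map<String, String> options = new LinkedHashMap<>();
        if (optionString == null || optionString.trim().isEmpty()) {
            return options;
        }
        String[] tokens = optionString.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (!token.startsWith("-")) {
                throw new IllegalArgumentException("Expected an option flag, got: " + token);
            }
            String name = token.substring(1);
            // A following token that is not itself a flag is treated as this option's value.
            if (i + 1 < tokens.length && !tokens[i + 1].startsWith("-")) {
                options.put(name, tokens[++i]);
            } else {
                options.put(name, "");
            }
        }
        return options;
    }

    public static void main(String[] args) {
        String first = "{\"type\": \"record\", \"name\": \"r\", \"fields\": []}";
        System.out.println("explicit schema? " + looksLikeJsonSchema(first));
        System.out.println("explicit schema? " + looksLikeJsonSchema("MyRecord"));
        System.out.println(parseOptions("-namespace com.example -ignoreerrors"));
    }
}
```

A real implementation would more likely validate the first argument with an Avro schema parser and use a proper command-line parsing library for the options; the sketch only shows the shape of the two-argument design.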

                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501345#comment-13501345 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

PIG-2614 lets the user configure the following properties: 
{code}
public static final String BAD_RECORD_THRESHOLD_CONF_KEY = "pig.piggybank.storage.avro.bad.record.threshold";
public static final String BAD_RECORD_MIN_COUNT_CONF_KEY = "pig.piggybank.storage.avro.bad.record.min";
{code}
I agree with replacing {{-ignoreerrors}} with these properties.
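To make the trade-off concrete, here is a rough sketch of how a threshold-based policy could replace an all-or-nothing {{-ignoreerrors}} flag. The semantics are an assumption, not taken from PIG-2614: the threshold is read as the maximum tolerated fraction of bad records, and the minimum count as the number of bad records that must be seen before the fraction is checked at all. The class and method names are hypothetical.

```java
/** Hypothetical bad-record policy; the semantics are assumed, not taken from PIG-2614. */
final class BadRecordPolicySketch {
    private final double threshold;   // assumed: maximum tolerated fraction of bad records
    private final long minBadRecords; // assumed: below this many bad records, never fail
    private long total;
    private long bad;

    BadRecordPolicySketch(double threshold, long minBadRecords) {
        this.threshold = threshold;
        this.minBadRecords = minBadRecords;
    }

    /** Counts a record that was read successfully. */
    void recordGood() {
        total++;
    }

    /** Counts a record that failed to decode; returns true if the job should now fail. */
    boolean recordBad() {
        total++;
        bad++;
        return shouldFail();
    }

    /** Fails only once enough bad records were seen AND their fraction exceeds the threshold. */
    boolean shouldFail() {
        return total > 0
                && bad >= minBadRecords
                && (double) bad / total > threshold;
    }

    public static void main(String[] args) {
        BadRecordPolicySketch policy = new BadRecordPolicySketch(0.01, 2);
        for (int i = 0; i < 1000; i++) {
            policy.recordGood();
        }
        // A single bad record among many good ones does not fail the job here.
        System.out.println("fail after one bad record? " + policy.recordBad());
    }
}
```

With a policy like this, one bad record out of a very large input is tolerated, while a file that is mostly corrupt still fails quickly.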
                
> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>            Assignee: Joseph Adler
>         Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.


[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487592#comment-13487592 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

The list of options that you described looks like a good start. I think that we should definitely start with a small set of options, but it may be a good idea to keep in mind what options we eventually want to add. So here are my questions:

*LoadFunc*
{quote}
(a) Just pick the schema from the most recent file
(b) Check all the files to make sure the schemas are compatible
{quote}
I haven't checked out your repository, so please correct me if I am wrong. I assume that your storage converts Avro schema to Pig schema during the load? If so, how do you convert multiple (compatible but different) schemas to one Pig schema? The current storage has an option called 'multiple_schemas' to merge multiple schemas into one.
{quote}
(2) Use a schema manually provided by the user
{quote}
Do we need this option for LoadFunc? Is this for when the input Avro files do not have an embedded schema?

Does your storage also have limits on unions and recursive records like the current storage? In fact, recursive records are now supported by PIG-2875.

How about corrupted files? Currently, we have an option to skip corrupted files (ignore_bad_files) instead of failing on them.

*StoreFunc*
{quote}
(2) Use a schema manually provided by the user
{quote}
The current storage provides three ways of specifying the output schema:
# A JSON string can be given (option: schema).
# The schema of an existing Avro file (.avro) can be used (option: same).
# An Avro schema file (.avsc) can be used (option: schema_file).

Are you going to support the same?

How about multiple stores with different output schemas? The current storage has the 'index' option, which allows the user to specify a different output schema for each store.

Thanks!
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506099#comment-13506099 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Hi Timothy:

I have not tried the patch with Pig 0.10, but I don't know of any reason why it would not work. Give it a spin and let us know what happens.

-- Joe
                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Status: Open  (was: Patch Available)

replacing with revised patch
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Russell Jurney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500875#comment-13500875 ] 

Russell Jurney commented on PIG-3015:
-------------------------------------

I suggest checking out the work Jon did in PIG-2614. One bad record out of a billion killing a job is almost always absurd.
                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Status: Patch Available  (was: Open)

Revised patch; reflects comments and suggestions from review board
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509964#comment-13509964 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joe,

Thanks for your prompt response!

To answer your questions,
{quote}
I have always assumed that AvroStorage was designed to be used with Hadoop sequence files that contained a series of records, so I implemented AvroStorage to only work with a file in this format. Are there cases where the highest level schema for a file will be another type? If so... what does that mean for pig? Is there one record per file?
{quote}
This is a good question, and I see your argument. But this will be very different from what the current AvroStorage does. Currently, a non-record type is automatically wrapped in a tuple. For example, "1" is loaded as (1) in Pig. If a file includes multiple values, they are loaded as multiple tuples as follows:
{code:title=avro}
cheolsoo@localhost:~/workspace/avro $java -jar avro-tools-1.5.4.jar getschema multiple_int.avro 
"int"
cheolsoo@localhost:~/workspace/avro $java -jar avro-tools-1.5.4.jar tojson multiple_int.avro 
1
2
3
{code}
{code:title=pig}
in = LOAD 'multiple_int.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
DUMP in;
(1)
(2)
(3)
{code}
I agree that we could tell users that the top-level schema must be a record type, but I am afraid that people might not accept that. In my experience, people tend to think that every valid Avro file should be loadable by AvroStorage. Granted, there are some restrictions (e.g. recursive records and unions), but even these restrictions have been loosened recently. Unless there is a convincing reason not to, I think that we should keep it that way.

In many cases, people already have a data pipeline in place (e.g. Flume produces Avro files => Pig consumes Avro files), and it is not guaranteed that the top-level schema is always a record type.
{quote}
Here's a specific example: suppose that we have this schema:
\{"name" : "IntArray", "type" : "array", "items" : "int"\}
Suppose that we have 3 files to load, each with this schema, each containing an array of 10 integers. Should we load this into pig as a single bag with 30 integers? A bag containing three bags (each, in turn, containing 10 integers)? Or reject this file entirely?
{quote}
Currently, they are loaded as 3 tuples, and each tuple contains a bag of 10 integers.
{code}
({(1),(2), ... ,(10)})
({(1),(2), ... ,(10)})
({(1),(2), ... ,(10)})
{code}
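The wrapping behavior described in this comment can be sketched roughly as below. This is not the piggybank code: plain `java.util.List` objects stand in for Pig tuples and bags, Avro arrays are assumed to arrive as Java lists, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; plain lists stand in for Pig tuples and bags. */
final class TopLevelWrapSketch {

    /** Builds a single-field "tuple". */
    static List<Object> tuple(Object field) {
        List<Object> t = new ArrayList<>();
        t.add(field);
        return t;
    }

    /**
     * Wraps one top-level, non-record datum in a tuple.
     * Assumption: an Avro array arrives as a java.util.List and becomes a
     * bag whose elements are single-field tuples.
     */
    static List<Object> wrap(Object datum) {
        if (datum instanceof List) {
            List<Object> bag = new ArrayList<>();
            for (Object item : (List<?>) datum) {
                bag.add(tuple(item));
            }
            return tuple(bag);
        }
        return tuple(datum);
    }

    /** Each top-level value in a file yields one tuple. */
    static List<List<Object>> load(List<?> datums) {
        List<List<Object>> tuples = new ArrayList<>();
        for (Object datum : datums) {
            tuples.add(wrap(datum));
        }
        return tuples;
    }

    public static void main(String[] args) {
        // Three top-level ints, as in multiple_int.avro above.
        System.out.println(load(Arrays.asList(1, 2, 3)));
        // One top-level array of ints.
        System.out.println(wrap(Arrays.asList(1, 2, 3)));
    }
}
```

Under this reading, three files that each hold one array produce three tuples, each containing one bag, rather than a single merged bag.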
Thoughts?
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Timothy Potter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503923#comment-13503923 ] 

Timothy Potter commented on PIG-3015:
-------------------------------------

Can this patch be applied to Pig 0.10?
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496378#comment-13496378 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

Thanks for the update. I support what you're proposing. I appreciate your effort to clean up the code. Just to be clear, I have the following questions:
{quote}
Test pig scripts will be kept in discrete files, with parameters as file names. I'll modify the test runner to set the runtime parameters correctly.
{quote}
Ideally, all Pig unit tests should be written this way. Currently, Pig queries are hard-coded in the test code, which is not very nice. But changing that is going to be a long-term effort. Your changes for this jira will be isolated in {{TestAvroStorage}}, won't they? If not, can you please provide more detail? I am just trying to understand the scope of your proposal.
{quote}
I'm thinking about modifying the build process to compile human readable files (in JSON format) into avro files before running the tests.
{quote}
This will be fully automated in the current framework (ant + junit), so I can run {{ant test -Dtestcase=TestAvroStorage}} to run the unit test cases, right? One exception might be a test case for corrupted Avro files, I guess.

Thanks!
                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Attachment:     (was: PIG-3015.patch)
    

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498579#comment-13498579 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

First of all, thank you so much!

Secondly, considering the size of the patch, would you mind uploading it to the RB? This will encourage more people to review it.
https://reviews.apache.org/

You can choose pig-git to upload a diff file from the github repository.

Thirdly, I haven't fully read the patch yet and will do once it's uploaded on the RB. But I have a few minor comments as below:
- Can you please add the Apache license header to every new file?
- Can you please remove @author tags?
- Can you please replace {{System.err.println()}} with {{common.logging.log}}?
- Our indentation convention is 4 spaces and no tabs. You used 2 spaces, and I see 2 tabs in {{directory_test.pig}}.

Lastly, your bash script should probably be replaced by a Python script (or another cross-platform script) because there is an ongoing effort to port Pig to Windows (PIG-2793). In particular, if TestAvroStorage is added to the unit test suite, this will be an issue. Please feel free to open a sub-task for converting it to Python if you'd like to get help.
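
A cross-platform replacement for the bash script could look roughly like the sketch below. It is only an illustration, not the script from the patch: it assumes the test data is kept as pairs of {{name.json}} and {{name.avsc}} files in one directory, and that an avro-tools jar is available at a path supplied by the caller (both are assumptions). It relies on the {{fromjson}} command of avro-tools, which writes the Avro container file to standard output.

```python
import subprocess
from pathlib import Path


def find_test_inputs(data_dir):
    """Return (json, schema, avro) path triples for each *.json with a sibling *.avsc.

    The naming convention (foo.json + foo.avsc -> foo.avro) is an assumption
    made for this sketch, not something taken from the patch.
    """
    triples = []
    for json_file in sorted(Path(data_dir).glob("*.json")):
        schema_file = json_file.with_suffix(".avsc")
        if schema_file.exists():
            triples.append((json_file, schema_file, json_file.with_suffix(".avro")))
    return triples


def build_command(avro_tools_jar, json_file, schema_file):
    """Build the avro-tools invocation as an argument list (no shell needed)."""
    return [
        "java", "-jar", str(avro_tools_jar),
        "fromjson", "--schema-file", str(schema_file), str(json_file),
    ]


def generate_avro_files(avro_tools_jar, data_dir):
    """Compile every JSON test file into an Avro file next to it."""
    for json_file, schema_file, avro_file in find_test_inputs(data_dir):
        # Redirect stdout in Python instead of using shell redirection,
        # so the same code runs on Windows and Unix.
        with open(avro_file, "wb") as out:
            subprocess.run(
                build_command(avro_tools_jar, json_file, schema_file),
                stdout=out,
                check=True,
            )
```

Passing an argument list to {{subprocess.run}} avoids any dependence on a particular shell, which is the main reason to prefer this over a bash script for the Windows port.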
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Russell Jurney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492680#comment-13492680 ] 

Russell Jurney commented on PIG-3015:
-------------------------------------

I agree that we should replace the old AvroStorage with this one, and that we should make AvroStorage a builtin.

However, I don't think it's acceptable to break backwards compatibility with the existing AvroStorage, and having two implementations at once seems confusing. It would be best to extend this implementation with the features required to maintain compatibility with the Piggybank AvroStorage before committing it as a builtin.

It sounds like you're on top of this, Joe and Cheolsoo :) I'll be a tester.
                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Patch Info: Patch Available
    

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Johannes Schwenk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510372#comment-13510372 ] 

Johannes Schwenk commented on PIG-3015:
---------------------------------------

First of all I want to say many thanks Joseph, for all the great work on this so far! This will be very useful for my work. 

By the way, you probably know of PIG-2684, about the existing AvroStorage implementation having problems with the <code><alias_name>::</code> prefix that Pig's join operations add. What is your solution to this issue in the new implementation?
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499348#comment-13499348 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

I have made all the changes that you suggested (including rewriting the script that builds test cases in Python) and have uploaded the new version to the RB: https://reviews.apache.org/r/8104/
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490824#comment-13490824 ] 

Alan Gates commented on PIG-3015:
---------------------------------

+1 for moving it into Pig proper.  Avro is a common format and it makes sense to guarantee support for it in Pig.
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487179#comment-13487179 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Just reading through the discussion on the user list.

I'll check out trunk, refactor/rename as needed, make sure it passes existing tests, fix bugs, then submit the patches. That will probably take me a few days to do.

Additionally, I'd like to get a few things correct the first time. Specifically, I'm trying to figure out how to deal with the plethora of possible options for load/store functions. I want to make sure that I cover all the important use cases regarding schemas. Here's the list that I came up with:

LoadFunc:
(1) Read the schema from the input file(s)
  (a) Just pick the schema from the most recent file
  (b) Check all the files to make sure the schemas are compatible
(2) Use a schema manually provided by the user

StoreFunc:
(1) Automatically translate the Pig schema to an Avro Schema
(2) Use a schema manually provided by the user
  (a) Allow the user to name the record and namespace
  (b) Automatically pick a record and namespace name
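
To make the StoreFunc cases concrete, here is a rough Python sketch of translating a flat Pig schema into an Avro record schema. It is not the code from the patch. The type table, the default record name {{pig_output}}, and the choice to wrap every field in a union with "null" are assumptions made for illustration; nested tuples, bags, and maps are left out.

```python
import json

# Assumed mapping from simple Pig types to Avro primitive types.
PIG_TO_AVRO = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "chararray": "string",
    "bytearray": "bytes",
    "boolean": "boolean",
}


def pig_schema_to_avro(fields, record_name="pig_output", namespace=None):
    """Translate a list of (field_name, pig_type) pairs into an Avro record schema.

    record_name and namespace may be supplied by the user; otherwise a
    default record name (hypothetical) is picked automatically.
    """
    avro_fields = []
    for name, pig_type in fields:
        avro_type = PIG_TO_AVRO[pig_type]
        # Any Pig field may be null, so each field becomes a union with "null".
        avro_fields.append({"name": name, "type": ["null", avro_type]})
    schema = {"type": "record", "name": record_name, "fields": avro_fields}
    if namespace:
        schema["namespace"] = namespace
    return schema


print(json.dumps(pig_schema_to_avro([("id", "long"), ("name", "chararray")]), indent=2))
```

Calling it with {{record_name}} and {{namespace}} set corresponds to letting the user name the record; calling it without them corresponds to picking the names automatically.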

                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491741#comment-13491741 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

I put the code in o.a.impl.util. Not a big deal to move it later if that's the preferred style.
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488201#comment-13488201 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

1) Using different functions sounds OK to me, but couldn't we handle them via args using CommandLineParser? IMHO, this is simpler and more scalable. Another advantage of using CommandLineParser is that we don't have to infer the meaning of arguments based on the number of arguments. Other built-in storages (e.g. HBaseStorage) use CommandLineParser, so why don't we do the same to provide the universal syntax to the user across the project? Thoughts?

2) Multiple schema support
{quote}
this brings up another question: what does "compatible" mean in this case?
{quote}
Please refer to the rules listed in [PIG-2579|https://issues.apache.org/jira/browse/PIG-2579?focusedCommentId=13446546&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13446546]. I did this because several people asked for it. The use case is that people define Avro schemas, but they evolve over time. Since AvroStorage used to assume that all the input files have exactly the same schema, they couldn't load them. PIG-2579 was trying to address that inconvenience. Do you think that we should include similar functionality as an option in the new storage?

3) Recursive record support
{quote}
You can't specify a recursive schema in Pig, so why allow users to load files with recursive schemas in Pig? By default, recursive schema definitions should result in an error, or at least a warning message. I'd propose that this be allowed only as an option.
{quote}
Agreed (and guilty :-)). In fact, this was a feature request from one of my customers. The rationale was that people couldn't change their already-defined recursive schemas, but they wanted to do some processing on non-recursive parts of data. Providing it as an option sounds good to me.
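
For reference, detecting a recursive schema before deciding whether to raise an error or a warning can be sketched as follows. This is an illustrative Python sketch over the JSON form of an Avro schema, not the check used by either AvroStorage implementation. It assumes a valid schema and compares plain record names only (namespaces are ignored), which is a simplification.

```python
def is_recursive(schema, open_records=()):
    """Return True if a record refers, directly or indirectly, to itself.

    open_records holds the names of the records whose definitions we are
    currently inside; a by-name reference to one of them means recursion.
    """
    if isinstance(schema, str):
        # A primitive type name or a reference to a named type.
        return schema in open_records
    if isinstance(schema, list):
        # A union: recursive if any branch is recursive.
        return any(is_recursive(branch, open_records) for branch in schema)
    kind = schema["type"]
    if kind == "record":
        inner = open_records + (schema["name"],)
        return any(is_recursive(field["type"], inner) for field in schema["fields"])
    if kind == "array":
        return is_recursive(schema["items"], open_records)
    if kind == "map":
        return is_recursive(schema["values"], open_records)
    if kind in ("enum", "fixed"):
        return False
    # Remaining case: {"type": <nested schema or type name>}.
    return is_recursive(kind, open_records)


linked_list = {
    "type": "record",
    "name": "Node",
    "fields": [
        {"name": "value", "type": "int"},
        {"name": "next", "type": ["null", "Node"]},
    ],
}
print(is_recursive(linked_list))  # True
```

A loader could run such a check once on the input schema and then either fail or, when the user has opted in, continue with only the non-recursive parts.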

4) Multiple store support
{quote}
Can you explain the use case for multiple stores with different output schemas? I'm having a hard time understanding why it makes sense to do something complicated like that.
{quote}
I think that I wasn't clear. All I wanted to say is that if we have more than one relation to store in a script, we should be able to do it.
{code}
set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
{code}
The current storage supports multiple stores via the 'index' option. In fact, this is very hacky, and we should get rid of it. Nevertheless, I wanted to know if this will still be supported. On second thought, I think that your proposal already implies multiple store support because:
- The output schema will be derived from the Pig schema per store, or
- The user will specify the output schema per store.

So I don't see any problem.

Thanks!
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496730#comment-13496730 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Just TestAvroStorage, yes. I'm not trying to rewrite the whole test system, just clean up the AvroStorage tests. And yes, I'd want to either make an exception for corrupted Avro files or have a job that corrupts the files. 
                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Status: Patch Available  (was: In Progress)

Here is a patch with a working implementation (plus new unit tests and a bash script to generate the test data files; just run the bash script in the test/org/apache/pig/builtin/avro directory to generate all the avro files needed for testing)
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489800#comment-13489800 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Mike, thanks for your opinion. I agree that passing the input schema into the LoadFunc is a good improvement.

Please feel free to comment on other issues too. Hopefully, we can resolve as many issues as possible while re-writing AvroStorage.


                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Attachment: PIG-3015.patch

I added support for files whose top-level schema is not a record, and added an option for dealing with double colons in variable names.
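
As background on the double-colon problem: Avro names must start with a letter or underscore and contain only letters, digits, and underscores, so a Pig field name such as {{a::id}} (produced by a join) is not a valid Avro field name. The Python sketch below shows one possible way to handle it, by substituting a replacement string for the double colon. It is an illustration only; the function name and the default replacement are hypothetical and do not describe the option implemented in the patch.

```python
import re

# Name rule from the Avro specification: [A-Za-z_][A-Za-z0-9_]*
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def to_avro_field_name(pig_name, replacement="_"):
    """Replace the '::' that Pig joins add to field names, then validate.

    The replacement string is a hypothetical option for this sketch.
    """
    candidate = pig_name.replace("::", replacement)
    if not AVRO_NAME.match(candidate):
        raise ValueError("not a valid Avro name: %r" % candidate)
    return candidate


print(to_avro_field_name("a::id"))  # a_id
```

Note that a plain replacement can produce collisions (for example {{a::id}} and an existing field {{a_id}}), so a real implementation would also need to check for duplicate names.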
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509296#comment-13509296 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

I made most of the recommended changes (thanks for looking this over), and have a follow-up question:

I have always assumed that AvroStorage was designed to be used with Hadoop sequence files that contained a series of records, so I implemented AvroStorage to only work with a file in this format. Are there cases where the highest level schema for a file will be another type? If so... what does that mean for Pig? Is there one record per file?

Here's a specific example: suppose that we have this schema:

{"name" : "IntArray", "type" : "array", "items" : "int"}

Suppose that we have 3 files to load, each with this schema, each containing an array of 10 integers. Should we load this into Pig as a single bag with 30 integers? A bag containing three bags (each, in turn, containing 10 integers)? Or reject these files entirely?
                

[jira] [Work started] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on PIG-3015 started by Joseph Adler.


[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501340#comment-13501340 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

I just took a look at PIG-2614. It looks like the PIG-2614 patch will be compatible with this patch; PIG-2614 simply counts errors as values are read from a LoadFunc. Am I missing something? I'd be happy to drop the option to ignore bad records; I think that would make the options for this function cleaner and easier to understand.
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510033#comment-13510033 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Yes, it does. Thank you, sir!
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510629#comment-13510629 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Hi Johannes,

As you probably know, the Avro specification limits the set of valid characters in names (see http://avro.apache.org/docs/current/spec.html#Names). Names must

- start with [A-Za-z_]
- subsequently contain only [A-Za-z0-9_]

So double colons aren't allowed. PIG-2684 proposes using namespaces as the solution. I think that's a poor choice; namespaces are often used for other purposes. Specifically, namespaces are essential if you are writing complicated data processing software that processes multiple types of Avro serialized objects. In my experience, the Avro schema and protocol compilers produce much better, more usable code if you use namespaces.
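To make the naming rule concrete, here is a minimal sketch in Python (not part of the patch) that checks a name against the two character classes quoted above. The function name is_valid_avro_name is made up for illustration.

```python
import re

# The two rules from the Avro spec quoted above: first character in
# [A-Za-z_], every later character in [A-Za-z0-9_].
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_avro_name(name):
    """Return True if `name` satisfies the Avro name rules (hypothetical helper)."""
    return AVRO_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["user_id", "A::user_id", "9lives"]:
        print(candidate, is_valid_avro_name(candidate))
```

A Pig alias such as A::user_id fails the check because the colon is in neither character class.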

There are two good workarounds:

- The Pig user can rename variables in a bag before storing the bag using AvroStorage
- The Pig user can manually specify the output schema before storing the bag with AvroStorage

So, here's a specific suggestion:

- By default, throw an exception if the Pig schema contains a name with a double colon and the user does not specify an output schema
- Add an option to AvroStorage to transform double colons to something else. (Maybe double underscores? Maybe storing them in the namespace?)
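The suggestion above could behave roughly like the following Python sketch. It is only an illustration of the proposed behavior, not code from the patch; the function name to_avro_field_name and the replacement parameter are assumptions.

```python
def to_avro_field_name(pig_alias, replacement=None):
    """Map a Pig alias to an Avro field name (hypothetical helper).

    With no replacement configured, an alias containing "::" is rejected,
    which mirrors the proposed default. With a replacement such as "__",
    every "::" is rewritten instead.
    """
    if "::" in pig_alias:
        if replacement is None:
            raise ValueError(
                "alias %r contains '::'; rename the field, supply an "
                "output schema, or configure a replacement" % pig_alias
            )
        return pig_alias.replace("::", replacement)
    return pig_alias


if __name__ == "__main__":
    print(to_avro_field_name("A::user_id", replacement="__"))
```

Failing by default keeps a silent rename from surprising downstream readers of the Avro files; the rename only happens when the user asks for it.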

What do you think?


                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493010#comment-13493010 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Russell,

Thank you very much for offering help. :-)
{quote}
However, I don't think it's acceptable to break backwards-compatibility with the existing AvroStorage, and having two implementations at once seems confusing. It would be best to extend this implementation with those features required to maintain compatibility with the Piggybank AvroStorage before committing it as a builtin.
{quote}
Sure, we can wait until the new AvroStorage is complete before committing it, and I won't insist on maintaining two versions of AvroStorage if that's confusing to others.

But given that the new AvroStorage will have different options from the current AvroStorage, it seems unavoidable to introduce some backward incompatibility. For example, the options Joseph proposes are very different from those of the current AvroStorage. Would that be acceptable?
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486662#comment-13486662 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Thanks for the link.

You can upload the entire code as a single patch if you prefer. I suggested splitting it up only because big patches usually take longer to be reviewed and committed, but I will review this one at least.
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486489#comment-13486489 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Here's the working version: https://github.com/josephadler/fast-avro-storage

I can break that up into multiple Jira tickets, though that feels like a lot of extra work; I threw away all the existing code and started from scratch. I do think it's reasonable to separate AvroStorage and TrevniStorage for now (though they are very closely related).
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509992#comment-13509992 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

I think that approach makes sense; each object in a file should be wrapped in a Tuple. Suppose that a file example.avro contained the data:

  {[1, 2, 3, 4, 5]}
  {[6, 7, 8, 9, 10]}

and had this schema: {"name" : "IntArray", "type" : "array", "items" : "int"}, and we loaded this as

  A = LOAD 'example.avro' USING AvroStorage;

The bag A would have the Pig schema A:{(IntArray:{(int)})}; it would contain two tuples, which would in turn each contain one bag of integers. Does that sound correct? If so, I'll go implement that.
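As a rough illustration of that wrapping rule, here is a Python sketch that models Pig tuples as Python tuples and Pig bags as Python lists. It is not the AvroStorage code; wrap_datum and load are hypothetical names, and only the array case from this example is handled specially.

```python
def wrap_datum(datum):
    """Wrap one top-level Avro datum in a Pig-style tuple (hypothetical helper).

    An Avro array becomes a bag of single-field tuples, and that bag is the
    only field of the outer tuple. Any other datum is wrapped directly.
    """
    if isinstance(datum, list):
        bag = [(item,) for item in datum]
        return (bag,)
    return (datum,)


def load(datums):
    """Model the relation produced by LOAD: one tuple per top-level datum."""
    return [wrap_datum(d) for d in datums]


if __name__ == "__main__":
    A = load([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
    for t in A:
        print(t)
```

With the two arrays from example.avro, A holds two tuples, each with a single bag of five one-field tuples, which matches the schema A:{(IntArray:{(int)})}.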

                

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Attachment: PIG-3015.patch

Here's the generated patch file.
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491231#comment-13491231 ] 

Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

To answer your questions:

1) If I am not mistaken, o.a.p.impl.builtin is for internal built-in UDFs. I don't know exactly what your helper classes are like, but would o.a.p.impl.util be a better place?

Looking at the package tree, I also noticed that there is an *.impl.util package for each sub-component of Pig. So if your helper classes are AvroStorage-specific, you may want to create two new packages called o.a.p.hadoop.avro and o.a.p.hadoop.avro.util, and add AvroStorage to hadoop.avro and helper classes to hadoop.avro.util respectively.

Please anyone correct me if I am wrong here. I am a new committer. :-)

2) What you propose sounds good to me.

Thanks!
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487954#comment-13487954 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Before addressing the questions, I wanted to propose a naming scheme for the load and store functions. To be consistent with other Pig UDFs, I think it makes more sense to use different function names rather than passing different types of arguments to the UDF. Can I propose something like this:

LoadFuncs:

- AvroStorage. May be instantiated with zero, one, or two arguments. If called with no arguments, the function will load the schema from the most recent data file found in the specified path and use that schema. If called with one argument, the argument will be a String that specifies the input schema. The String may either contain the schema definition, may be a URI that refers to the location of the input schema in a file, or may be an example data file from which to read the schema. If two arguments are specified, the first argument refers to the type of the output records (the name of the type) and the second argument may be either a JSON string, a URI for a schema definition file, or a URI for an example file that contains the definition of that type.

 This function does not check schema compatibility of input files or allow recursive schema definitions. Fails when corrupted files are encountered.
- AvroStorage.AllowRecursive. Same as above, except this function does not check schema compatibility of input files but does allow recursive schema definitions. Recursively defined records are just defined as schemaless tuples in the Pig Schema.
- AvroStorage.IgnoreCorrupted. Same as above, except this function will not allow recursive schema definitions, but will not fail on corrupted input files.
- AvroStorage.AllowRecursiveAndIgnoreCorrupted. Same as above, except this function allows recursive definitions and does not fail on corrupted input files.


StoreFunc:

- AvroStorage. May be instantiated with zero, one, or two arguments; the meaning of the arguments can be inferred from how they are specified. If called with no arguments, the function will translate the Pig schema to an Avro schema, use a default name for the record types, and not assign a namespace to the records. If called with one argument, the argument will be a String that specifies either the output schema or the record name for the output records. When the String specifies the schema, it may contain the schema definition, may be a URI that refers to the location of the schema in a file, or may be an example data file from which to reuse the schema. If two arguments are specified, they may refer to the name and namespace for the output records. Alternatively, the first argument may refer to the type of the output records (the name of the schema), and the second argument may be either a JSON string, a URI for a schema definition file, or a URI for an example file that contains the definition of that type.


Answers to questions:

LoadFunc 1a: Yes, the storage function will convert Avro schemas to Pig schemas, and vice versa.

I haven't tried to convert multiple "compatible but different" schemas to one pig schema. I believe that if you manually supply a schema to the function that is a superset of all the schemas in the input data, the underlying Avro libraries will take care of this for you... though this brings up another question: what does "compatible" mean in this case? Personally, I do not think that the core Pig library should attempt to resolve this problem for users; I think it is best for users to load files with different load functions, cast and rename fields as appropriate in pig code, then take a union of the values. It's possible to miss real (and important) errors if Pig does a lot of type conversions and manipulations under the covers.

LoadFunc 2: I think this is necessary for a few reasons: It's faster to supply a schema manually (the Pig run time doesn't have to read files from HDFS at planning time to detect the schema). By specifying the schema, you can also specify a subset of fields to de-serialize, reducing the size of the input data. Finally, by specifying a schema manually, you can read a set of files with compatible but different schemas.

I think PIG-2875 is a design mistake. If I had been involved in the project, I would have argued hard against this. You can't specify a recursive schema in Pig, so why allow users to load files with recursive schemas in Pig? It is possible to load recursively defined records into pig, but that seems like a recipe for confusion and errors. By default, recursive schema definitions should result in an error, or at least a warning message. I'd propose that this be allowed only as an option.
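For reference, detecting a recursive definition is cheap. The Python sketch below walks an Avro schema (already parsed from JSON) and reports whether any record refers to itself or to an enclosing record. It is an illustration under simplifying assumptions, not the patch's code: namespaces are ignored, and is_recursive is a made-up name.

```python
def is_recursive(schema, enclosing=frozenset()):
    """Return True if a parsed Avro schema refers to a record that encloses it.

    `enclosing` holds the names of records currently being defined.
    Namespaces are ignored in this sketch (an assumption).
    """
    if isinstance(schema, str):
        # A bare name: recursive only if it names an enclosing record.
        return schema in enclosing
    if isinstance(schema, list):
        # A union: recursive if any branch is.
        return any(is_recursive(s, enclosing) for s in schema)
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            inner = enclosing | {schema["name"]}
            return any(is_recursive(f["type"], inner)
                       for f in schema.get("fields", []))
        if kind == "array":
            return is_recursive(schema["items"], enclosing)
        if kind == "map":
            return is_recursive(schema["values"], enclosing)
        # e.g. {"type": "string"} or a nested {"type": {...}}
        return is_recursive(kind, enclosing)
    return False


if __name__ == "__main__":
    linked_list = {
        "type": "record", "name": "Node",
        "fields": [{"name": "value", "type": "int"},
                   {"name": "next", "type": ["null", "Node"]}],
    }
    print(is_recursive(linked_list))
```

A loader could run a check like this at planning time and raise an error (or print a warning) unless the user explicitly opted in.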

Storefunc 2a:

I don't think it's hard to specify those three options. It's probably OK for the StoreFunc to allow the user to specify either a schema, a URI that refers to a schema file, or a URI that refers to an example file, then for the function to figure out what the argument means and do the right thing.
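One way the function could "figure out what the argument means" is sketched below in Python. The heuristic is an assumption made for illustration (inline JSON first, then a .avsc suffix, otherwise treat the argument as an example data file); it is not taken from the patch, and classify_schema_argument is a hypothetical name.

```python
import json


def classify_schema_argument(arg):
    """Guess what a single schema argument refers to (hypothetical heuristic).

    Returns "inline_schema", "schema_file", or "example_data_file".
    """
    text = arg.strip()
    # A JSON object, union, or quoted primitive name is an inline schema.
    if text[:1] in ("{", "[", '"'):
        try:
            json.loads(text)
            return "inline_schema"
        except ValueError:
            pass  # not valid JSON; fall through and treat it as a location
    # Assumption: schema definition files use the .avsc extension.
    if text.endswith(".avsc"):
        return "schema_file"
    return "example_data_file"


if __name__ == "__main__":
    print(classify_schema_argument(
        '{"name" : "IntArray", "type" : "array", "items" : "int"}'))
    print(classify_schema_argument("hdfs://namenode/schemas/user.avsc"))
    print(classify_schema_argument("/data/users/part-00000.avro"))
```

A real implementation would probably open the file and inspect its contents rather than trust the extension, which is the fragile part of this sketch.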

Can you explain the use case for multiple stores with different output schemas? I'm having a hard time understanding why it makes sense to do something complicated like that.


                

[jira] [Assigned] (PIG-3015) Rewrite of AvroStorage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park reassigned PIG-3015:
----------------------------------

    Assignee: Joseph Adler
    

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3015:
------------------------------

    Attachment: PIG-3015.patch

Revised patch (compiles together all changes)
                

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496363#comment-13496363 ] 

Joseph Adler commented on PIG-3015:
-----------------------------------

Progress update: I merged in the code, and am now working on test cases. I plan to submit the patches for review later this week.

Right now, I am working on unit tests for AvroStorage. Because AvroStorage is so complicated, I am trying to find ways to make the test cases easier to manage. (I don't like seeing a single test file with dozens of distinct test cases, and dozens of test data files in one directory). I feel like it's too hard to understand what's being tested and what's not being tested, and too hard to maintain the tests. AvroStorage is very complicated, and I think it's worth changing the test strategy to be more methodical and rigorous. Here's what I'm proposing:

(1) Test files will be kept in separate directories by file type: schema (AVSC) files, raw text input files, JSON-formatted input files, uncompressed Avro files, deflate-compressed Avro files, Snappy-compressed Avro files, uncompressed Avro output files, deflate-compressed Avro output files, and Snappy-compressed Avro output files.
(2) Test Pig scripts will be kept in discrete files, with the data file names passed in as parameters. I'll modify the test runner to set the runtime parameters correctly. (I think this increases the readability of the test cases and also helps with debugging; you can always type "java -cp pig.jar org.apache.pig.Main -x local -f test_file" to run the files outside the test harness and see what happens.)
(3) I'm thinking about modifying the build process to compile human-readable files (in JSON format) into Avro files before running the tests.
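
To make (2) concrete, here is a minimal sketch of how a test runner might assemble the command line for one script. It is only an illustration, not code from the patch: the class name PigScriptCommand, the parameter names INPUT and OUTPUT, and all file paths are hypothetical. It assumes parameters are passed with Pig's "-param name=value" option, in addition to the "-x local" and "-f" options quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Pig or of the PIG-3015 patch): builds the
// command line used to run one parameterized Pig test script in local mode.
final class PigScriptCommand {
    private PigScriptCommand() {}

    // Returns the argument list: java -cp <pigJar> org.apache.pig.Main -x local
    // followed by one "-param name=value" pair per entry, then "-f <script>".
    static List<String> build(String pigJar, String script, Map<String, String> params) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(pigJar);
        cmd.add("org.apache.pig.Main");
        cmd.add("-x");
        cmd.add("local");
        for (Map.Entry<String, String> e : params.entrySet()) {
            cmd.add("-param");
            cmd.add(e.getKey() + "=" + e.getValue());
        }
        cmd.add("-f");
        cmd.add(script);
        return cmd;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("INPUT", "avro/uncompressed/records.avro");   // hypothetical path
        params.put("OUTPUT", "output/records");                  // hypothetical path
        // Print the command so it can be pasted into a shell for debugging.
        System.out.println(String.join(" ", build("pig.jar", "scripts/identity.pig", params)));
    }
}
```

Inside the script, the values would be referenced as $INPUT and $OUTPUT. Printing the full command keeps the debugging benefit mentioned in (2), since the same line can be run by hand outside the test harness.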

What do you guys think?
                
