Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2012/09/01 02:55:07 UTC

[jira] [Updated] (PIG-2579) Support for multiple input schemas in AvroStorage

     [ https://issues.apache.org/jira/browse/PIG-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2579:
-------------------------------

    Attachment: PIG-2579.patch
                PIG-2579-avro_test_files.tar.gz

I updated Stan's original patch, rebasing it onto trunk. While I kept the core logic unchanged, I made the following modifications:
# Removed the glob-pattern-related code, since that is resolved in PIG-2492.
# Added an option 'multiple_schema' to AvroStorage. By default, AvroStorage assumes that all input files have the same schema, but if 'multiple_schema' is passed to the load function, it tries to merge all of the input schemas.
# Allowed multiple schemas with the same name. I use paths instead of names to identify schemas.
# Refactored code.
# Added unit tests.

I think the most debatable part is how to merge two different schemas into one. In short, the rules are as follows:
# Different primitive types can be merged if certain conditions are met. Please see AvroStorageUtils.mergeType() for more details.
# Only the same kind of complex types can be merged. e.g. record + record => ok, but record + array => error.
# For records, the union of fields is returned.
# For arrays/maps, their element types/value types are merged.
# For unions, the union of unions is returned.
# For fixed types, only those of the same size can be merged.
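
To make the rules above concrete, here is a minimal Java sketch over a toy schema type. It is not the code from the patch: Node, Kind, and merge() are hypothetical names, and the numeric widening order (int < long < float < double) is only an assumption standing in for the real conditions in AvroStorageUtils.mergeType(). It also assumes that a field present in both records gets its two types merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class SchemaMergeSketch {
    // Numeric kinds come first, narrowest to widest (assumed widening order).
    enum Kind { INT, LONG, FLOAT, DOUBLE, STRING, RECORD, ARRAY, MAP, UNION, FIXED }

    /** Toy schema node; a hypothetical stand-in for Avro's Schema class. */
    static final class Node {
        final Kind kind;
        final Map<String, Node> fields = new TreeMap<>(); // RECORD fields by name
        final List<Node> branches = new ArrayList<>();    // UNION branches
        Node inner;                                       // ARRAY element or MAP value type
        int size;                                         // FIXED size in bytes

        Node(Kind kind) { this.kind = kind; }

        Node with(String name, Node type) { fields.put(name, type); return this; }

        @Override public String toString() {
            switch (kind) {
                case RECORD: return "record" + fields;
                case ARRAY:  return "array<" + inner + ">";
                case MAP:    return "map<" + inner + ">";
                case UNION:  return "union" + branches;
                case FIXED:  return "fixed(" + size + ")";
                default:     return kind.name().toLowerCase(Locale.ROOT);
            }
        }
    }

    static Node prim(Kind kind) { return new Node(kind); }
    static Node record() { return new Node(Kind.RECORD); }
    static Node array(Node element) { Node n = new Node(Kind.ARRAY); n.inner = element; return n; }
    static Node fixed(int size) { Node n = new Node(Kind.FIXED); n.size = size; return n; }
    static Node union(Node... branches) {
        Node n = new Node(Kind.UNION);
        n.branches.addAll(Arrays.asList(branches));
        return n;
    }

    static boolean isNumeric(Kind k) { return k.ordinal() <= Kind.DOUBLE.ordinal(); }

    static Node merge(Node x, Node y) {
        // Different kinds are only allowed when both are numeric primitives (assumption).
        if (x.kind != y.kind && !(isNumeric(x.kind) && isNumeric(y.kind))) {
            throw new IllegalArgumentException("cannot merge " + x + " with " + y);
        }
        switch (x.kind) {
            case RECORD: { // union of fields; shared fields are merged recursively (assumption)
                Node r = new Node(Kind.RECORD);
                r.fields.putAll(x.fields);
                for (Map.Entry<String, Node> e : y.fields.entrySet()) {
                    r.fields.merge(e.getKey(), e.getValue(), SchemaMergeSketch::merge);
                }
                return r;
            }
            case ARRAY:
            case MAP: { // merge element types / value types
                Node r = new Node(x.kind);
                r.inner = merge(x.inner, y.inner);
                return r;
            }
            case UNION: { // union of unions, skipping branches already present
                Node r = new Node(Kind.UNION);
                r.branches.addAll(x.branches);
                for (Node b : y.branches) {
                    boolean seen = r.branches.stream()
                            .anyMatch(a -> a.toString().equals(b.toString()));
                    if (!seen) r.branches.add(b);
                }
                return r;
            }
            case FIXED: { // only equal sizes can be merged
                if (x.size != y.size) {
                    throw new IllegalArgumentException("fixed sizes differ");
                }
                return fixed(x.size);
            }
            default: // primitives: same kind, or the wider of two numeric kinds
                return new Node(x.kind.ordinal() >= y.kind.ordinal() ? x.kind : y.kind);
        }
    }

    public static void main(String[] args) {
        Node a = record().with("id", prim(Kind.INT)).with("name", prim(Kind.STRING));
        Node b = record().with("id", prim(Kind.LONG)).with("score", prim(Kind.DOUBLE));
        System.out.println(merge(a, b)); // prints record{id=long, name=string, score=double}
    }
}
```

In this sketch, merging a record with an array throws IllegalArgumentException, mirroring the "record + array => error" rule. The actual behavior of the patch is best checked against TestAvroStorageUtils.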

The unit test TestAvroStorageUtils shows what to expect when two schemas are merged.

Please let me know if you have any questions/concerns.

Thanks!
                
> Support for multiple input schemas in AvroStorage
> -------------------------------------------------
>
>                 Key: PIG-2579
>                 URL: https://issues.apache.org/jira/browse/PIG-2579
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>    Affects Versions: 0.9.2, 0.11
>            Reporter: Stan Rosenberg
>            Assignee: Cheolsoo Park
>            Priority: Minor
>         Attachments: avro_storage_union_schema.patch, avro_storage_union_schema_test.tar.gz, PIG-2579-avro_test_files.tar.gz, PIG-2579.patch
>
>
> This is a barebones patch for AvroStorage which enables support of multiple input schemas. The assumption is that the input consists of Avro files having different schemas that can be unioned, e.g., flat records.
> A simple illustrative example is attached (avro_storage_union_schema_test.tar.gz): run create_avro1.pig, followed by create_avro2.pig, followed by read_avro.pig.
