Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2012/07/15 04:44:35 UTC

[jira] [Commented] (PIG-2492) AvroStorage should recognize globs and commas

    [ https://issues.apache.org/jira/browse/PIG-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414564#comment-13414564 ] 

Cheolsoo Park commented on PIG-2492:
------------------------------------

Hi,

I am interested in getting this jira resolved, so I posted a new patch [^PIG-2492.patch] that hopefully addresses concerns expressed here. To summarize, I did the following:

1) I used the functions that Hadoop provides instead of implementing my own glob pattern matching. This turned out to be slightly more complicated than what Scott described, for two reasons:
- _FileInputFormat.setInputPaths()_ doesn't find files in sub-directories. However, when the path is a directory, AvroStorage currently loads the files in that directory and its sub-directories recursively.
- AvroStorage needs to know the schema of the files to load, so it is necessary to expand the glob pattern in AvroStorage.

Nevertheless, I was able to implement glob/comma support using _FileSystem.globStatus()_ and _FileInputFormat.setInputPaths()_ without changing the current recursive load semantics.
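
As a rough illustration of the expand-then-recurse approach described in item 1, here is a minimal sketch. It is not the code in the patch: it uses java.nio on the local file system as a stand-in for Hadoop's _FileSystem.globStatus()_, Java's glob syntax may differ from Hadoop's in some details, and the class and method names (GlobExpansionSketch, splitOnTopLevelCommas, expand) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; local-file-system analogue of glob/comma expansion.
final class GlobExpansionSketch {

    /** Splits on commas that are not inside {...}, so "{a,b}/x,c" yields ["{a,b}/x", "c"]. */
    public static List<String> splitOnTopLevelCommas(String spec) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (char ch : spec.toCharArray()) {
            if (ch == '{') {
                depth++;
            } else if (ch == '}' && depth > 0) {
                depth--;
            }
            if (ch == ',' && depth == 0) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    /**
     * Expands each comma-separated glob relative to root. A matching regular
     * file is added as-is; a matching directory is loaded recursively, which
     * mirrors the existing directory-load semantics.
     */
    public static Set<Path> expand(Path root, String spec) throws IOException {
        Set<Path> files = new TreeSet<>();
        List<Path> candidates;
        try (Stream<Path> walk = Files.walk(root)) {
            candidates = walk.collect(Collectors.toList());
        }
        for (String pattern : splitOnTopLevelCommas(spec)) {
            PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
            for (Path candidate : candidates) {
                if (!matcher.matches(root.relativize(candidate))) {
                    continue;
                }
                if (Files.isDirectory(candidate)) {
                    // Matched a directory: include every regular file beneath it.
                    try (Stream<Path> nested = Files.walk(candidate)) {
                        nested.filter(Files::isRegularFile).forEach(files::add);
                    }
                } else {
                    files.add(candidate);
                }
            }
        }
        return files;
    }
}
```

Splitting only on top-level commas matters because a pattern such as {test_dir1,test_dir2}/test_glob*.avro contains commas that belong to the brace alternation rather than to the path list.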

2) URIs are handled properly because glob patterns are expanded by Hadoop, which already knows how to handle URIs.

3) The glob syntax is the same as what's supported in PigStorage since PigStorage also uses _FileInputFormat.setInputPaths()_ to expand glob patterns. Some examples are as follows:
{code}
test_dir1/*
test_dir1/test_glob{1,2,3}.avro
{test_dir1,test_dir2}/test_glob*.avro
{code}
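
To make the wildcard and brace semantics of these three patterns concrete, the following sketch checks a few sample paths against them. It relies on java.nio's glob matcher as an approximation of Hadoop's matching rather than on Hadoop itself, and the class name GlobExamples and the sample file names are assumptions chosen for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical demo class; java.nio globbing stands in for Hadoop's.
final class GlobExamples {

    /** Returns true if the relative path matches the glob pattern. */
    public static boolean matches(String pattern, String relativePath) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        String[] patterns = {
            "test_dir1/*",
            "test_dir1/test_glob{1,2,3}.avro",
            "{test_dir1,test_dir2}/test_glob*.avro"
        };
        // Sample paths are made up for this illustration.
        String[] paths = {
            "test_dir1/test_glob1.avro",
            "test_dir1/test_glob4.avro",
            "test_dir2/test_glob2.avro"
        };
        for (String pattern : patterns) {
            for (String path : paths) {
                System.out.println(pattern + " vs " + path + ": " + matches(pattern, path));
            }
        }
    }
}
```

In this approximation, * matches within a single path component only, and each {a,b,c} group matches exactly one of its alternatives.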

4) I assumed that all the files that match the glob pattern have the same schema. In fact, this is the same limitation that we have for loading a directory:
{quote}
If the input directory is a leaf directory, then we assume Avro data files in it have the same schema;
If the input directory contains sub-directories, then we assume Avro data files in all sub-directories have the same schema.
{quote}
https://cwiki.apache.org/PIG/avrostorage.html

5) I added 4 unit tests to verify the functionality, as follows:
- testDir verifies that AvroStorage recursively loads files in a directory and its sub-directories.
- testGlob1 through testGlob3 verify that glob patterns are expanded properly.

In addition to the patch, I uploaded some .avro files [^avro_test_files.tar.gz] that are needed for my tests. To run the tests, please do the following:
{code}
tar -xf avro_test_files.tar.gz
ant clean compile-test piggybank -Dhadoopversion=20
cd contrib/piggybank/java
ant test -Dtestcase=TestAvroStorage
{code}
Please let me know what you think.

Thanks!
                
> AvroStorage should recognize globs and commas
> ---------------------------------------------
>
>                 Key: PIG-2492
>                 URL: https://issues.apache.org/jira/browse/PIG-2492
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>    Affects Versions: 0.9.1, 0.10.0
>            Reporter: Stan Rosenberg
>         Attachments: AvroStorage.patch, AvroStorageUtils.patch, PIG-2492.patch, avro_test_files.tar.gz
>
>
> I've patched AvroStorage and AvroStorageUtils to support the same file input syntax as currently supported
> by hadoop's FileInputFormat.  Specifically, globs and commas are supported.
> Somebody should write some unit tests for these changes; I am currently pressed for time.
