Posted to dev@pig.apache.org by "Jerry Chen (JIRA)" <ji...@apache.org> on 2013/08/01 11:09:52 UTC

[jira] [Created] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders

Jerry Chen created PIG-3404:
-------------------------------

             Summary: Improve Pig to ignore bad files or inaccessible files or folders
                 Key: PIG-3404
                 URL: https://issues.apache.org/jira/browse/PIG-3404
             Project: Pig
          Issue Type: New Feature
          Components: data
    Affects Versions: 0.11.2
            Reporter: Jerry Chen


There are two use cases in Pig:
* A directory is used as the input of a load operation, and one or more files in that directory may be bad (for example, corrupted, or containing bad data caused by compression).
* A directory is used as the input of a load operation, and the current user may lack permission to access some of the subdirectories or files in that directory.

The current Pig implementation aborts the whole Pig job in such cases. It would be useful to have an option that lets the job continue and ignore the bad files or inaccessible files/folders instead of aborting, ideally logging or printing a warning for each such error or violation. This requirement is not trivial: with the big data sets used by large analytics applications, it is not always possible to sort out the good data before processing, so ignoring a few bad files may be the better choice in such situations.
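To make the requested behavior concrete, here is a minimal sketch in Python (not Pig code) of a directory load that either aborts on the first bad or inaccessible file or skips it with a warning. The names load_directory, read_records and ignore_bad_files are hypothetical and used only for illustration; they are not part of any Pig API.

```python
import gzip
import logging
import os
import zlib

log = logging.getLogger("load_sketch")


def read_records(path):
    """Yield lines from a plain or gzip-compressed text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def load_directory(root, ignore_bad_files=False):
    """Return (records, skipped_paths) for all files under root.

    Hypothetical illustration: with ignore_bad_files=False the first
    unreadable file or folder aborts the load; with True it is logged
    and skipped.
    """
    records, skipped = [], []

    def on_walk_error(err):
        # Called by os.walk when a directory cannot be listed.
        if not ignore_bad_files:
            raise err
        log.warning("Skipping inaccessible folder %s: %s", err.filename, err)
        skipped.append(err.filename)

    for dirpath, _dirs, files in os.walk(root, onerror=on_walk_error):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            try:
                # Buffer per file so a file that fails midway adds nothing.
                file_records = list(read_records(path))
            except (OSError, EOFError, zlib.error, UnicodeDecodeError) as err:
                if not ignore_bad_files:
                    raise
                log.warning("Skipping bad file %s: %s", path, err)
                skipped.append(path)
                continue
            records.extend(file_records)
    return records, skipped
```

Buffering each file before adding its records is one possible design choice: a file that turns out to be corrupt halfway through contributes no partial data. Whether partial records should be kept is a question the proposal itself leaves open.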

We propose an “ignore bad files” flag to address this problem. AvroStorage and the related file formats in Pig already have this flag, but it does not cover all of the cases mentioned above. We would improve PigStorage and the related text formats to support the new flag, and improve AvroStorage and the related facilities to support the concept completely.

The flag is set per storage function (for example, PigStorage or AvroStorage) and can be set independently for each load operation. Its value is false unless it is explicitly set. Ideally, we would also provide a global Pig parameter that forces the default value to true for all load functions when the flag is not explicitly set in the LOAD statement.
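The precedence described above could be sketched as follows, again in Python rather than Pig. The function name effective_ignore_bad_files and its parameters are hypothetical. The sketch assumes that an explicit per-load setting always wins and that the global parameter only changes the default; the proposal does not spell this out.

```python
from typing import Optional


def effective_ignore_bad_files(load_option: Optional[bool],
                               global_default: bool = False) -> bool:
    """Resolve the flag for a single load operation.

    load_option is None when the LOAD statement does not set the flag.
    global_default models the proposed global parameter (assumed name).
    """
    if load_option is not None:
        # Assumption: an explicit per-load setting takes precedence.
        return load_option
    # Not set explicitly: fall back to the global default (false unless forced).
    return global_default
```

Under these assumptions, a load that does not mention the flag stays strict unless the global parameter is turned on, and a load that explicitly sets the flag to false stays strict either way.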

