Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2012/08/02 11:27:02 UTC

[jira] [Created] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Cheolsoo Park created PIG-2856:
----------------------------------

             Summary: AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.
                 Key: PIG-2856
                 URL: https://issues.apache.org/jira/browse/PIG-2856
             Project: Pig
          Issue Type: Bug
          Components: piggybank
    Affects Versions: 0.10.0
            Reporter: Cheolsoo Park
            Assignee: Cheolsoo Park


This is a regression from PIG-2492.

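When a glob pattern such as '*' matches not only files but also directories, AvroStorage does not load files in the directories. This is a bug in getAllSubDirs(): for each matched directory, it calls listStatus() on the original glob path instead of on the matched directory itself. It can be fixed as follows: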

{code}
static boolean getAllSubDirs(Path path, Job job, Set<Path> paths)
...
FileStatus[] matchedFiles = fs.globStatus(path, PATH_FILTER);
...
for (FileStatus file : matchedFiles) {
    if (file.isDir()) {
-        for (FileStatus sub : fs.listStatus(path)) {
+        for (FileStatus sub : fs.listStatus(file.getPath())) {
            getAllSubDirs(sub.getPath(), job, paths);
        }
    }
}
{code}
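
For illustration only, here is a minimal sketch of the same idea using java.nio.file rather than Hadoop's FileSystem API. The class GlobWalk and its methods are hypothetical names and are not part of AvroStorage or of the patch. The point it shows is that the recursive step must list the directory that matched the pattern, not the pattern itself.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical analogue of the fix, written against java.nio.file instead of Hadoop. */
class GlobWalk {

    /** Collects every regular file reachable from the entries of dir whose names match glob. */
    static Set<Path> collect(Path dir, String glob) throws IOException {
        Set<Path> files = new TreeSet<>();
        try (DirectoryStream<Path> matched = Files.newDirectoryStream(dir, glob)) {
            for (Path match : matched) {
                addRecursively(match, files);
            }
        }
        return files;
    }

    private static void addRecursively(Path path, Set<Path> files) throws IOException {
        if (Files.isDirectory(path)) {
            // List the matched directory itself (the analogue of file.getPath()),
            // not the glob pattern that produced the match.
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    addRecursively(child, files);
                }
            }
        } else {
            files.add(path);
        }
    }
}
```

With a layout such as a.avro, test_dir1/b.avro and test_dir1/test_subdir/c.avro, calling GlobWalk.collect(root, "*") should return all three files, which is the behaviour the patch is meant to restore for AvroStorage.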

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429009#comment-13429009 ] 

Santhosh Srinivasan commented on PIG-2856:
------------------------------------------

Addendum to previous comment - it's unrelated to this patch and existed previously.
                

        

[jira] [Updated] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-2856:
-------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    

        

[jira] [Updated] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2856:
-------------------------------

    Attachment: PIG-2856.patch

Attached is a patch that fixes the bug in getAllSubDirs() and updates the unit test testGlob1.

Regarding the test, "expected_test_dir_1.avro" includes files in test_dir1 but doesn't include the ones in its sub-directory test_subdir. On the other hand, "expected_testDir.avro" includes files not only in test_dir1 but also in its sub-directory test_subdir.

Since all files in test_dir1 and its sub-directory are supposed to be loaded, "expected_testDir.avro" is used.
                

        

[jira] [Updated] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2856:
-------------------------------

           Patch Info: Patch Available
    Affects Version/s:     (was: 0.10.0)
                       0.11
    

        

[jira] [Updated] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2856:
-------------------------------

    Status: Patch Available  (was: Open)

Review board:
https://reviews.apache.org/r/6318/
                

        

[jira] [Commented] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428981#comment-13428981 ] 

Santhosh Srinivasan commented on PIG-2856:
------------------------------------------

All unit tests pass except TestDBStorage (Hadoop 20) and TestMultiStorage (Hadoop 20 and Hadoop 23). Patch has been committed. Thanks Cheolsoo!
                

        

[jira] [Updated] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-2856:
-------------------------------------

    Fix Version/s: 0.11
    

        

[jira] [Updated] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2856:
-------------------------------

    Attachment: PIG-2856-2.patch

Regarding why this problem was not caught by testGlob1, there are actually two reasons:

# The expected output was incorrect (as mentioned above).
# The job status was not checked at all. So even though the job failed, the test still passed if it generated the expected output. In testGlob1, the job failed after loading 3 files, but since that happened to be the expected output, the test still passed. 

I've updated the patch so that not only is the expected output for testGlob1 corrected, but the job status is also checked.

Thanks!
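
As a rough sketch of the second point, a test helper can refuse to compare output until it has confirmed that the job succeeded. JobCheck and JobOutcome below are hypothetical names used only for illustration; they are not the classes used in TestAvroStorage or in the patch.

```java
import java.util.List;

/** Hypothetical test helper; JobOutcome stands in for whatever the test harness reports. */
class JobCheck {

    /** Hypothetical summary of a finished job: whether it succeeded and what it wrote. */
    static final class JobOutcome {
        final boolean succeeded;
        final List<String> records;

        JobOutcome(boolean succeeded, List<String> records) {
            this.succeeded = succeeded;
            this.records = records;
        }
    }

    /** Passes only when the job succeeded and produced the expected records. */
    static void verify(JobOutcome outcome, List<String> expected) {
        // Check the status first: a failed job may still leave partial output behind.
        if (!outcome.succeeded) {
            throw new AssertionError("job failed; its output cannot be trusted");
        }
        if (!outcome.records.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + outcome.records);
        }
    }
}
```

Under this scheme, a job that fails after writing records that happen to match the expected output is still reported as a test failure.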
                

        

[jira] [Commented] (PIG-2856) AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429008#comment-13429008 ] 

Santhosh Srinivasan commented on PIG-2856:
------------------------------------------

Forgot to add that TestLookupInFiles in Hadoop 23 is erroring out.
                