Posted to dev@pig.apache.org by "Nandor Kollar (JIRA)" <ji...@apache.org> on 2016/07/20 13:32:20 UTC

[jira] [Commented] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

    [ https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385858#comment-15385858 ] 

Nandor Kollar commented on PIG-3891:
------------------------------------

The patch for FileBasedOutputSizeReader looks good to me, but I think a new test case is required in TestMRJobStats so that this case is tested there and not just in piggybank. I attached a 3rd version of the patch with a new test case, and I changed the MultiStore test case too. It seems that FileBasedOutputSizeReader is used when the script is executed in batch mode and stores its result in multiple stores (using MultiStores in subdirectories) rather than in just one (for a single store command, the mapreduce counters are taken into account).

[~rohini] Wouldn't testGetOutputSizeUsingFileBasedStorage in TestMRJobStats test the file size if it is a file and not a path? With the patch applied, this test is green.

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -----------------------------------------------------------------------------
>
>                 Key: PIG-3891
>                 URL: https://issues.apache.org/jira/browse/PIG-3891
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Nandor Kollar
>         Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch
>
>
> FileBasedOutputSizeReader only includes files in the top level output directory. So if files are stored under subdirectories (e.g. MultiStorage), it does not report the bytes written correctly.
> 0.11 shows the correct number of total bytes written, so this is a regression. A quick look at the code shows that JobStats.addOneOutputStats() in 0.11 also does not iterate recursively, and the code is the same as FileBasedOutputSizeReader. Need to investigate where the correct value comes from in 0.11 and fix it in 0.12.1/0.13.
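
The difference between a top-level-only size calculation and a recursive one can be illustrated with a minimal sketch. It assumes a local filesystem and uses java.nio.file rather than the Hadoop FileSystem API that Pig works with. The class and method names (RecursiveOutputSize, topLevelSize, recursiveSize) are hypothetical and are not taken from the attached patches.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical illustration only; not the code from PIG-3891-3.patch.
public class RecursiveOutputSize {

    // Sums regular files directly under dir and ignores subdirectories,
    // which mirrors the behaviour described in the issue.
    public static long topLevelSize(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                          .mapToLong(RecursiveOutputSize::sizeOf)
                          .sum();
        }
    }

    // Walks the whole tree below path, so files written under
    // subdirectories are counted too. If path is itself a regular
    // file, the walk yields just that file.
    public static long recursiveSize(Path path) throws IOException {
        try (Stream<Path> entries = Files.walk(path)) {
            return entries.filter(Files::isRegularFile)
                          .mapToLong(RecursiveOutputSize::sizeOf)
                          .sum();
        }
    }

    // Wraps the checked exception so the method can be used in a stream.
    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small output layout: one top-level file and one file
        // in a subdirectory (file and directory names are made up).
        Path out = Files.createTempDirectory("out");
        Files.write(out.resolve("part-00000"), new byte[10]);
        Path sub = Files.createDirectory(out.resolve("key1"));
        Files.write(sub.resolve("key1-0000"), new byte[25]);

        System.out.println(topLevelSize(out));  // prints 10
        System.out.println(recursiveSize(out)); // prints 35
    }
}
```

In this sketch the top-level sum misses the 25 bytes stored under the subdirectory, which is the kind of undercount the issue describes for outputs that write into subdirectories.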


