Posted to dev@pig.apache.org by "Nandor Kollar (JIRA)" <ji...@apache.org> on 2016/09/12 10:52:20 UTC

[jira] [Commented] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

    [ https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483786#comment-15483786 ] 

Nandor Kollar commented on PIG-3891:
------------------------------------

Attached patch:
- executed TestMultiStorage in Tez mode; after minor adjustments it passes there too. When I ran the tests, it seemed that in Tez mode the statistics written to the console are not collected via FileBasedOutputSizeReader: I could see the correct values there even without the fix. In MR mode, however, the console output was incorrect without the recursive traversal fix in FileBasedOutputSizeReader. I don't know what changes I should make in MRJobStats and JobStats; those tests passed in both Tez and MR mode.
- in TestMultiStorage I added asserts for getMultiStoreCounters
- renamed the method, added a comment
- in addition, I fixed typos in methods in TestMRJobStats.java

[~rohini] could you please take a look at the latest patch?

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -----------------------------------------------------------------------------
>
>                 Key: PIG-3891
>                 URL: https://issues.apache.org/jira/browse/PIG-3891
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Nandor Kollar
>         Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, PIG-3891-4.patch
>
>
> FileBasedOutputSizeReader only includes files in the top-level output directory. So if files are stored under subdirectories (e.g. by MultiStorage), it does not report the bytes written correctly.
> 0.11 shows the correct number of total bytes written, so this is a regression. A quick look at the code shows that JobStats.addOneOutputStats() in 0.11 also does not iterate recursively, and the code is the same as in FileBasedOutputSizeReader. Need to investigate where the correct value comes from in 0.11 and fix it in 0.12.1/0.13.
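
For illustration only, the difference between a top-level listing and a recursive traversal can be sketched as below. This is not the code from the attached patches: it uses java.nio.file on a local directory as a stand-in for the Hadoop FileSystem API, and the class and method names (OutputSizeSketch, topLevelSize, recursiveSize) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch: local-filesystem stand-in, not Pig's FileBasedOutputSizeReader.
public class OutputSizeSketch {

    // Sums only the regular files directly under dir.
    // Files inside subdirectories are ignored, as described in the issue.
    static long topLevelSize(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                          .mapToLong(OutputSizeSketch::sizeOf)
                          .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Walks the whole tree under dir and sums every regular file,
    // so output written into per-key subdirectories is counted too.
    static long recursiveSize(Path dir) {
        try (Stream<Path> entries = Files.walk(dir)) {
            return entries.filter(Files::isRegularFile)
                          .mapToLong(OutputSizeSketch::sizeOf)
                          .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Wraps the checked IOException so the method can be used in a stream.
    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an output directory that holds one 10-byte file at the top level and one 25-byte file in a subdirectory, topLevelSize returns 10 while recursiveSize returns 35. On HDFS the analogous traversal would presumably go through FileSystem.listStatus or FileSystem.listFiles(path, true); that mapping is an assumption here, not something taken from the patch.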



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)