Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/12/01 17:47:59 UTC

[jira] [Commented] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

    [ https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712581#comment-15712581 ] 

Rohini Palaniswamy commented on PIG-3891:
-----------------------------------------

Comments:
   - CHANGES.txt will be modified when committing, so the patch does not need to change it.
   - Please revert the changes to ExecType and TezMiniCluster. We can't change public static members to package protected, as users are already using them. Once PIG-4923 goes in, we can add TEZ and SPARK there.
   - In TestMRJobStats, can you change "The returned output size is expected to be the same as the file size" to "The returned output size is expected to be sum of file sizes in the sub-directories"?
   - We try to avoid if (Tez) else (MR) conditions in tests as much as possible. For the testOutputStats test in TestMultiStorage, can we just do the following asserts with hardcoded values instead of getting the values from MR and Tez counters? That makes the test more solid. Also, please add a FILTER statement for out2 that filters out a couple of records, so that its bytes and records are not the same as those of out1.
{code}
Map<String, Long> multiStoreCounters = dagStats.getMultiStoreCounters();
PigStats stats = job.getStatistics();
assertEquals(HardCodedValueHere, stats.getBytesWritten());
List<OutputStats> outputStats = SimplePigStats.get().getOutputStats();
assertEquals(2, outputStats.size()); // 2 split conditions
assertEquals(HardCodedValueHere, outputStats.get(0).getBytes());
assertEquals(HardCodedValueHere, outputStats.get(1).getBytes());
assertEquals(HardCodedValueHere, outputStats.get(0).getRecords());
assertEquals(HardCodedValueHere, outputStats.get(1).getRecords());
assertEquals(9L, multiStoreCounters.get("Output records in _1_out2").longValue());
assertEquals(9L, multiStoreCounters.get("Output records in _0_out1").longValue());


> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -----------------------------------------------------------------------------
>
>                 Key: PIG-3891
>                 URL: https://issues.apache.org/jira/browse/PIG-3891
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Nandor Kollar
>         Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, PIG-3891-4.patch
>
>
> FileBasedOutputSizeReader only includes files in the top-level output directory. If files are stored under subdirectories (for example, by MultiStorage), it does not report the bytes written correctly.
> 0.11 shows the correct number of total bytes written, so this is a regression. A quick look at the code shows that JobStats.addOneOutputStats() in 0.11 also does not iterate recursively, and its code is the same as that of FileBasedOutputSizeReader. We need to investigate where the correct value comes from in 0.11 and fix it in 0.12.1/0.13.
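
The difference between a top-level-only sum and a recursive sum, as described in the issue, can be illustrated with a minimal sketch. It is not Pig's implementation: it uses java.nio.file on the local filesystem as a stand-in for Hadoop's FileSystem API, and the class name OutputSizeSketch, its method names, and the file names are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical illustration only; not the FileBasedOutputSizeReader code.
class OutputSizeSketch {

    // Sums only the regular files directly under dir,
    // which is the behaviour the issue reports.
    static long topLevelSize(Path dir) throws IOException {
        long total = 0;
        try (Stream<Path> entries = Files.list(dir)) {
            for (Path p : (Iterable<Path>) entries::iterator) {
                if (Files.isRegularFile(p)) {
                    total += Files.size(p);
                }
            }
        }
        return total;
    }

    // Sums the regular files at any depth below dir,
    // so output written into sub-directories is counted as well.
    static long recursiveSize(Path dir) throws IOException {
        long total = 0;
        try (Stream<Path> entries = Files.walk(dir)) {
            for (Path p : (Iterable<Path>) entries::iterator) {
                if (Files.isRegularFile(p)) {
                    total += Files.size(p);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a small output layout: one file at the top, one in a sub-directory.
        Path out = Files.createTempDirectory("out");
        Files.write(out.resolve("part-00000"), new byte[10]);
        Path sub = Files.createDirectory(out.resolve("key1"));
        Files.write(sub.resolve("key1-0"), new byte[25]);

        System.out.println("top-level only: " + topLevelSize(out));
        System.out.println("recursive:      " + recursiveSize(out));
    }
}
```

With a layout like the one MultiStorage produces, where most data sits under per-key sub-directories, the top-level sum misses those files and the recursive sum includes them.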


