Posted to dev@pig.apache.org by "Jim Huang (JIRA)" <ji...@apache.org> on 2014/09/16 15:55:34 UTC

[jira] [Updated] (PIG-4175) PIG CROSS operation followed by STORE produces non-deterministic results each run

     [ https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Huang updated PIG-4175:
---------------------------
    Description: 
Three files will be attached to help visualize this issue.

1. mktestdata.py - to generate test data to feed the pig script
2. test_cross.pig - the PIG script using CROSS and STORE
3. test_cross.out - the PIG console output showing the input/output records delta

To reproduce this problem, use the supplied Python script, mktestdata.py, to generate an
input file of at least 13,948,228,930 bytes (about 13.9 GB).
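
The attached mktestdata.py is not reproduced in this message. As a rough illustration only, a generator that keeps writing records until a byte threshold is reached could look like the sketch below. The function name, the four-field tab-separated layout, and the random integer values are all assumptions and are not taken from the attachment.

```python
import io
import random


def write_test_data(out, min_bytes, fields=4, seed=0):
    """Write tab-separated records to `out` until at least `min_bytes` are written.

    Hypothetical stand-in for mktestdata.py; the record layout is assumed.
    Returns (records_written, bytes_written).
    """
    rng = random.Random(seed)
    records = 0
    written = 0
    while written < min_bytes:
        row = "\t".join(str(rng.randint(0, 999999)) for _ in range(fields)) + "\n"
        data = row.encode("ascii")
        out.write(data)
        written += len(data)
        records += 1
    return records, written


# Small in-memory demonstration. For the real reproduction, `out` would be a
# file opened in "wb" mode and min_bytes would be 13_948_228_930.
buf = io.BytesIO()
n_records, n_bytes = write_test_data(buf, min_bytes=10_000)
```

Returning the record count gives the value of m to compare against the number of records that STORE reports.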

The CROSS between raw_data (m records) and cross_count (1 record) should yield exactly m records as output.
However, the STORE output from the CROSS operations contained only about 1/3 of the input records in raw_data.
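
The expected record count can be checked on a small scale outside of Pig. The sketch below models CROSS as a plain Cartesian product in Python. The `cross` helper and the sample data are illustrative assumptions and do not reflect how Pig implements the operator.

```python
from itertools import product


def cross(*relations):
    """Cartesian product of relations, flattening each combination into one tuple."""
    return [sum(combo, ()) for combo in product(*relations)]


# Stand-in data: m = 1000 input records and one single-record relation.
raw_data = [(i, "row%d" % i) for i in range(1000)]
cross_count = [(1000,)]

out = cross(raw_data, cross_count)

# The output size is the product of the input sizes: m * 1 = m.
expected = len(raw_data) * len(cross_count)
```

The same helper accepts more than two relations. Crossing raw_data with several single-record relations still gives m records, so any smaller count in the stored output means records were lost.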

If I combined both of the CROSS operations into a single statement, as in the line below, the STORE output
contained about 2/3 of the input records in raw_data.
-- data = CROSS raw_data, field04s_count, subsection1_field04s_count, subsection2_field04s_count;


We have reproduced this using both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop 2.x) clusters.  
The default HDFS block size is 128MB.  
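
For scale, the minimum input size and the block size above determine how many HDFS blocks the input file occupies. The arithmetic below assumes that 128MB means 128 * 2**20 bytes. The number of map tasks actually launched depends on the input format and job settings, so this is only an estimate of how widely the file is spread.

```python
INPUT_BYTES = 13_948_228_930        # minimum input size quoted in the report
BLOCK_BYTES = 128 * 1024 * 1024     # 128 MB block size, assumed to be 2**27 bytes

# Ceiling division using integers only, to avoid floating-point rounding.
blocks = (INPUT_BYTES + BLOCK_BYTES - 1) // BLOCK_BYTES

# Bytes held by the final, partially filled block.
last_block_bytes = INPUT_BYTES - (blocks - 1) * BLOCK_BYTES
```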




> PIG CROSS operation followed by STORE produces non-deterministic results each run
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-4175
>                 URL: https://issues.apache.org/jira/browse/PIG-4175
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.12.0
>         Environment: RHEL 6/64-bit
>            Reporter: Jim Huang


