Posted to dev@pig.apache.org by "Zbigniew Rzepka (JIRA)" <ji...@apache.org> on 2015/10/07 15:53:26 UTC

[jira] [Created] (PIG-4695) Using 'replicated' left join results in different result from regular left join.

Zbigniew Rzepka created PIG-4695:
------------------------------------

             Summary: Using 'replicated' left join results in different result from regular left join.
                 Key: PIG-4695
                 URL: https://issues.apache.org/jira/browse/PIG-4695
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.15.0
            Reporter: Zbigniew Rzepka


There seems to be a difference in results between a regular LEFT JOIN and a replicated LEFT JOIN. This may only happen with very small data sets, as we're using the piece of code shown below in production with correct results.

Example:
I have two data sets:

first_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
(108,17,all_users,all_users)
(138,11,all_users,all_users)
{code}
second_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
{code}

When I use a regular LEFT JOIN on these two, I get the correct output:
{code:sql}
joined_periods_users = JOIN 
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
{code}

output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(138,11,all_users,all_users,,,,)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
{code}
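
For reference, the expected semantics can be sketched outside Pig. The Python below only illustrates what a left outer join on the full four-field key should return for the two sample data sets. It is not Pig's implementation, and the function name {{left_outer_join}} is made up for this sketch.

```python
# Illustrative model of LEFT OUTER JOIN semantics; not Pig's implementation.
from collections import defaultdict

first_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
    (108, 17, "all_users", "all_users"),
    (138, 11, "all_users", "all_users"),
]
second_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
]


def left_outer_join(left, right):
    """Join on the whole tuple, mirroring the BY clause in the report
    (user_id, gg_id, dimension_name, dimension_value)."""
    # Index the right-hand rows by their join key (the full tuple).
    index = defaultdict(list)
    for row in right:
        index[row].append(row)

    joined = []
    for row in left:
        matches = index.get(row)
        if matches:
            # One output row per matching right-hand row.
            joined.extend(row + match for match in matches)
        else:
            # No match: pad with nulls. Assumes both inputs share the
            # same schema, as they do in this report.
            joined.append(row + (None,) * len(row))
    return joined


expected = left_outer_join(first_period_users, second_period_users)
for row in expected:
    print(row)
```

With the sample data this yields four rows, one per row of first_period_users. That matches the regular LEFT JOIN output apart from row order.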

But if I add {{USING 'replicated'}}, the result is different: each matched row appears seven times instead of once, while the unmatched rows still appear once:
{code:sql}
$joined_periods_users = JOIN 
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value) 
USING 'replicated';
{code}
output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
(138,11,all_users,all_users,,,,)
{code}
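
To make the duplication easier to see, here is a small Python tally of the replicated output. The rows are copied from the output in this report; the variable names are made up for this sketch.

```python
from collections import Counter

# Output of the replicated join, copied from this report.
reported = (
    ["(108,11,all_users,all_users,108,11,all_users,all_users)"] * 7
    + ["(108,13,all_users,all_users,108,13,all_users,all_users)"] * 7
    + [
        "(108,17,all_users,all_users,,,,)",
        "(138,11,all_users,all_users,,,,)",
    ]
)

# Count how often each distinct row occurs.
counts = Counter(reported)
for row, n in counts.items():
    print(n, row)

print("total rows:", sum(counts.values()))
```

The tally shows 16 rows in total: the two matched rows seven times each, and the two unmatched rows once each. A left join of these inputs should return 4 rows, since every key in second_period_users is unique.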



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)