Posted to dev@pig.apache.org by "Zbigniew Rzepka (JIRA)" <ji...@apache.org> on 2015/10/07 15:53:26 UTC
[jira] [Created] (PIG-4695) Using 'replicated' left join results in different result from regular left join.
Zbigniew Rzepka created PIG-4695:
------------------------------------
Summary: Using 'replicated' left join results in different result from regular left join.
Key: PIG-4695
URL: https://issues.apache.org/jira/browse/PIG-4695
Project: Pig
Issue Type: Bug
Affects Versions: 0.15.0
Reporter: Zbigniew Rzepka
There seems to be a difference in results between a regular LEFT JOIN and a replicated LEFT JOIN. This may occur only with very small data sets, as we are using the code shown below in production with correct results.
Example:
I have two data sets:
first_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
(108,17,all_users,all_users)
(138,11,all_users,all_users)
{code}
second_period_users:
{code}
(108,11,all_users,all_users)
(108,13,all_users,all_users)
{code}
When I use regular LEFT JOIN on these two I get the correct output:
{code:sql}
joined_periods_users = JOIN
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
{code}
output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(138,11,all_users,all_users,,,,)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
{code}
However, if I add {{USING 'replicated'}}, the result is different: each matched row is emitted seven times, while the unmatched rows still appear once:
{code:sql}
$joined_periods_users = JOIN
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value)
USING 'replicated';
{code}
output:
{code}
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
(138,11,all_users,all_users,,,,)
{code}
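As a sanity check on the expected semantics, the sketch below runs a reference left outer join over the two sample data sets from this report. It is plain Python, not Pig, and the function name {{left_outer_join}} is made up for illustration; it only assumes that all four fields form the join key, as in the JOIN statements above.

```python
# Reference left outer join over the sample tuples from this report.
# Plain Python, not Pig; the helper name is illustrative only.

first_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
    (108, 17, "all_users", "all_users"),
    (138, 11, "all_users", "all_users"),
]
second_period_users = [
    (108, 11, "all_users", "all_users"),
    (108, 13, "all_users", "all_users"),
]


def left_outer_join(left, right):
    """Join on the whole tuple, since all four fields are the key."""
    # Index the right-hand rows by key so each left row is looked up once.
    index = {}
    for row in right:
        index.setdefault(row, []).append(row)

    joined = []
    for row in left:
        matches = index.get(row)
        if matches:
            # One output row per matching right-hand row.
            for match in matches:
                joined.append(row + match)
        else:
            # No match: pad the right-hand side with nulls.
            joined.append(row + (None,) * len(row))
    return joined


expected = left_outer_join(first_period_users, second_period_users)
for row in expected:
    print(row)
```

Under these assumptions the join yields four rows, one per row of first_period_users, which agrees with the regular LEFT JOIN output. The replicated output above contains 16 rows.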