Posted to dev@pig.apache.org by "Ashutosh Chauhan (JIRA)" <ji...@apache.org> on 2010/06/04 07:03:27 UTC
[jira] Created: (PIG-1437) [Optimization] Rewrite
GroupBy-Foreach-flatten(group) to Distinct
[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
-----------------------------------------------------------------
Key: PIG-1437
URL: https://issues.apache.org/jira/browse/PIG-1437
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1437) [Optimization] Rewrite
GroupBy-Foreach-flatten(group) to Distinct
Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875481#action_12875481 ]
Ashutosh Chauhan commented on PIG-1437:
---------------------------------------
Since this is a logical transformation of the query plan, the logical optimizer is the ideal place for this optimization. However, I think it might be easier to do on the MR plan after it is generated.
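To make the idea of a plan-level rewrite concrete, here is a minimal sketch in Python. It is only an illustration: the operator classes and the function rewrite_group_to_distinct are hypothetical names invented for this example and are not Pig's logical or MR plan classes. The rule fires only when the group keys are exactly the loaded fields and the following foreach projects nothing but the group key.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


# Hypothetical toy operators; these are NOT Pig's plan classes.
@dataclass(frozen=True)
class Load:
    fields: Tuple[str, ...]


@dataclass(frozen=True)
class GroupBy:
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class ForeachFlattenGroup:
    """Stands for `foreach ... generate flatten(group)`: only the group key is projected."""


@dataclass(frozen=True)
class Distinct:
    """Stands for `distinct`."""


Op = Union[Load, GroupBy, ForeachFlattenGroup, Distinct]


def rewrite_group_to_distinct(plan: List[Op]) -> List[Op]:
    """Replace Load -> GroupBy(all fields) -> ForeachFlattenGroup with Load -> Distinct."""
    out: List[Op] = []
    i = 0
    while i < len(plan):
        op = plan[i]
        nxt = plan[i + 1] if i + 1 < len(plan) else None
        prev = out[-1] if out else None
        if (
            isinstance(op, GroupBy)
            and isinstance(nxt, ForeachFlattenGroup)
            and isinstance(prev, Load)
            and op.keys == prev.fields  # grouping on every loaded field, same order
        ):
            out.append(Distinct())
            i += 2  # consume both the group-by and the foreach
        else:
            out.append(op)
            i += 1
    return out
```

A plan that groups on only some of the fields is left unchanged, because distinct over the full tuple would not give the same result.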
> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -----------------------------------------------------------------
>
> Key: PIG-1437
> URL: https://issues.apache.org/jira/browse/PIG-1437
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Ashutosh Chauhan
> Priority: Minor
>
> It's possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed more efficiently than group-by, this will be a huge win.
[jira] Updated: (PIG-1437) [Optimization] Rewrite
GroupBy-Foreach-flatten(group) to Distinct
Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-1437:
----------------------------
Assignee: Xuefu Zhang
Fix Version/s: 0.9.0
> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -----------------------------------------------------------------
>
> Key: PIG-1437
> URL: https://issues.apache.org/jira/browse/PIG-1437
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Ashutosh Chauhan
> Assignee: Xuefu Zhang
> Priority: Minor
> Fix For: 0.9.0
>
>
> It's possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed more efficiently than group-by, this will be a huge win.
[jira] Updated: (PIG-1437) [Optimization] Rewrite
GroupBy-Foreach-flatten(group) to Distinct
Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-1437:
----------------------------
Parent: PIG-1319
Issue Type: Sub-task (was: Bug)
> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -----------------------------------------------------------------
>
> Key: PIG-1437
> URL: https://issues.apache.org/jira/browse/PIG-1437
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Ashutosh Chauhan
> Priority: Minor
>
> It's possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed more efficiently than group-by, this will be a huge win.
[jira] Updated: (PIG-1437) [Optimization] Rewrite
GroupBy-Foreach-flatten(group) to Distinct
Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashutosh Chauhan updated PIG-1437:
----------------------------------
Release Note: (was: It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}
to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}
This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed more efficiently than group-by, this will be a huge win. )
Description:
It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}
to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}
This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed more efficiently than group-by, this will be a huge win.
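The equivalence claimed above, and the condition attached to it, can be checked on a small in-memory model. The following Python sketch is an assumption-laden illustration, not Pig code: the sample rows and the three function names are made up for this example. It shows that grouping on every field and emitting only the group key yields the same tuples as a distinct, and that the equivalence no longer holds once the bags are read (for example by a count).

```python
from collections import defaultdict

# Hypothetical sample input standing in for the 'data' relation with schema (name, age).
rows = [("alice", 30), ("bob", 25), ("alice", 30), ("bob", 41)]


def group_then_flatten(rows):
    """Mimic: B = group A by (name,age); C = foreach B generate flatten(group);"""
    bags = defaultdict(list)
    for row in rows:
        bags[row].append(row)  # the group key is the whole (name, age) tuple
    return set(bags)  # only the keys are emitted; the bags are never read


def distinct(rows):
    """Mimic: B = distinct A;"""
    return set(rows)


def group_then_count(rows):
    """Mimic a script that does read the bags, e.g. generate flatten(group), COUNT(A)."""
    bags = defaultdict(list)
    for row in rows:
        bags[row].append(row)
    return {key + (len(bag),) for key, bag in bags.items()}
```

Here group_then_flatten(rows) and distinct(rows) return the same set, while group_then_count(rows) carries per-group information that a plain distinct cannot reproduce, which is why the rewrite must be skipped in that case.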
> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -----------------------------------------------------------------
>
> Key: PIG-1437
> URL: https://issues.apache.org/jira/browse/PIG-1437
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Ashutosh Chauhan
> Priority: Minor
>
> It's possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed more efficiently than group-by, this will be a huge win.