
Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2011/02/01 20:08:29 UTC

[jira] Created: (PIG-1836) Accumulator like interface should be used with Pig operators after (co)group in certain cases

Accumulator like interface should be used with Pig operators after (co)group in certain cases
---------------------------------------------------------------------------------------------

                 Key: PIG-1836
                 URL: https://issues.apache.org/jira/browse/PIG-1836
             Project: Pig
          Issue Type: Improvement
            Reporter: Alan Gates


There are a number of cases where people (co)group their data and then pass it to an operator other than a foreach with a UDF, but where an accumulator-like interface would still make sense. A few examples:

{code}
C = group B by $0;
D = foreach C generate flatten(B);
...

C = group B by $0;
D = stream C through 'script.py';
...

C = group B by $0;
store C into 'output';
{code}

In all of these cases, the operator that follows the (co)group does not require all the data to be held in memory at once. There may be other such cases beyond these. Changing this part of the pipeline would greatly speed up these types of queries and make them less likely to die with out-of-memory errors.
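
As a rough illustration of the idea (not Pig's implementation), the sketch below feeds each group to a downstream consumer in small batches instead of materializing the whole bag first. Everything in it is an assumption made for the example: the names `batches`, `CollectingSink`, `run_after_group`, `accumulate`, and `cleanup` are hypothetical, and the input is assumed to arrive already ordered by the group key.

```python
from itertools import groupby, islice
from operator import itemgetter


def batches(values, batch_size):
    """Yield lists of at most batch_size items, reading values lazily."""
    iterator = iter(values)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class CollectingSink:
    """Hypothetical downstream operator (for example a store or a stream).

    It handles one batch at a time, so it never needs a whole group at once.
    """

    def __init__(self):
        self.written = []        # stands in for records written to output
        self.largest_batch = 0   # most records handed over in a single call
        self.finished_keys = []  # keys whose groups have been fully consumed

    def accumulate(self, key, batch):
        # Called repeatedly for the same key, once per batch.
        self.largest_batch = max(self.largest_batch, len(batch))
        self.written.extend(batch)

    def cleanup(self, key):
        # Called once after the last batch of a group.
        self.finished_keys.append(key)


def run_after_group(sorted_records, sink, batch_size=2):
    """Feed each group to the sink in batches.

    Assumes sorted_records is already ordered by its first field
    (the group key), so groupby sees each key exactly once.
    """
    for key, bag in groupby(sorted_records, key=itemgetter(0)):
        for batch in batches(bag, batch_size):
            sink.accumulate(key, batch)
        sink.cleanup(key)
```

With this shape, the number of records held at any moment is bounded by `batch_size` rather than by the size of the largest group, which is the property the three examples above could exploit.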


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (PIG-1836) Accumulator like interface should be used with Pig operators after (co)group in certain cases

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1836:
--------------------------------

    Fix Version/s: 0.10



[jira] [Updated] (PIG-1836) Accumulator like interface should be used with Pig operators after (co)group in certain cases

Posted by "Olga Natkovich (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1836:
--------------------------------

    Fix Version/s:     (was: 0.10)
    
