Posted to dev@pig.apache.org by "Abhishek Agarwal (JIRA)" <ji...@apache.org> on 2014/06/24 21:36:24 UTC

[jira] [Commented] (PIG-3931) DUMP should limit how much data it emits

    [ https://issues.apache.org/jira/browse/PIG-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042572#comment-14042572 ] 

Abhishek Agarwal commented on PIG-3931:
---------------------------------------

+1 for having DUMP take a limit as an additional argument. It is certainly more convenient.

> DUMP should limit how much data it emits
> ----------------------------------------
>
>                 Key: PIG-3931
>                 URL: https://issues.apache.org/jira/browse/PIG-3931
>             Project: Pig
>          Issue Type: Improvement
>          Components: grunt, impl
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: dump, grunt, inline, limit, nested, operator
>
> The DUMP command is fairly dangerous: leave a stray DUMP uncommented after debugging your script on reduced data, and it will spew a terabyte of data into your console without apology.
> 1. By (configurable) default, DUMP should not emit more than 1MB of data
> 2. The DUMP statement should accept a limit on rows
> h3. Safety Valve limit on output size
> Pig should gain a pig.max_dump_bytes configuration variable that imposes an approximate upper bound on how much data DUMP will emit. Since a GROUP BY statement can generate an extremely large bag, this safety-valve limit should be expressed in bytes, not rows. I propose a default of 1,000,000 bytes -- good for about 1000 records of 1k each. Pig should emit a warning to the console if the max_dump_bytes limit is hit.
> This is a breaking change, but users shouldn't be using DUMP other than for experimentation. Pig should favor the experimentation use case, and let the foolhardy push the max_dump_bytes limit back up on their own.
> h3. DUMP can elegantly limit the number of rows
> Right now I have to write the following annoyingly wordy statement:
> {code}
> dumpable = LIMIT teams 10 ; DUMP dumpable;
> {code}
> One approach would be to allow DUMP to accept an inline (nested) operator. Assignment statements can have inline operators, but DUMP can't:
> {code}
> -- these work, which is so awesome:
> some = FOREACH (LIMIT teams 10) GENERATE team_id, park_id;
> some = GROUP (LIMIT teams 10) BY park_id;
> STORE (LIMIT teams 10) INTO '/tmp/some_teams';
> -- these don't work, but maybe they should:
> DUMP (LIMIT teams 10);
> DUMP (GROUP teams BY team_id);
> {code}
> Alternatively, DUMP could accept an argument:
> {code}
> dumpable = DUMP teams LIMIT 10;
> dumpable = DUMP teams LIMIT ALL;
> {code}
> The generated plan should be equivalent to the one produced by `some = LIMIT teams 10 ; DUMP some` so that the existing optimizations on LIMIT kick in.
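
The safety-valve behavior proposed in the quoted issue could look roughly like the sketch below. This is Python and purely illustrative -- only the pig.max_dump_bytes name and the 1,000,000-byte default come from the proposal; the function name, the per-record byte accounting, and the warning text are assumptions, not Pig's actual implementation.

```python
# Illustrative sketch of the proposed DUMP safety valve: emit records
# until an approximate byte budget (the proposed pig.max_dump_bytes,
# default 1,000,000) is exhausted, then warn and stop.

DEFAULT_MAX_DUMP_BYTES = 1_000_000  # default proposed in the issue


def dump(records, max_dump_bytes=DEFAULT_MAX_DUMP_BYTES, out=print):
    """Emit records until the byte budget is exhausted.

    Returns True if every record was emitted, False if output was
    truncated by the safety valve.
    """
    emitted = 0
    for count, record in enumerate(records):
        line = str(record)
        # Approximate byte accounting: encoded length plus the newline.
        emitted += len(line.encode("utf-8")) + 1
        if emitted > max_dump_bytes:
            out("WARN: DUMP output truncated after %d records "
                "(pig.max_dump_bytes = %d)" % (count, max_dump_bytes))
            return False
        out(line)
    return True
```

A real implementation would measure the serialized tuple size rather than a string rendering, but the control flow -- a running byte count checked before each emit, with a console warning on truncation -- is the same as what the proposal describes.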



--
This message was sent by Atlassian JIRA
(v6.2#6252)