Posted to dev@pig.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2018/02/15 20:05:00 UTC

[jira] [Commented] (PIG-4608) FOREACH ... UPDATE

    [ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366197#comment-16366197 ] 

Koji Noguchi commented on PIG-4608:
-----------------------------------

{quote}The idea is to make ".." syntax more flexible,
{quote}
I think one of the goals here is to let users manipulate records without using ".." at all.
 For the initial version, let's just focus on the basics. We can add more later, but of course changing things afterwards is always tough.

I don't want this jira to go stale after having such a great contribution from Will.
 I feel that having UPDATE and DROP with simple column (field) updates is a good start.

The only thing I'm not clear on is this:
{code:java}
/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;
{code}
Should this be {{UPDATE 1 with r+$2}}?
 To me, {{UPDATE $1}} reads as evaluating {{n=$1}} first and then updating the _n_th field.
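
To make the two readings concrete, here is a small sketch in Python (not Pig, and not how the patch is implemented). The function names and the sample row are hypothetical, and the constant 100 simply stands in for the value of {{r+$2}}.

```python
def update_by_literal_position(row, position, new_value):
    """Reading 1: the number itself names the field to overwrite (UPDATE 1 ...)."""
    updated = list(row)
    updated[position] = new_value
    return tuple(updated)


def update_by_field_value(row, position, new_value):
    """Reading 2: $position is evaluated first, and its value n picks the field."""
    n = row[position]  # n = $1
    updated = list(row)
    updated[n] = new_value
    return tuple(updated)


row = (7, 2, 9)  # hypothetical row; here $1 evaluates to 2
print(update_by_literal_position(row, 1, 100))  # (7, 100, 9)
print(update_by_field_value(row, 1, 100))       # (7, 2, 100)
```

The two readings give different results whenever the value of the field differs from its position, which is why the choice between {{$1}} and {{1}} matters.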

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>            Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large number of fields (in the 20-200 range). Often, we only need to modify a few fields. The FOREACH ... UPDATE statement allows the developer to focus on the actual logical changes instead of having to list all of the fields that are merely being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe this can be done with changes to the parser and the creation of a new LOUpdate. No physical plan changes should be needed because we will leverage what LOGenerate does.
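
The semantics described in the issue (overwrite fields whose alias matches, append the rest) can be modeled with a short Python sketch. This is only an illustration, not the proposed LOUpdate implementation; the function name {{foreach_update}} and the use of lambdas for expressions are assumptions. Evaluating expressions against the original row is inferred from the example output, where new_sum is 1+2=3 rather than 5+2.

```python
def foreach_update(rows, schema, updates):
    """Overwrite fields whose alias matches; append unmatched aliases at the end.

    Assumption: every expression sees the original (pre-update) row.
    """
    out_schema = list(schema)
    for alias in updates:
        if alias not in out_schema:
            out_schema.append(alias)

    out_rows = []
    for row in rows:
        original = dict(zip(schema, row))
        result = dict(original)
        for alias, expr in updates.items():
            result[alias] = expr(original)
        out_rows.append(tuple(result[alias] for alias in out_schema))
    return out_schema, out_rows


schema = ["f1", "f2", "f3"]
rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
updates = {
    "f1": lambda r: 5,                       # 5 as f1
    "new_sum": lambda r: r["f1"] + r["f2"],  # f1+f2 as new_sum
}

out_schema, out_rows = foreach_update(rows, schema, updates)
print(out_schema)  # ['f1', 'f2', 'f3', 'new_sum']
for out_row in out_rows:
    print(out_row)  # (5, 2, 3, 3) then (5, 3, 4, 5) then (5, 4, 5, 7)
```

Only the two changed fields are named; f2 and f3 pass through untouched, which is the point of the proposal for scripts with many fields.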


