Posted to user@pig.apache.org by "Handerson, Steven K." <St...@idearc.com> on 2008/07/28 21:34:16 UTC

Newbie question -- iterating over tuples

Folks,

Is there a way to do something akin to map (of map/reduce) over a tuple?
The input file is lines like this:

category word1 word2 ...

So simplest is to read it as a tuple (PigStorage ' '), but then I want
to iterate over the words, which are $1 ... <whatever>, and create a bag
of tuples, say:
(word, category)

I know I'm being lazy not writing code at this point, but I think Pig
should be flexible enough to do what I want, ideally.

Tuples might also want something like a Perl "shift" (front) or "pop"
(back) -- so you could kind of manually shift distinguished values off
the front, and then treat the rest of the tuple as a list of similar
elements.

Anybody?

-- Steve
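
For readers following along, the transformation asked about above can be sketched in plain Java. This is only an illustration of the intended result, not Pig code: the class and method names (TuplePairs, pairWordsWithCategory) are hypothetical, and the input is assumed to be one whitespace-delimited line with the category in the first field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: models "shift the first field off,
// then map over the rest" on a single input line.
final class TuplePairs {

    /** Pairs every word after the first field with the first field (the category). */
    static List<List<String>> pairWordsWithCategory(String line) {
        List<String> fields = Arrays.asList(line.trim().split("\\s+"));
        List<List<String>> pairs = new ArrayList<>();
        // An empty line splits into a single empty string; treat it as no fields.
        if (fields.get(0).isEmpty()) {
            return pairs;
        }
        String category = fields.get(0);                        // the "shifted" head
        List<String> words = fields.subList(1, fields.size());  // the rest of the tuple
        for (String word : words) {
            pairs.add(List.of(word, category));                 // (word, category)
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(pairWordsWithCategory("sports ball bat"));
    }
}
```

A line with a category but no words yields an empty list, which corresponds to an empty bag.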




RE: Newbie question -- nested foreach/generate

Posted by "Handerson, Steven K." <St...@idearc.com>.
 
Alan,

Ok, great.  Thanks for the reply.

I was concerned because the documentation seems to indicate
that this was a design choice -- "note that we disallow nested
foreach..generates, because that would allow nesting to arbitrary depths."
But I'm glad that's not really the case, that it's just NYI.

I solved the particular problem by writing an EvalFunc that does
the necessary iterations -- but of course it would be great if 
PigLatin supported this directly, eventually.

-- Steve


-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Tuesday, July 29, 2008 10:54 AM
To: pig-user@incubator.apache.org
Subject: Re: Newbie question -- nested foreach/generate

The goal is certainly to support full nesting in foreach.  So eventually 
anything that you can do at the top level will be doable inside a 
foreach, including another foreach with additional nesting.  We just 
haven't gotten to it yet.

Alan.


Re: Newbie question -- nested foreach/generate

Posted by Alan Gates <ga...@yahoo-inc.com>.
The goal is certainly to support full nesting in foreach.  So eventually 
anything that you can do at the top level will be doable inside a 
foreach, including another foreach with additional nesting.  We just 
haven't gotten to it yet.

Alan.


RE: Newbie question -- nested foreach/generate

Posted by "Handerson, Steven K." <St...@idearc.com>.
Folks:

Why does Pig Latin not allow nested foreach/generates?
Arbitrary nesting of data is already "allowed" constructively --
why not allow operations over arbitrarily nested structures?
It increases locality, and therefore parallelism;
otherwise you have to hack things with additional groups / joins
(which add unnecessary extra computation, slowing Pig programs down).

Example:
In naive Bayes, you extract features from documents (with categories) and group by feature,
to get feature counts for the various categories.
What you want to do next is (simple version) just sum the totals for
all categories, and then divide the category-specific counts by the total.
In pig:

we have {(feature1, {(category1, count), (category2, count)...}),
         (feature2, {(category1, count), (category2, count)...}) ... }
we want to make
{(feature1, total, {(category1, count), (category2, count)...}) ... }
and then
{(feature1, {(category1, count/total), (category2, count/total)...}) ... }

Why can't this be done like this?
foreach feature {
  total = SUM(categorytuples.count);
  categoryprobs = foreach categorytuples {
    generate category, count/total;
  }
  generate feature, categoryprobs;
}

I'm afraid people are taking "map/reduce" too literally, and thinking
every map has to have a reduce.  This is a case where all you need is a map,
which of course is very parallelizable.

I suppose it makes things a little complicated for the planner in the above example,
because the optimizer needs to know that it needs to do two maps over categorytuples,
one to get the total and the other to apply it. But the code spells out how it's done;
really, all the planner needs to do is know that it can't optimize this further.

-- Steve
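
As a rough illustration of the two passes described above (sum the counts, then divide each count by the total), here is a plain-Java sketch. It does not use the Pig API; the class name CategoryProbs, the method normalize, and the nested-map representation of the bags are assumptions made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the nested foreach: feature -> (category -> count)
// becomes feature -> (category -> count/total).
final class CategoryProbs {

    static Map<String, Map<String, Double>> normalize(Map<String, Map<String, Long>> countsByFeature) {
        Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Long>> feature : countsByFeature.entrySet()) {
            // First pass over the inner bag: total = SUM(categorytuples.count)
            long total = 0;
            for (long count : feature.getValue().values()) {
                total += count;
            }
            // Second pass over the same bag: generate category, count/total
            Map<String, Double> probs = new LinkedHashMap<>();
            for (Map.Entry<String, Long> entry : feature.getValue().entrySet()) {
                double prob = (total == 0) ? 0.0 : (double) entry.getValue() / total;
                probs.put(entry.getKey(), prob);
            }
            result.put(feature.getKey(), probs);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> goalCounts = new LinkedHashMap<>();
        goalCounts.put("sports", 3L);
        goalCounts.put("politics", 1L);
        Map<String, Map<String, Long>> counts = new LinkedHashMap<>();
        counts.put("goal", goalCounts);
        System.out.println(normalize(counts));
    }
}
```

Each feature is processed independently of the others, which is the map-only property the message argues for: no grouping or join across features is needed.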

    

RE: Newbie question -- iterating over tuples

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
You can contribute the function that you wrote back to the Piggybank and
also use functions available there:

http://wiki.apache.org/pig/PiggyBank

Olga 

> -----Original Message-----
> From: Handerson, Steven K. [mailto:Steven.Handerson@idearc.com] 
> Sent: Monday, July 28, 2008 1:15 PM
> To: pig-user@incubator.apache.org
> Subject: RE: Newbie question -- iterating over tuples
> 
>  
> Ok, actually I pretty easily wrote an Eval function,
> PairFirstWithRest.  Cool!
> 
> Still, I think more of these kinds of things should be in a
> library; of course you should be *able* to write code and
> integrate, but a system that you can trick into doing what
> you want (out of the box) is better.
> 
> Here it is, as a donation:
> 
> public class PairFirstWithRest extends EvalFunc<DataBag> {
>     @Override
>     public void exec(Tuple input, DataBag output) throws IOException {
>         Datum first = input.getField(0);
>         for (int i=1; i<input.arity(); i++) {
>             ArrayList<Datum> list = new ArrayList<Datum>();
>             list.add(first);
>             list.add(input.getField(i));
>             output.add(new Tuple(list));
>         }
>     }
> }
> 
> -- Steve
> 
> 

RE: Newbie question -- iterating over tuples

Posted by "Handerson, Steven K." <St...@idearc.com>.
 
Ok, actually I pretty easily wrote an Eval function,
PairFirstWithRest.  Cool!

Still, I think more of these kinds of things should be in a library;
of course you should be *able* to write code and integrate,
but a system that you can trick into doing what you want (out of the
box) is better.

Here it is, as a donation:

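// Imports assume the early, Datum-based Pig API that this code was written against.
import java.io.IOException;
import java.util.ArrayList;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Datum;
import org.apache.pig.data.Tuple;

public class PairFirstWithRest extends EvalFunc<DataBag> {
    // Emits one (first, field) tuple for every field after the first,
    // i.e. (category, word) for the input described earlier in the thread.
    @Override
    public void exec(Tuple input, DataBag output) throws IOException {
        Datum first = input.getField(0);
        for (int i = 1; i < input.arity(); i++) {
            ArrayList<Datum> list = new ArrayList<Datum>();
            list.add(first);
            list.add(input.getField(i));
            output.add(new Tuple(list));
        }
    }
}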

-- Steve


-----Original Message-----
From: Olga Natkovich [mailto:olgan@yahoo-inc.com] 
Sent: Monday, July 28, 2008 3:39 PM
To: pig-user@incubator.apache.org
Subject: RE: Newbie question -- iterating over tuples

You would need a custom load function to do this.

Olga 


RE: Newbie question -- iterating over tuples

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
You would need a custom load function to do this.

Olga 
