Posted to user@pig.apache.org by "Handerson, Steven K." <St...@idearc.com> on 2008/07/29 08:52:59 UTC

RE: Newbie question -- nested foreach/generate

Folks:

Why does Pig Latin not allow nested foreach/generates?
Arbitrary nesting of data is already "allowed" constructively --
why not allow operations over arbitrarily nested structures?
It increases locality, and therefore parallelism;
without it, you have to hack things together with additional groups / joins,
which add unnecessary extra computation and slow Pig programs down.

Example:
In naïve Bayes, you extract features from documents (with categories) and group by feature,
to get per-category counts for each feature.
What you want to do next (in the simple version) is just sum the counts across
all categories, and then divide each category-specific count by that total.
In Pig:

we have:
{ (feature1, {(category1, count), (category2, count), ...}),
  (feature2, {(category1, count), (category2, count), ...}), ... }
we want to make:
{ (feature1, total, {(category1, count), (category2, count), ...}), ... }
and then:
{ (feature1, {(category1, count/total), (category2, count/total), ...}), ... }

Why can't this be done like this?
foreach feature {
  total = SUM(categorytuples.count);
  categoryprobs = foreach categorytuples {
    generate category, count/total;
  }
  generate feature, categoryprobs;
}
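For reference, the workaround with an extra group and join that the above is meant to avoid looks roughly like this; the relation and field names are hypothetical, and the exact syntax varies across Pig versions:

```pig
-- Hypothetical flat input relation: counts: (feature, category, count)
grouped = GROUP counts BY feature;

-- One pass to compute the per-feature totals...
totals = FOREACH grouped GENERATE group AS feature, SUM(counts.count) AS total;

-- ...then a join to bring each total back alongside its rows...
joined = JOIN counts BY feature, totals BY feature;

-- ...and a final pass to divide each count by its feature's total.
probs = FOREACH joined GENERATE counts::feature, counts::category,
        (double)counts::count / totals::total AS prob;
```

The join is exactly the extra computation complained about above: it forces another shuffle even though each feature's bag already holds everything needed.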

I'm afraid people are taking "map/reduce" too literally, and thinking
every map has to have a reduce.  This is a case where all you need is a map,
which of course is very parallelizable.

I suppose this makes things a little complicated for the planner in the above example,
because the optimizer needs to know to make two passes over categorytuples:
one to compute the total, and the other to apply it.  But the code spells out how it's done;
really, all the planner needs to know is that it can't optimize this further.

-- Steve

    

RE: Newbie question -- nested foreach/generate

Posted by "Handerson, Steven K." <St...@idearc.com>.
 
Alan,

Ok, great.  Thanks for the reply.

I was concerned because the documentation seems to indicate
that this was a design choice -- "note that we disallow nested
foreach..generates, because that would allow nesting to arbitrary depths."
But I'm glad that's not really the case; it's just not yet implemented.

I solved the particular problem by writing an EvalFunc that does
the necessary iterations -- but of course it would be great if 
PigLatin supported this directly, eventually.
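
The EvalFunc itself isn't shown in the message, but calling such a UDF from Pig Latin would look roughly like this; the jar path and the NormalizeCounts name are hypothetical:

```pig
-- Jar containing the UDF (path hypothetical)
REGISTER myudfs.jar;

-- grouped: (feature, {(category, count), ...}), the output of GROUP ... BY feature
-- NormalizeCounts is a hypothetical EvalFunc that sums the counts in the bag
-- and returns {(category, count/total), ...}
probs = FOREACH grouped GENERATE group AS feature,
        myudfs.NormalizeCounts(counts) AS categoryprobs;
```

This keeps the whole computation in a single map pass over each group, which is the locality argument made in the original message.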

-- Steve


-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Tuesday, July 29, 2008 10:54 AM
To: pig-user@incubator.apache.org
Subject: Re: Newbie question -- nested foreach/generate

The goal is certainly to support full nesting in foreach.  So eventually 
anything that you can do at the top level will be doable inside a 
foreach, including another foreach with additional nesting.  We just 
haven't gotten to it yet.

Alan.


Re: Newbie question -- nested foreach/generate

Posted by Alan Gates <ga...@yahoo-inc.com>.
The goal is certainly to support full nesting in foreach.  So eventually 
anything that you can do at the top level will be doable inside a 
foreach, including another foreach with additional nesting.  We just 
haven't gotten to it yet.

Alan.
