Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2011/10/06 17:50:42 UTC

Using 'collected' group

Hi guys,
It seems like our 'collected' option for group is pretty limited.
Imagine I have the following (silly example) script:

tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
text:chararray, ts:long);
happy_words = load 'happy_words' using HappyLoader() as (word:chararray);

ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
(ngram:chararray);

-- get only happy ngrams, using replicated to avoid MR step
happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';

-- find only happy tweets. We know ngrams that were exploded from a single tweet
-- must be in the same mapper still, so in theory this should work
happy_tweets = group happy_ngrams by (id, uid) using 'collected';


But this doesn't work, of course, because there's a whole mess of operators
between the load and the group, including a join, and nothing makes any
guarantees about (id, uid) being on the same mapper except for what the user
knows about the data.
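
To make the risk concrete, here is a minimal Python sketch (a simulation, not Pig code) of what a map-side 'collected' group amounts to: each mapper emits one group per contiguous run of equal keys in its own split, with no shuffle to merge groups across splits. The function names and sample rows are made up for illustration.

```python
from itertools import groupby

def collected_group(split, key):
    """Simulated map-side grouping: one group per contiguous run of equal keys in a split."""
    return [(k, list(rows)) for k, rows in groupby(split, key=key)]

def run(splits, key):
    """Run the simulated grouping independently on every split, as separate mappers would."""
    groups = []
    for split in splits:
        groups.extend(collected_group(split, key))
    return groups

# Rows are (id, uid, ngram); the group key is (id, uid).
key = lambda row: (row[0], row[1])

# Assumption holds: every tweet's rows sit contiguously inside one split.
good_splits = [
    [(1, 10, "so happy"), (1, 10, "happy day"), (2, 20, "great")],
    [(3, 30, "yay")],
]
# Assumption violated: tweet 1's rows straddle a split boundary.
bad_splits = [
    [(1, 10, "so happy")],
    [(1, 10, "happy day"), (2, 20, "great")],
]

good_keys = [k for k, _ in run(good_splits, key)]
bad_keys = [k for k, _ in run(bad_splits, key)]
print(good_keys)  # each key appears once
print(bad_keys)   # key (1, 10) appears twice: a silently split group
```

In the second case nothing fails; the output simply contains two partial groups for the same key, which is the kind of incorrect result the sanity checks are there to prevent.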

What's the right approach to let the user force this through?
a) this is an edge case optimization that's more trouble than it is worth
b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
disable sanity checks
c) using 'collected-its-cool-dmitriy-said-its-ok'
d) drop the checks altogether
e) something else?

D

Re: Using 'collected' group

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I agree with your sentiment, Thejas. Perhaps rather than calling it a new
group type, we can introduce a keyword that can be used in multiple places?
Something like:

c = group x by id using 'collected' __nosafety

(double underscores call attention to the fact that this is a keyword, and a
super-user feature at that)

Then we can use the same keyword to turn off checking for merge joins, etc,
on a per-call basis.

D

On Fri, Oct 7, 2011 at 4:29 PM, Thejas Nair <th...@hortonworks.com> wrote:

> I would vote for option C - i would like the user to sign off in each place
> the feature is used.
>
> pig scripts will be modified over time, and person making the edit might
> not notice that the checks are turned off elsewhere in the script. If it is
> set in a properties file, it could get inadvertently used. I think dealing
> with incorrect results is too expensive, and justifies this.
>
> -Thejas
>
>
>
> On 10/7/11 8:23 AM, Alan Gates wrote:
>
>> I would vote for Dmitriy's original option b, on a per feature basis.  I
>> know per feature switches are more cumbersome, but a "turn off all sanity
>> checks" option is dangerous.  When removing safeties it seems better to do
>> it one at a time.
>>
>> Alan.
>>
>> On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:
>>
>>> Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
>>> totally backwards, since if you are a dummy, the last thing you will do is
>>> use a little-known parameter to protect yourself... but I digress.
>>>
>>> Being able to set safety valves per-script seems like a good idea. Make it
>>> global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
>>> etc?)
>>>
>>> D
>>>
>>> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <ha...@apache.org> wrote:
>>>
>>>> One possibility is to introduce 'mode' in Pig with default value of
>>>> 'strict'. Other values being 'non-strict' or potentially others. Another use
>>>> case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently
>>>> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
>>>> you can still use it in most cases. I dont recall all the details but
>>>> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
>>>>
>>>> Ashutosh
>>>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>>>>
>>>>> Hi guys,
>>>>> It seems like our 'collected' option for group is pretty limited.
>>>>> Imagine I have the following (silly example) script:
>>>>>
>>>>> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
>>>>> text:chararray, ts:long);
>>>>> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>>>>>
>>>>> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
>>>>> (ngram:chararray);
>>>>>
>>>>> -- get only happy ngrams, using replicated to avoid MR step
>>>>> happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';
>>>>>
>>>>> -- find only happy tweets. We know ngrams that were exploded from a single tweet
>>>>> -- must be in the same mapper still, so in theory this should work
>>>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>>>>
>>>>>
>>>>> But this doesn't work, of course, because there's a whole mess of operators
>>>>> between the load and the group, including a join, and nothing makes any
>>>>> guarantees about (id, uid) being on the same mapper except for what the user
>>>>> knows about the data.
>>>>>
>>>>> What's the right approach to let the user force this through?
>>>>> a) this is an edge case optimization that's more trouble than it is worth
>>>>> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true to
>>>>> disable sanity checks
>>>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>>>> d) drop the checks altogether
>>>>> e) something else?
>>>>>
>>>>> D
>>>>>
>>>>>
>>>>
>>
>

Re: Using 'collected' group

Posted by Thejas Nair <th...@hortonworks.com>.
I would vote for option C - I would like the user to sign off in each
place the feature is used.

Pig scripts will be modified over time, and the person making the edit might
not notice that the checks are turned off elsewhere in the script. If it
is set in a properties file, it could get inadvertently used. I think
dealing with incorrect results is too expensive, and justifies this.

-Thejas


On 10/7/11 8:23 AM, Alan Gates wrote:
> I would vote for Dmitriy's original option b, on a per feature basis.  I know per feature switches are more cumbersome, but a "turn off all sanity checks" option is dangerous.  When removing safeties it seems better to do it one at a time.
>
> Alan.
>
> On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:
>
>> Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
>> totally backwards, since if you are a dummy, the last thing you will do is
>> use a little-known parameter to protect yourself... but I digress.
>>
>> Being able to set safety valves per-script seems like a good idea. Make it
>> global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
>> etc?)
>>
>> D
>>
>> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <ha...@apache.org> wrote:
>>
>>> One possibility is to introduce 'mode' in Pig with default value of
>>> 'strict'. Other values being 'non-strict' or potentially others. Another
>>> use
>>> case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently
>>> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
>>> you can still use it in most cases. I dont recall all the details but
>>> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
>>>
>>> Ashutosh
>>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy<dv...@gmail.com>  wrote:
>>>
>>>> Hi guys,
>>>> It seems like our 'collected' option for group is pretty limited.
>>>> Imagine I have the following (silly example) script:
>>>>
>>>> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
>>>> text:chararray, ts:long);
>>>> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>>>>
>>>> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
>>>> (ngram:chararray);
>>>>
>>>> -- get only happy ngrams, using replicated to avoid MR step
>>>> happy_ngrams = join ngrams by ngram, happy_words by word using
>>>> 'replicated';
>>>>
>>>> -- find only happy tweets. We know ngrams that were exploded from a single tweet
>>>> -- must be in the same mapper still, so in theory this should work
>>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>>>
>>>>
>>>> But this doesn't work, of course, because there's a whole mess of operators
>>>> between the load and the group, including a join, and nothing makes any
>>>> guarantees about (id, uid) being on the same mapper except for what the
>>>> user
>>>> knows about the data.
>>>>
>>>> What's the right approach to let the user force this through?
>>>> a) this is an edge case optimization that's more trouble than it is worth
>>>> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true to
>>>> disable sanity checks
>>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>>> d) drop the checks altogether
>>>> e) something else?
>>>>
>>>> D
>>>>
>>>
>


Re: Using 'collected' group

Posted by Alan Gates <ga...@hortonworks.com>.
I would vote for Dmitriy's original option b, on a per feature basis.  I know per feature switches are more cumbersome, but a "turn off all sanity checks" option is dangerous.  When removing safeties it seems better to do it one at a time.

Alan.

On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:

> Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
> totally backwards, since if you are a dummy, the last thing you will do is
> use a little-known parameter to protect yourself... but I digress.
> 
> Being able to set safety valves per-script seems like a good idea. Make it
> global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
> etc?)
> 
> D
> 
> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <ha...@apache.org> wrote:
> 
>> One possibility is to introduce 'mode' in Pig with default value of
>> 'strict'. Other values being 'non-strict' or potentially others. Another
>> use
>> case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently
>> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
>> you can still use it in most cases. I dont recall all the details but
>> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
>> 
>> Ashutosh
>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>> 
>>> Hi guys,
>>> It seems like our 'collected' option for group is pretty limited.
>>> Imagine I have the following (silly example) script:
>>> 
>>> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
>>> text:chararray, ts:long);
>>> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>>> 
>>> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
>>> (ngram:chararray);
>>> 
>>> -- get only happy ngrams, using replicated to avoid MR step
>>> happy_ngrams = join ngrams by ngram, happy_words by word using
>>> 'replicated';
>>> 
>>> -- find only happy tweets. We know ngrams that were exploded from a single tweet
>>> -- must be in the same mapper still, so in theory this should work
>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>> 
>>> 
>>> But this doesn't work, of course, because there's a whole mess of operators
>>> between the load and the group, including a join, and nothing makes any
>>> guarantees about (id, uid) being on the same mapper except for what the
>>> user
>>> knows about the data.
>>> 
>>> What's the right approach to let the user force this through?
>>> a) this is an edge case optimization that's more trouble than it is worth
>>> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true to
>>> disable sanity checks
>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>> d) drop the checks altogether
>>> e) something else?
>>> 
>>> D
>>> 
>> 


Re: Using 'collected' group

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
totally backwards, since if you are a dummy, the last thing you will do is
use a little-known parameter to protect yourself... but I digress.

Being able to set safety valves per-script seems like a good idea. Make it
global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
etc?)
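
A rough Python sketch of how global and per-feature switches could coexist (purely illustrative; this is not Pig code). The per-feature names follow the pig.strict.<feature> pattern floated above; the global "pig.strict" key, the is_strict helper, and the rule that a per-feature setting overrides the global one are assumptions made for the example.

```python
def is_strict(props, feature):
    """Return True if sanity checks should run for `feature`.

    Hypothetical resolution order: a per-feature setting wins,
    then the global setting, and the default is strict.
    """
    per_feature = props.get(f"pig.strict.{feature}")
    if per_feature is not None:
        return per_feature.lower() != "false"
    return props.get("pig.strict", "true").lower() != "false"

# Checks are disabled for collected group only; merge join stays strict.
props = {"pig.strict.collectedgroup": "false"}
print(is_strict(props, "collectedgroup"))
print(is_strict(props, "mergejoin"))
```

With this ordering, a script that turns off one check does not silently turn off the others.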

D

On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <ha...@apache.org> wrote:

> One possibility is to introduce 'mode' in Pig with default value of
> 'strict'. Other values being 'non-strict' or potentially others. Another
> use
> case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently
> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
> you can still use it in most cases. I dont recall all the details but
> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
>
> Ashutosh
> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>
> > Hi guys,
> > It seems like our 'collected' option for group is pretty limited.
> > Imagine I have the following (silly example) script:
> >
> > tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
> > text:chararray, ts:long);
> > happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
> >
> > ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
> > (ngram:chararray);
> >
> > -- get only happy ngrams, using replicated to avoid MR step
> > happy_ngrams = join ngrams by ngram, happy_words by word using
> > 'replicated';
> >
> > -- find only happy tweets. We know ngrams that were exploded from a single tweet
> > -- must be in the same mapper still, so in theory this should work
> > happy_tweets = group happy_ngrams by (id, uid) using 'collected';
> >
> >
> > But this doesn't work, of course, because there's a whole mess of operators
> > between the load and the group, including a join, and nothing makes any
> > guarantees about (id, uid) being on the same mapper except for what the
> > user
> > knows about the data.
> >
> > What's the right approach to let the user force this through?
> > a) this is an edge case optimization that's more trouble than it is worth
> > b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true to
> > disable sanity checks
> > c) using 'collected-its-cool-dmitriy-said-its-ok'
> > d) drop the checks altogether
> > e) something else?
> >
> > D
> >
>

Re: Using 'collected' group

Posted by Ashutosh Chauhan <ha...@apache.org>.
One possibility is to introduce 'mode' in Pig with default value of
'strict'. Other values being 'non-strict' or potentially others. Another use
case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently
PigStorage cannot guarantee all the requirements imposed by Merge Join, but
you can still use it in most cases. I don't recall all the details but
discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518

Ashutosh
On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hi guys,
> It seems like our 'collected' option for group is pretty limited.
> Imagine I have the following (silly example) script:
>
> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
> text:chararray, ts:long);
> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>
> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
> (ngram:chararray);
>
> -- get only happy ngrams, using replicated to avoid MR step
> happy_ngrams = join ngrams by ngram, happy_words by word using
> 'replicated';
>
> -- find only happy tweets. We know ngrams that were exploded from a single tweet
> -- must be in the same mapper still, so in theory this should work
> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>
>
> But this doesn't work, of course, because there's a whole mess of operators
> between the load and the group, including a join, and nothing makes any
> guarantees about (id, uid) being on the same mapper except for what the
> user
> knows about the data.
>
> What's the right approach to let the user force this through?
> a) this is an edge case optimization that's more trouble than it is worth
> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true to
> disable sanity checks
> c) using 'collected-its-cool-dmitriy-said-its-ok'
> d) drop the checks altogether
> e) something else?
>
> D
>