Posted to user@pig.apache.org by Tristan Croiset <tr...@croiset.com> on 2011/06/14 15:58:52 UTC

Sum normalization

Hi,

I'm looking to perform a sum normalization (divide each score by the sum of
all scores in my data) with Pig.

1) My first problem is that I can't find a good way to do that.
Any suggestions?

I have an answer but I'm not really proud of it...
------------------------------------------------------------------------------
score_list = LOAD  'scores' USING PigStorage(';')
  AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
  word,
  score,
  0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
  0 AS joinField,
  SUM(score_list.score) as scoreTotal;

score_with_sum = JOIN score_list_ BY joinField, sum_score BY joinField;
out = FOREACH score_with_sum GENERATE word, (score / scoreTotal);
DUMP out;
------------------------------------------------------------------------------

2) Secondly, I think there is a strange bug.
With the code above, if the final FOREACH has only "GENERATE word" (without
the scores), the job goes into some kind of infinite loop (repeating
"Spilling map output: record full = true"... in the log).


thanks,

tristan

Re: Sum normalization

Posted by Tristan Croiset <tr...@croiset.com>.
2011/6/14 Daniel Dai <ji...@yahoo-inc.com>

> Take a look at Pig scalars:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars
>

Thanks! That's indeed what I needed.


> For the bug you found, would you mind opening a Jira ticket?
>

Sure.

bests,

tristan


Re: Sum normalization

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Take a look at Pig scalars:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars

Try this query:
score_list = LOAD  'scores' USING PigStorage(';')
   AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
   word,
   score,
   0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
   0 AS joinField,
   SUM(score_list.score) as scoreTotal;

out = FOREACH score_list_ GENERATE word, (score / sum_score.scoreTotal);
dump out;
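
For reference, here is a minimal sketch of the same scalar approach with the
joinField columns dropped, since nothing joins on them any more. It assumes
Pig 0.8 or later (the release the linked scalar documentation covers) and the
same ';'-delimited 'scores' input; the alias "normalized" is only an
illustrative name and is not part of the query above.

```pig
-- Sketch only: same input as the query above ('scores', ';'-delimited).
score_list = LOAD 'scores' USING PigStorage(';')
   AS (word: chararray, score: double);

-- GROUP ... ALL puts every record into one group, so SUM yields a single row.
group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
   SUM(score_list.score) AS scoreTotal;

-- sum_score has exactly one row, so sum_score.scoreTotal can be read as a scalar.
-- 'normalized' is an illustrative alias, not from the original query.
out = FOREACH score_list GENERATE
   word,
   (score / sum_score.scoreTotal) AS normalized;
DUMP out;
```

The scalar projection relies on sum_score being a single-row relation, which
GROUP ... ALL guarantees here; as far as I know, Pig raises an error at runtime
if the relation used as a scalar holds more than one row.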

For the bug you found, would you mind opening a Jira ticket?

Thanks,
Daniel
