Posted to user@pig.apache.org by Tristan Croiset <tr...@croiset.com> on 2011/06/14 15:58:52 UTC
Sum normalization
Hi,
I'm looking to perform a sum normalization (divide each score by the sum of
all the scores in my data) with Pig.
1) My first problem is that I can't find a good way to do that.
Any suggestions?
I have a solution, but I'm not really proud of it...
------------------------------------------------------------------------------
score_list = LOAD 'scores' USING PigStorage(';')
    AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
    word,
    score,
    0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
    0 AS joinField,
    SUM(score_list.score) AS scoreTotal;

score_with_sum = JOIN score_list_ BY joinField, sum_score BY joinField;
out = FOREACH score_with_sum GENERATE word, (score / scoreTotal);
DUMP out;
------------------------------------------------------------------------------
2) Second, I think there is a strange bug.
With the code above, if the final FOREACH contains only "GENERATE word" (and
not the scores), the job goes into some kind of infinite loop (repeating
"Spilling map output: record full = true"... in the log).
thanks,
tristan
Re: Sum normalization
Posted by Tristan Croiset <tr...@croiset.com>.
2011/6/14 Daniel Dai <ji...@yahoo-inc.com>
> Take a look at Pig scalars:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars

Thanks! That's indeed what I needed.

> For the bug you found, would you mind opening a Jira ticket?

Sure.

Best,
tristan
Re: Sum normalization
Posted by Daniel Dai <ji...@yahoo-inc.com>.
Take a look at Pig scalars:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars
Try this query:
score_list = LOAD 'scores' USING PigStorage(';')
    AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
    word,
    score,
    0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
    0 AS joinField,
    SUM(score_list.score) AS scoreTotal;

out = FOREACH score_list_ GENERATE word, (score / sum_score.scoreTotal);
DUMP out;
For the bug you found, would you mind opening a Jira ticket?
Thanks,
Daniel
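
The scalar query in this message still carries the joinField column from the
original JOIN version, although nothing joins on it any more. A trimmed-down
sketch of the same idea follows. It assumes a Pig release that supports
casting relations to scalars (the linked page is from the r0.8.1 docs) and the
same 'scores' input and schema as above. The alias "normalized" is only an
illustrative name and does not come from the thread.

```pig
-- Sketch only: same input file and schema as the scripts in this thread.
score_list = LOAD 'scores' USING PigStorage(';')
    AS (word: chararray, score: double);

-- GROUP ... ALL puts every record into a single group,
-- so sum_score ends up with exactly one row.
group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
    SUM(score_list.score) AS scoreTotal;

-- sum_score.scoreTotal is read as a scalar, so no joinField or JOIN is needed.
-- "normalized" is an illustrative alias, not a name used in the thread.
out = FOREACH score_list GENERATE
    word,
    (score / sum_score.scoreTotal) AS normalized;

DUMP out;
```

The scalar form relies on sum_score holding a single row, which the GROUP ALL
followed by SUM provides here.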