Posted to dev@pig.apache.org by "Ajay Garg (JIRA)" <ji...@apache.org> on 2008/07/08 12:20:32 UTC
[jira] Created: (PIG-296) UDF for cumulative statistics
UDF for cumulative statistics
-----------------------------
Key: PIG-296
URL: https://issues.apache.org/jira/browse/PIG-296
Project: Pig
Issue Type: Improvement
Reporter: Ajay Garg
Priority: Minor
UDF for computing cumulative sum, row, rank, and dense rank. See http://twiki.corp.yahoo.com/view/YResearch/PigStatisticsCumulative for a detailed description.
To use:
A = load 'data' using PigStorage as ( query, freq );
B = group A all;
C = foreach B {
    Ordered = order A by freq using numeric.OrderDescending;
    generate
        statistics.CUMULATIVE_COLUMN(Ordered, 1) as -- Pig starts with 0th column, this refers to the column freq by offset
        ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank );
};
D = foreach C generate FLATTEN(A);
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-296) UDF for cumulative statistics
Posted by "Pi Song (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611604#action_12611604 ]
Pi Song commented on PIG-296:
-----------------------------
Generally it looks alright, but I cannot access the link. Could you please post it in a public location such as the PigWiki?
[jira] Commented: (PIG-296) UDF for cumulative statistics
Posted by "Ajay Garg (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611964#action_12611964 ]
Ajay Garg commented on PIG-296:
-------------------------------
h2. Specification of Cumulative Sum, Row, Rank, Dense Rank
h2. {color:red}Cumulative Sum{color}
Useful for calculating cumulative distributions.
x[] is an ordered set of values
cumulative sum[1] = x[1]
cumulative sum[i] = cumulative sum[i-1] + x[i], for i > 1
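The recurrence can be sketched in a few lines of Python. This is only an illustration of the definition, not the code in the attached patch; the function name `cumulative_sum` and the sample frequencies (taken from the tables in this comment) are assumptions made for the example.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Running total: out[0] = values[0], out[i] = out[i-1] + values[i]."""
    return list(accumulate(values))


# Frequencies already ordered descending, as in the example tables.
freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(cumulative_sum(freqs))  # [20000, 35000, 45000, 50000, 55000, 59000]
```

Each output value depends only on the previous total and the current value, so a single pass over the ordered data is enough.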
---
h2. {color:red}Row{color}
Label of the nth item in an ordered set.
x[] is an ordered set of n values
for (i = 1; i <= n; i++) {
    row[i] = i
}
||query||freq||row||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|5|
|news|4,000|6|
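As a minimal sketch of the Row definition (the function name `row_numbers` is hypothetical and not taken from the patch), the label is simply the 1-based position in the ordered input, regardless of ties:

```python
def row_numbers(values):
    """Return the 1-based position of every item in an ordered sequence."""
    return [position for position, _ in enumerate(values, start=1)]


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(row_numbers(freqs))  # [1, 2, 3, 4, 5, 6]
```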
---
h2. {color:red}Rank{color}
Useful for calculating Zipf distributions. Duplicate values of x result in the same rank value. Gaps in the sequence values for Rank occur following a run of duplicate values of x.
x[] is an ordered set of n values
for (i = 1; i <= n; i++) {
    if (i == 1) {
        rank[1] = 1
    } else if (x[i] == x[i-1]) {
        rank[i] = rank[i-1]
    } else {
        rank[i] = i
    }
}
||query||freq||rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|6|
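The Rank rule above can be sketched as follows. This is an illustrative Python version under the assumption that the input is already ordered; the name `rank` is chosen for the example and does not come from the patch.

```python
def rank(values):
    """Ties share a rank; the rank after a run of ties jumps to the row number."""
    ranks = []
    previous = None
    for position, value in enumerate(values, start=1):
        if position > 1 and value == previous:
            ranks.append(ranks[-1])   # same value as the previous row: reuse its rank
        else:
            ranks.append(position)    # new value: rank equals the row number
        previous = value
    return ranks


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(rank(freqs))  # [1, 2, 3, 4, 4, 6]
```

The gap after the tie (4, 4, 6) matches the table above.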
---
h2. {color:red}Dense Rank{color}
Useful for calculating top-N or bottom-N. Unlike Rank, there are no gaps in the sequence values for Dense Rank.
x[] is an ordered set of n values
for (i = 1; i <= n; i++) {
    if (i == 1) {
        dense_rank[1] = 1
    } else if (x[i] == x[i-1]) {
        dense_rank[i] = dense_rank[i-1]
    } else {
        dense_rank[i] = dense_rank[i-1] + 1
    }
}
Note: [i] and [i-1] can be represented using current and previous values; an indexed array is not required.
||query||freq||dense rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|5|
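A minimal Python sketch of Dense Rank, keeping only the current and previous values rather than an indexed array. The function name `dense_rank` is an assumption for the example; the patch itself may be structured differently.

```python
def dense_rank(values):
    """Ties share a rank; the rank increases by exactly one for each new value."""
    ranks = []
    previous = None
    current_rank = 0
    for position, value in enumerate(values):
        if position == 0 or value != previous:
            current_rank += 1         # first row or a new value: next rank, no gap
        ranks.append(current_rank)
        previous = value
    return ranks


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(dense_rank(freqs))  # [1, 2, 3, 4, 4, 5]
```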
[jira] Updated: (PIG-296) UDF for cumulative statistics
Posted by "Ajay Garg (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ajay Garg updated PIG-296:
--------------------------
Attachment: cumulative.patch
Patch attached.
[jira] Updated: (PIG-296) UDF for cumulative statistics
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-296:
-------------------------------
Description:
UDF for computing cumulative sum, row, rank, and dense rank.
To use:
A = load 'data' using PigStorage as ( query, freq );
B = group A all;
C = foreach B {
    Ordered = order A by freq using numeric.OrderDescending;
    generate
        statistics.CUMULATIVE_COLUMN(Ordered, 1) as -- Pig starts with 0th column, this refers to the column freq by offset
        ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank );
};
D = foreach C generate FLATTEN(A);
was:
UDF for computing cumulative sum, row, rank, and dense rank. See http://twiki.corp.yahoo.com/view/YResearch/PigStatisticsCumulative for a detailed description.
To use:
A = load 'data' using PigStorage as ( query, freq );
B = group A all;
C = foreach B {
    Ordered = order A by freq using numeric.OrderDescending;
    generate
        statistics.CUMULATIVE_COLUMN(Ordered, 1) as -- Pig starts with 0th column, this refers to the column freq by offset
        ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank );
};
D = foreach C generate FLATTEN(A);
[jira] Updated: (PIG-296) UDF for cumulative statistics
Posted by "Ajay Garg (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ajay Garg updated PIG-296:
--------------------------
Attachment: newCumulative.patch
Updated patch attached (newCumulative.patch), which arranges the tuples in decreasing order before combining them, so that the order is maintained. Please take a look and comment.
Thanks
[jira] Commented: (PIG-296) UDF for cumulative statistics
Posted by "Pi Song (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612136#action_12612136 ]
Pi Song commented on PIG-296:
-----------------------------
The logic looks alright.
Have you tested with very big bags? Our bag implementation doesn't maintain order (partly due to spills). That might cause trouble for you.
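The ordering concern can be illustrated with a small Python sketch. It assumes one possible mitigation, namely sorting the tuples inside the function instead of trusting the order in which the bag hands them over. The function `cumulative_column` and its tuple layout are hypothetical stand-ins for the example, not the UDF from the patch.

```python
import random


def cumulative_column(rows, column):
    """Sort rows by `column` descending, then append
    (cumulative sum, row, rank, dense rank) to every tuple."""
    ordered = sorted(rows, key=lambda r: r[column], reverse=True)
    result = []
    total = 0
    rank = 0
    dense = 0
    previous = None
    for position, r in enumerate(ordered, start=1):
        value = r[column]
        total += value
        if position == 1 or value != previous:
            rank = position   # rank jumps to the row number after a run of ties
            dense += 1        # dense rank advances by one, leaving no gaps
        result.append(r + (total, position, rank, dense))
        previous = value
    return result


rows = [("myspace", 20000), ("facebook", 15000), ("answers", 10000),
        ("yahoo", 5000), ("irs", 5000), ("news", 4000)]

# Simulate a bag that returns its tuples in arbitrary order.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

# The statistics columns are the same for both input orders.
stats = lambda out: [r[1:] for r in out]
print(stats(cumulative_column(rows, 1)) == stats(cumulative_column(shuffled, 1)))  # True
```

Tuples with equal values may come out in either relative order, so only the numeric columns are compared here; the statistics themselves do not depend on which tied tuple comes first.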