Posted to dev@pig.apache.org by "Ajay Garg (JIRA)" <ji...@apache.org> on 2008/07/08 12:20:32 UTC

[jira] Created: (PIG-296) UDF for cumulative statistics

UDF for cumulative statistics
-----------------------------

                 Key: PIG-296
                 URL: https://issues.apache.org/jira/browse/PIG-296
             Project: Pig
          Issue Type: Improvement
            Reporter: Ajay Garg
            Priority: Minor


UDF for computing cumulative sum, row, rank, and dense rank. Visit http://twiki.corp.yahoo.com/view/YResearch/PigStatisticsCumulative for a detailed description.

To use 
A = load 'data' using PigStorage as ( query, freq );
B = group A all;
C = foreach B {
    Ordered = order A by freq using numeric.OrderDescending;
    generate
        statistics.CUMULATIVE_COLUMN(Ordered, 1) as   -- Pig starts with 0th column, this refers to the column freq by offset
                ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank );
};
D = foreach C generate FLATTEN(A);


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-296) UDF for cumulative statistics

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611604#action_12611604 ] 

Pi Song commented on PIG-296:
-----------------------------

Generally it looks alright, but I cannot access the link. Could you please post it in a public location like PigWiki?



[jira] Commented: (PIG-296) UDF for cumulative statistics

Posted by "Ajay Garg (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611964#action_12611964 ] 

Ajay Garg commented on PIG-296:
-------------------------------

h2. Specification of Cumulative Sum, Row, Rank, Dense Rank



h2. {color:red}Cumulative Sum{color}

Useful for calculating cumulative distributions.

x[] is an ordered set of values

cumulative sum[1] = x[1]
cumulative sum[i] = cumulative sum[i-1] + x[i]
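
The recurrence above can be sketched in Python as a running total. This is only an illustration, not the code in the attached patch; the function name cumulative_sum is hypothetical, and Python lists are 0-based whereas the definition above is 1-based. The sample frequencies are the ones used in the tables in this comment.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Return running totals: out[0] = values[0], out[i] = out[i-1] + values[i]."""
    # accumulate() defaults to addition, which is exactly the recurrence.
    return list(accumulate(values))


# Frequencies already sorted in descending order.
freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(cumulative_sum(freqs))
```

The last element equals the total of all values, which is what makes the column usable for cumulative distributions.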

---
h2. {color:red}Row{color}

Label of the nth item in an ordered set.

x[] is an ordered set of values

i = 1;

row[i] = i

i++;


||query||freq||row||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|5|
|news|4,000|6|

---

h2. {color:red}Rank{color}

Useful for calculating Zipf distributions. Duplicate values of x result in the same rank value. Gaps in the sequence values for Rank occur following a run of duplicate values of x.

x[] is an ordered set of values

i = 1;

if (i == 1) {
    rank[1] = 1
} else if (x[i] == x[i-1]) {
    rank[i] = rank[i-1]
} else {
    rank[i] = i
}

i++

||query||freq||rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|6|
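
The Rank rule can be sketched in Python as follows. This is an illustrative version under the assumption that the input is already ordered; the function name rank is hypothetical and the attached patch may implement it differently.

```python
def rank(values):
    """Rank with gaps: equal neighbours share a rank, otherwise rank = position."""
    ranks = []
    for pos, x in enumerate(values):
        if pos > 0 and x == values[pos - 1]:
            # Duplicate of the previous value: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New value: the rank is the 1-based position, which creates
            # a gap after a run of duplicates.
            ranks.append(pos + 1)
    return ranks


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(rank(freqs))
```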

---

h2. {color:red}Dense Rank{color}

Useful for calculating top-N or bottom-N. Unlike Rank, there are no gaps in the sequence values for Dense Rank.

x[] is an ordered set of values

if (i == 1) {
    dense_rank[1] = 1
} else if (x[i] == x[i-1]) {
    dense_rank[i] = dense_rank[i-1]
} else {
    dense_rank[i] = dense_rank[i-1] + 1
}

The [i] and [i-1] references can be implemented with the current and previous values; an indexed array is not required.

||query||freq||dense rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|5|
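
A minimal Python sketch of Dense Rank that keeps only the current and previous values, as suggested above, rather than indexing into an array. The generator name dense_ranks is hypothetical and this is not the patch's implementation.

```python
def dense_ranks(values):
    """Yield dense ranks one at a time, remembering only the previous value."""
    previous = None
    current_rank = 0
    first = True
    for x in values:
        # The rank advances by exactly one whenever the value changes,
        # so no gaps appear after a run of duplicates.
        if first or x != previous:
            current_rank += 1
        first = False
        previous = x
        yield current_rank


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(list(dense_ranks(freqs)))
```

Because it is a generator, it can consume a stream of ordered values without holding them all in memory.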



[jira] Updated: (PIG-296) UDF for cumulative statistics

Posted by "Ajay Garg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ajay Garg updated PIG-296:
--------------------------

    Attachment: cumulative.patch

Patch attached.



[jira] Updated: (PIG-296) UDF for cumulative statistics

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-296:
-------------------------------

    Description: 
UDF for computing cumulative sum, row, rank, and dense rank.

To use 
A = load 'data' using PigStorage as ( query, freq );
B = group A all;
C = foreach B {
    Ordered = order A by freq using numeric.OrderDescending;
    generate
        statistics.CUMULATIVE_COLUMN(Ordered, 1) as   -- Pig starts with 0th column, this refers to the column freq by offset
                ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank );
};
D = foreach C generate FLATTEN(A);


  was:
UDF for computing cumulative sum, row, rank, and dense rank. Visit http://twiki.corp.yahoo.com/view/YResearch/PigStatisticsCumulative for a detailed description.

To use 
A = load 'data' using PigStorage as ( query, freq );
B = group A all;
C = foreach B {
    Ordered = order A by freq using numeric.OrderDescending;
    generate
        statistics.CUMULATIVE_COLUMN(Ordered, 1) as   -- Pig starts with 0th column, this refers to the column freq by offset
                ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank );
};
D = foreach C generate FLATTEN(A);





[jira] Updated: (PIG-296) UDF for cumulative statistics

Posted by "Ajay Garg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ajay Garg updated PIG-296:
--------------------------

    Attachment: newCumulative.patch

Updated patch attached (newCumulative.patch). It arranges the tuples in decreasing order before combining them, so that the order is maintained. Please take a look and comment.
Thanks
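
To illustrate the idea of ordering before combining, here is a hedged Python sketch (not the Java code in newCumulative.patch). It sorts the rows in decreasing order itself instead of trusting the input order, then appends the four statistics in one pass. The function name cumulative_columns is hypothetical; the column argument mirrors the 0-based offset passed to statistics.CUMULATIVE_COLUMN in the issue description.

```python
from operator import itemgetter


def cumulative_columns(rows, column):
    """Sort rows descending on `column`, then append
    (cumulative_sum, row, rank, dense_rank) to each row."""
    # Sort defensively: the caller's ordering may not have been preserved.
    ordered = sorted(rows, key=itemgetter(column), reverse=True)
    out = []
    total = 0
    rank = 0
    dense = 0
    previous = None
    for row_number, row in enumerate(ordered, start=1):
        value = row[column]
        total += value
        if row_number == 1 or value != previous:
            rank = row_number   # jumps past a run of duplicates
            dense += 1          # advances by one, leaving no gaps
        previous = value
        out.append(tuple(row) + (total, row_number, rank, dense))
    return out


# Deliberately unordered input, as might come out of an unordered bag.
data = [("yahoo", 5000), ("myspace", 20000), ("news", 4000),
        ("facebook", 15000), ("irs", 5000), ("answers", 10000)]
for record in cumulative_columns(data, 1):
    print(record)
```

Each output record has the shape ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank ), matching the schema named in the usage example.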



[jira] Commented: (PIG-296) UDF for cumulative statistics

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612136#action_12612136 ] 

Pi Song commented on PIG-296:
-----------------------------

The logic looks alright. 

Have you tested with very big bags? Our bag implementation doesn't maintain order (partly due to spills). That might cause trouble for you.
