Posted to dev@pig.apache.org by "Gianmarco De Francisci Morales (Created) (JIRA)" <ji...@apache.org> on 2011/11/04 23:29:53 UTC

[jira] [Created] (PIG-2353) RANK function like in SQL

RANK function like in SQL
-------------------------

                 Key: PIG-2353
                 URL: https://issues.apache.org/jira/browse/PIG-2353
             Project: Pig
          Issue Type: New Feature
            Reporter: Gianmarco De Francisci Morales


Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281472#comment-13281472 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

My current view is that non-partitioned rank is simply a group by + a rank UDF.
So there is no need for a separate implementation of it unless we have some performance gain. Maybe we need something specific if we plan to support nested rank.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Assigned] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allan Avendaño reassigned PIG-2353:
-----------------------------------

    Assignee: Allan Avendaño  (was: Gianmarco De Francisci Morales)
    
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281954#comment-13281954 ] 

Daniel Dai commented on PIG-2353:
---------------------------------

So partitioned and non-partitioned RANK use different implementations, right?
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Ashutosh Chauhan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173431#comment-13173431 ] 

Ashutosh Chauhan commented on PIG-2353:
---------------------------------------

I was also thinking about this problem of implementing statistical measures (like top-K, median, quantiles, etc.) efficiently in a distributed manner that is amenable to the MR framework. Rank is the basis of it. I came up with a similar outline to yours; you have laid it out well. I think this is pretty useful to have in Pig, and these are the kind of features that a higher-level language like Pig should make available to its users. Sophisticated users will expect this, and it will drive adoption.
+1 for distributed implementation of RANK in Pig.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2353:
----------------------------

    Labels: gsoc2012  (was: )
    
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2012
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Apurv Verma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237987#comment-13237987 ] 

Apurv Verma commented on PIG-2353:
----------------------------------

Hello,
I am an undergraduate student from India and I would be interested in working on this as a GSoC project. I have a beginner level knowledge of writing map-reduce tasks so would need help with it. I have understood the algorithm which Gianmarco has outlined in the comments.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2012
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481817#comment-13481817 ] 

Olga Natkovich commented on PIG-2353:
-------------------------------------

Yes, I think that's fine - I did not realize it was covered in a separate JIRA, thanks!
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>             Fix For: 0.11
>
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG-2353-5.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173405#comment-13173405 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

My idea would be to have a distributed implementation of RANK in the following manner:

Run a Map-only job with n mappers; each mapper just counts the records in its input split and accumulates the count in an internal variable (or alternatively uses dynamic counters).
At the end, we have a map(partition_id => number_of_records).
This map is small enough to be put in the distributed cache.
Compute the cumulative sum of the record counts.
Then launch a second Map-only job with exactly n mappers; each will read its input split and the cumulative number of records preceding it, initialize its counter with this value, and finally RANK the records as they come in.

This would be a distributed implementation of RANK that could scale very well.
I haven't figured out how to integrate it into Pig yet.
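
The outline above can be sketched in a few lines of Python. This is only a single-process illustration of the two passes, not code from any patch: the splits are plain lists, and the function names (count_records, starting_offsets, rank_split, distributed_rank) are hypothetical.

```python
from itertools import accumulate


def count_records(splits):
    """Pass 1: each 'mapper' counts the records in its own split."""
    return [len(split) for split in splits]


def starting_offsets(counts):
    """Cumulative number of records preceding each split."""
    return [0] + list(accumulate(counts))[:-1]


def rank_split(split, offset):
    """Pass 2: one 'mapper' ranks its split, continuing after the preceding records."""
    return [(offset + i, record) for i, record in enumerate(split, start=1)]


def distributed_rank(splits):
    counts = count_records(splits)      # small map: partition -> number of records
    offsets = starting_offsets(counts)  # what would be shipped via the distributed cache
    ranked = []
    for split, offset in zip(splits, offsets):
        ranked.extend(rank_split(split, offset))
    return ranked


# Three input splits of different sizes (made-up data).
splits = [["a", "b", "c"], ["d"], ["e", "f"]]
ranked = distributed_rank(splits)
```

The second pass needs no communication between mappers, because each one only needs its own offset; that is what would let the approach scale.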
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281934#comment-13281934 ] 

Daniel Dai commented on PIG-2353:
---------------------------------

You mean the global rank is implemented by group all + UDF? Do we have a plan for a distributed implementation in this project?
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399500#comment-13399500 ] 

Allan Avendaño commented on PIG-2353:
-------------------------------------

Current implementation is now available for your review at https://reviews.apache.org/r/5523/diff/#index_header
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2353-2, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281959#comment-13281959 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

Yes, partitioned rank can be simply group by + UDF.
Global rank should follow the implementation blueprint that I outlined in this Jira, or something similar to make it fully scalable.
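
A minimal Python sketch of the "group by + UDF" idea for partitioned rank follows. It assumes each group's bag fits in one place; rank_bag stands in for a hypothetical rank UDF, and partitioned_rank and its parameters are invented for illustration.

```python
from collections import defaultdict


def rank_bag(bag):
    """Stand-in for a rank UDF: number the tuples of one (sorted) bag from 1."""
    return [(i,) + tuple(t) for i, t in enumerate(bag, start=1)]


def partitioned_rank(records, key_index, sort_index):
    # GROUP BY: collect the records of each partition into its own bag.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    # Apply the UDF to each bag after sorting it on the ranking column.
    return {
        key: rank_bag(sorted(bag, key=lambda t: t[sort_index]))
        for key, bag in groups.items()
    }


records = [("x", 30), ("y", 5), ("x", 10), ("y", 7), ("x", 20)]
by_key = partitioned_rank(records, key_index=0, sort_index=1)
```

Each bag is ranked by a single call, so one very large group (or a global rank done as "group all") would be handled by one worker; that is the scalability concern behind using a different approach for global rank.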
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280877#comment-13280877 ] 

Allan Avendaño commented on PIG-2353:
-------------------------------------

Hi everybody,

I am working on this functionality for GSoC 2012, with Gianmarco as my mentor.
I have been working on the syntax, and the following syntax, recommended by Gianmarco, is now recognized:

RANK <relation> ( BY <column> (ASC|DESC)? )?

I was also looking for other functionality that could be incorporated. In SQL Server, Oracle and PostgreSQL [1][2][3], it is also possible to specify a "partition" (ranking over a specific group) in the same rank operation. Gianmarco has already pointed out to me that this could cause performance problems.


Looking forward to your feedback and suggestions.

References:

[1] http://msdn.microsoft.com/en-us/library/ms176102.aspx
[2] http://www.techonthenet.com/oracle/functions/rank.php
[3] http://www.postgresql.org/docs/9.1/static/tutorial-window.html
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-2353:
------------------------------------------------

    Assignee: Gianmarco De Francisci Morales

Assigning to myself as per Apache guidelines as I'd like to mentor this.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


        

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Jonathan Coveney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-2353:
----------------------------------

    Attachment: PIG2353.patch

This provides a rank function, which needs a sorted input. Equal tuples will get increasing rank numbers.
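
To illustrate what "equal tuples will get increasing rank numbers" means, here is a small Python sketch, not taken from the attached patch. It contrasts that row-number style with the way SQL's RANK treats ties (tied rows share a rank and the following rank is skipped). Both function names are hypothetical.

```python
def row_number(sorted_values):
    """Every element gets the next integer, even when values are equal."""
    return list(range(1, len(sorted_values) + 1))


def sql_style_rank(sorted_values):
    """Equal neighbours share a rank; the rank after a tie skips ahead."""
    ranks = []
    for position, value in enumerate(sorted_values, start=1):
        if position > 1 and value == sorted_values[position - 2]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(position)    # otherwise the rank is the position
    return ranks


values = [10, 20, 20, 30]  # already sorted, with one tie
```

For the list above, row_number gives 1, 2, 3, 4 while sql_style_rank gives 1, 2, 2, 4.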
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-2353:
------------------------------------------------

    Labels: gsoc2012 mentor  (was: gsoc2012)
    
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


        

[jira] [Resolved] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales resolved PIG-2353.
-------------------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.11
     Release Note: 
Pig includes a new RANK operator:
RANK <relation> ( BY <column> (ASC|DESC)? )?
This operator prepends a consecutive integer to each tuple in the relation starting from 1.
If the BY clause is present, RANK sorts the relation before ranking it, otherwise it uses the order in which it receives the relation (e.g. the order in which the relation is stored if RANK is performed right after a LOAD).
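
As a rough illustration of the behaviour described in this release note, the following Python sketch models a relation as a list of tuples. It is not Pig code; the function rank and its parameters by and descending are hypothetical, and tie handling is deliberately ignored.

```python
def rank(relation, by=None, descending=False):
    """Prepend a consecutive integer, starting from 1, to each tuple."""
    rows = list(relation)
    if by is not None:
        # BY clause present: sort on the given column before ranking.
        rows = sorted(rows, key=lambda t: t[by], reverse=descending)
    # No BY clause: keep the order in which the rows were received.
    return [(i,) + tuple(t) for i, t in enumerate(rows, start=1)]


relation = [("carol", 3), ("alice", 1), ("bob", 2)]
as_loaded = rank(relation)
ascending = rank(relation, by=1)
descending = rank(relation, by=1, descending=True)
```

Without a BY argument the ranks simply follow the input order, which mirrors ranking directly after a LOAD.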



+1 for me.
Passes local tests and manual testing.
Committed to trunk.

Thanks Allan!
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>             Fix For: 0.11
>
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG-2353-5.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173078#comment-13173078 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

Hi Jonathan,
thanks for giving it a try!

I think the approach is fine for an initial implementation.
To scale it out, we need a deeper integration with Pig (i.e. it needs to be an operator and not a UDF), but this is the subject for another Jira.

Just one more comment.
I am not sure about testing in piggybank.
Should we use e2e testing instead of JUnit?

                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173438#comment-13173438 ] 

Jonathan Coveney commented on PIG-2353:
---------------------------------------

So Gianmarco, are you thinking this sort of syntax:

{code}
A = <some relation>
B = RANK A BY <column name> <ASC | DESC>;
{code}

I.e., it'd just follow the order syntax, but add the rank to the end?

And I assume your n-mapper job would run after the relation has already been sorted, right? So first rank would run the order by, and then it would run the two jobs that would actually append the rank?
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173987#comment-13173987 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

Actually I was thinking that RANK would only do the counting and appending.
This way you could get a sort + rank with
{code}
B = RANK ( ORDER A BY <column> ASC);
{code}

But you could also get your dataset from file and rank it directly, without any specific order
{code}
A = LOAD 'path/to/file';
B = RANK A;
C = ORDER B BY <column>;
{code}

This, for example, gives you the permutation that was used to sort the dataset, which might be useful.
Also, RANK would allow you to create a data column that reflects the ordering that you have in your data.
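
The permutation idea can be seen in a small Python sketch that mimics the second snippet above (rank in load order, then sort). The data and the name rank_then_order are made up, and the rank is placed first in each tuple purely for illustration.

```python
def rank_then_order(values):
    # B = RANK A: number the records in the order they were loaded.
    ranked = [(i, v) for i, v in enumerate(values, start=1)]
    # C = ORDER B BY <column>: sort on the data column, carrying the rank along.
    return sorted(ranked, key=lambda t: t[1])


values = ["pear", "apple", "fig"]
ordered = rank_then_order(values)
# Reading the rank column top to bottom gives the permutation that sorted the data.
permutation = [r for r, _ in ordered]
```

Here the rank column of the sorted result reads 2, 3, 1: the second loaded record comes first, then the third, then the first.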
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Allan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239880#comment-13239880 ] 

Allan commented on PIG-2353:
----------------------------

Dear all,

Let me introduce myself: I am Allan Avendaño, a student in the Master in Computing Engineering in Rome. I am interested in working on the RANK function like in SQL [#PIG-2353] for GSoC.

I have been working with the MR paradigm for three years, mainly on two research projects (which were part of undergraduate projects).

One aimed to analyze the incidence of navigability factors on a network of university websites, by creating an inverse correlation among them through links and citations.
My undergraduate project aimed to solve that navigability problem by establishing relations among the websites according to the terms used and topics. It was really interesting to interleave some MR phases (with some modifications to Mahout code) and Pig.

I have been following the activity on this feature, and also an initial approach at [#PIG-821]. I think dense rank and nth-tile, along with other statistical inference operations, could also be useful.

Really thankful for your guidance and comments.

Best Regards
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allan Avendaño updated PIG-2353:
--------------------------------

    Attachment: PIG-2353-4.txt

All unit and e2e tests passed. 
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229114#comment-13229114 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

Thanks Daniel, I am excited this Jira is going to be a candidate for GSoC, I was going to propose it myself!
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2012
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


        

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allan Avendaño updated PIG-2353:
--------------------------------

    Attachment: PIG-2353-3.txt

Code generated so far.
It passed all JUnit and e2e tests on a cluster.
New code has been documented.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481703#comment-13481703 ] 

Allan Avendaño commented on PIG-2353:
-------------------------------------

Hi Olga!

Does PIG-2947 apply as release notes? 
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>             Fix For: 0.11
>
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG-2353-5.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174412#comment-13174412 ] 

Jonathan Coveney commented on PIG-2353:
---------------------------------------

I see no reason why it couldn't do both. The grammar syntax could be
{code}
RANK relation ( BY column )?
{code}
such that if you specify the column to rank by, it'll sort it, but if you don't, it just ranks it in the order it got it.
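
To make the proposed optional BY clause concrete, here is a minimal Python sketch of the behaviour described above (not Pig code; the function name rank and the by parameter are hypothetical, and ties are simply numbered in order). Without a column the rows are numbered as received; with a column they are sorted first.

```python
def rank(relation, by=None):
    """Prepend a 1-based consecutive integer to each tuple.

    by is an optional column index. If it is given, the rows are sorted
    on that column first; otherwise they are numbered as received.
    """
    rows = list(relation)
    if by is not None:
        # Corresponds to the "BY column" branch of the proposed grammar.
        rows = sorted(rows, key=lambda row: row[by])
    return [(number,) + tuple(row) for number, row in enumerate(rows, start=1)]
```

With rows [("b", 2), ("a", 1)], rank(rows) keeps the input order and yields [(1, "b", 2), (2, "a", 1)], while rank(rows, by=0) yields [(1, "a", 1), (2, "b", 2)].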
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281166#comment-13281166 ] 

Daniel Dai commented on PIG-2353:
---------------------------------

We can use secondary sort to implement partitioned rank. However, I think partitioned rank and non-partitioned rank may have to adopt a totally different implementation. We can focus on non-partitioned rank first.
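
A rough single-machine Python sketch of the secondary-sort idea for partitioned rank follows (not Pig code; partitioned_rank and its parameters are hypothetical). Sorting on the composite key (partition, order) stands in for the secondary sort, so each partition arrives contiguous and already ordered, and the counter restarts at every partition boundary. Ties are numbered consecutively here, which is an assumption rather than something the comment specifies.

```python
from itertools import groupby
from operator import itemgetter


def partitioned_rank(rows, partition_col, order_col):
    """Number rows from 1 within each partition, ordered by order_col."""
    # Composite-key sort: plays the role of the MapReduce secondary sort.
    ordered = sorted(rows, key=itemgetter(partition_col, order_col))
    ranked = []
    # groupby sees each partition as one contiguous run because of the sort.
    for _, group in groupby(ordered, key=itemgetter(partition_col)):
        for number, row in enumerate(group, start=1):
            ranked.append((number,) + tuple(row))
    return ranked
```

For rows [("x", 5), ("y", 1), ("x", 2)] partitioned on column 0 and ordered on column 1, the result is [(1, "x", 2), (2, "x", 5), (1, "y", 1)].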
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "David Ciemiewicz (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194966#comment-13194966 ] 

David Ciemiewicz commented on PIG-2353:
---------------------------------------

There is a much more efficient way to compute RANK, DENSE_RANK, CUMULATIVE_SUM and more if you have billions of rows of data, especially if the data follows a power law/Zipf distribution (like queries do). It involves using Map-Reduce to compute a histogram of the frequencies/counts and then serializing and sorting the histogram, which is something like 20,000 rows for 1B queries.

https://issues.apache.org/jira/browse/PIG-821
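
The histogram idea above can be sketched in a few lines of Python (an illustration only, not the PIG-821 code; rank_table and its field names are hypothetical). The histogram has one entry per distinct value, so only that small table needs sorting; rank, dense rank and a running count all fall out of one pass over it. The sketch uses a running row count where the comment mentions a cumulative sum.

```python
from collections import Counter


def rank_table(values, descending=True):
    """Map each distinct value to its rank, dense rank and cumulative count."""
    # Stand-in for the Map-Reduce step: distinct value -> number of rows.
    histogram = Counter(values)
    table = {}
    rows_before = 0
    # Only the distinct values are sorted, not the full dataset.
    for dense, value in enumerate(sorted(histogram, reverse=descending), start=1):
        count = histogram[value]
        table[value] = {
            "rank": rows_before + 1,            # SQL-style rank, gaps after ties
            "dense_rank": dense,                # no gaps
            "cumulative_count": rows_before + count,
        }
        rows_before += count
    return table
```

Each row of the full dataset can then look up its value in the table (a join against a small relation), instead of the whole dataset going through a global sort.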
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2353:
----------------------------

    Description: 
Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.

This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

  was:Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.

    
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2012
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


        

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-2353:
------------------------------------------------

    Release Note: 
Pig includes a new RANK operator:
RANK <relation> ( BY <column> (ASC|DESC)? (DENSE)? )?
This operator prepends a consecutive integer to each tuple in the relation starting from 1.
If the BY clause is present, RANK sorts the relation before ranking it, otherwise it uses the order in which it receives the relation (e.g. the order in which the relation is stored if RANK is performed right after a LOAD).
The DENSE modifier produces a dense rank, which has no gaps in it regardless of ties.



  was:
Pig includes a new RANK operator:
RANK <relation> ( BY <column> (ASC|DESC)? )?
This operator prepends a consecutive integer to each tuple in the relation starting from 1.
If the BY clause is present, RANK sorts the relation before ranking it, otherwise it uses the order in which it receives the relation (e.g. the order in which the relation is stored if RANK is performed right after a LOAD).
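
To illustrate the difference the DENSE modifier makes, here is a minimal Python sketch (not Pig code; rank_by and its parameters are hypothetical). It assumes that, as in SQL, tied rows share a rank and that the non-dense rank then skips ahead, which the release note implies but does not spell out.

```python
def rank_by(relation, column, descending=False, dense=False):
    """Sort on one column and prepend a rank; ties share the same rank."""
    rows = sorted(relation, key=lambda row: row[column], reverse=descending)
    ranked = []
    previous = object()  # sentinel that compares unequal to any real value
    current_rank = 0
    for position, row in enumerate(rows, start=1):
        if row[column] != previous:
            # Dense rank advances by one; plain rank jumps to the row position.
            current_rank = current_rank + 1 if dense else position
            previous = row[column]
        ranked.append((current_rank,) + tuple(row))
    return ranked
```

For rows [("a", 5), ("b", 5), ("c", 7)] ranked on column 1, the plain rank gives 1, 1, 3 and the dense rank gives 1, 1, 2.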



    
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>             Fix For: 0.11
>
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG-2353-5.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447590#comment-13447590 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

There is a regression in the latest patch.
It does not work properly in a multi-machine environment.
It seems that the values of the counters are not properly serialized in the JobConf.
We need to add a test and fix the bug before committing the patch.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174413#comment-13174413 ] 

Jonathan Coveney commented on PIG-2353:
---------------------------------------

Weird, the above got garbled and I can't edit it, but I think the idea is clear.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173394#comment-13173394 ] 

Jonathan Coveney commented on PIG-2353:
---------------------------------------

Is there anything in the piggybank using the e2e? I'm not sure what the word is about piggybank and when to use e2e. Someone else will have to weigh in on that.

As far as that other JIRA...you should make it and link it, though I'm curious what benefit/optimization you foresee RANK having if it has access to Pig's internals.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.


        

[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allan Avendaño updated PIG-2353:
--------------------------------

    Attachment: PIG-2353-5.txt

New approach to setting the counter in the MapReduce job with the PORank operator, applied only if there are PORank operator(s) at the roots.

Two tests added (ten in total) with complex scripts that exercise particular scenarios.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG-2353-5.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allan Avendaño updated PIG-2353:
--------------------------------

    Attachment: PIG-2353-2

Here, the rank operator is fully implemented (rank, dense rank and row number); I'm now working on refactoring, tests and documentation.

Looking forward to your comments.
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2353-2, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281943#comment-13281943 ] 

Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------

No, sorry, there is a typo in my previous comment.
What I meant is that partitioned rank is only group by + UDF.
The main aim of this project is a distributed implementation of the global RANK, which needs to be implemented from scratch.
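
One common way to approach a distributed global rank is a two-phase scheme, sketched below in Python (an illustration under assumptions, not a description of the attached patches; global_rank is a hypothetical name). It assumes the partitions are already in global order. Each task first counts its rows, the counts are turned into per-partition offsets with a prefix sum, and each task then adds its offset to a local row number.

```python
from itertools import accumulate


def global_rank(partitions):
    """Assign gap-free 1-based numbers across already-ordered partitions."""
    # Phase 1: every task reports how many rows it holds (a per-task counter).
    counts = [len(partition) for partition in partitions]
    # Exclusive prefix sum: number of rows held by all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Phase 2: each task numbers its own rows and shifts them by its offset.
    return [
        [(offset + number, row) for number, row in enumerate(partition, start=1)]
        for partition, offset in zip(partitions, offsets)
    ]
```

For partitions [["a", "b"], ["c"], ["d", "e"]] the offsets are 0, 2 and 3, so the rows are numbered 1 to 5 with no gaps even though no task ever sees the whole dataset.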
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012


       

[jira] [Commented] (PIG-2353) RANK function like in SQL

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481537#comment-13481537 ] 

Olga Natkovich commented on PIG-2353:
-------------------------------------

Can you please add a usage example to the release notes section? Thanks!
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>             Fix For: 0.11
>
>         Attachments: PIG-2353-2, PIG-2353-3.txt, PIG-2353-4.txt, PIG-2353-5.txt, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header


[jira] [Updated] (PIG-2353) RANK function like in SQL

Posted by "Allan Avendaño (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allan Avendaño updated PIG-2353:
--------------------------------

    Description: 
Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.

This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header

  was:
Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.

This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

    
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Allan Avendaño
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2353-2, PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, increasing identifier without gaps, like what RANK does for SQL.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012
> Functionality implemented so far, is available at https://reviews.apache.org/r/5523/diff/#index_header
