You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "David Ciemiewicz (JIRA)" <ji...@apache.org> on 2011/02/20 01:41:38 UTC

[jira] Commented: (PIG-821) simulate NTILE(n) , rank() functionality in pig

    [ https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996938#comment-12996938 ] 

David Ciemiewicz commented on PIG-821:
--------------------------------------

I tried doing an NTILE and generic UDF for computing cummulative statistics such as decile, rank, denserank, etc for working on query streams.  With a billion or more queries, the serialization of sequentially processing every single row was a bit slow.

What I realized was that the frequency histogram for something like a "powerlaw" distribution of a billion queries would be pretty compact (something like only 20K unique frequencies).  Computing sequential commulative statistics on 20K, 100K, or even 1M rows is far more efficient than doing it for >1B rows.

So I used map reduce to compute the frequency histogram, compute the deciles on the frequency histogram, then map them back to the individual queries which then exploits the parallelism of map-reduce.

With the introduction of Jython scripting for UDFs in Pig 0.8, it makes it pretty simple to distribute the example.  Furthermore, once inline Jython scripts are implemented (as requested in another JIRA, this will be even more convenient.

Here's the Pig:
{code}

register 'ntiles.py' using jython as ntiles;

Queries = load 'queries.txt' using PigStorage() as ( query: chararray, freq: long );

FreqGroup = group Queries by ( freq );

FreqHisto = foreach FreqGroup generate
	group as freq,
	COUNT(Queries) as count;

FreqHistoAll = group FreqHisto all;

CummStats = foreach FreqHistoAll {
	FreqHistoOrdered = order FreqHisto by freq desc;
	generate
	FLATTEN(ntiles.freqhisto_cummstats(FreqHistoOrdered, 10));
	};

QueryDeciles = join Queries by freq, CummStats by freq;

QueryDeciles = order QueryDeciles by freq desc, query;

store QueryDeciles into 'query-deciles.txt' using PigStorage();
{code}

Here's the Jython function for computing the cumulative stats on the frequency historgram:
{code}

@outputSchema("freqhisto_cummstats:bag{stats:tuple(freq:long,count:long,total:long,proportion:double,cummcount:long,cummproportion:double,ntile:long,rank:long,denserank:long)}")
def freqhisto_cummstats(freqhisto, ntiles):
  outBag = []
  totalcount = 0
  for freqcount in freqhisto:
    freq = freqcount[0]
    count = freqcount[1]
    totalcount += freq * count
  cummcount = 0
  ntile = 1
  rank = 1
  denserank = 1
  for freqcount in freqhisto:
    freq = freqcount[0]
    count = freqcount[1]
    cummcount += freq * count
    proportion = float(freq) * count / totalcount
    cummproportion = float(cummcount) / totalcount
    if cummproportion >= float(ntile)/ntiles:
      ntile += 1
      if ntile > ntiles:
        ntile = ntiles
    tup=(freq, count, totalcount, proportion, cummcount, cummproportion, ntile, rank, denserank)
    outBag.append(tup)
    rank += 1
    denserank += freq * count
  return outBag
{code}

Here's a sample Perl program to generate "faux" query log data:
{code}

#! /usr/bin/perl

use Digest::MD5 qw(md5 md5_hex md5_base64);
$MAX = 1001;

foreach $i ( 1 .. $MAX ) {
    foreach $j ( 1 .. $i ) {
	$digest = md5_base64($i*$MAX+$j);
	print join("\t", $digest, $MAX-$i), "\n";
    }
}
{code}

> simulate NTILE(n) , rank() functionality in pig
> -----------------------------------------------
>
>                 Key: PIG-821
>                 URL: https://issues.apache.org/jira/browse/PIG-821
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.2.0
>         Environment: mithril gold -gateway 4000
>            Reporter: Rekha
>
> Hi,
> I came across a job which has some processing which I cant seem to get easily over-the-counter from pig.
> These are NTILE() /rank() operations available in oracle.
> While I am trying to write a UDF, that is not working out too well for me yet.. :(
> I have a ntile(n) over (partititon by x, y, z order by a desc, b desc) operation to be done in pig scripts.
> Is there a default function in pig scripting which can do this?
> For example, lets consider a simple example at http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm
> So here, how would we ideally substitute NTILE() with? any pig counterpart function/udf?
> SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) 
>    AS quartile FROM employees
>    WHERE department_id = 100;
>  
> LAST_NAME                     SALARY   QUARTILE
> ------------------------- ---------- ----------
> Greenberg                      12000          1
> Faviet                          9000          1
> Chen                            8200          2
> Urman                           7800          2
> Sciarra                         7700          3
> Popp                            6900          4
>  
> In real case, i have ntile over multiple columns, so ideal way to find histograms/boundary/spitting out the bucket number is needed.
> Similarly a pig function is required for rank() over(partition by a,b,c order by d desc) as e
> Please let me know soon.
> Thanks & Regards,
> /Rekha

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira