Posted to user@pig.apache.org by Josh Devins <in...@joshdevins.net> on 2010/11/14 15:29:32 UTC

Implementation of ORDER and LIMIT

Hi all,

I'm happily using Pig to ORDER BY and LIMIT some large relations quite
effectively. However I'm curious about how these are/would be implemented in
"raw" MapReduce. Can anyone shed some light/point to some details, examples
or pseudo-code somewhere?

Cheers,

Josh

Re: Implementation of ORDER and LIMIT

Posted by Josh Devins <jo...@gmail.com>.
Thanks Alan,

It was indeed a purely academic question. I've had no issues at all with
LIMIT or ORDER BY in Pig. I'm a happy Pig user ;)

Cheers,

Josh


On 15 November 2010 21:56, Alan Gates <ga...@yahoo-inc.com> wrote:

> POSort is only used for in-memory sorts of bags (such as a sort inside a
> foreach), not for top-level sorts.  In both cases the physical operators only
> capture part of the actual operations, since much of the work is done by the
> Hadoop framework.
>
> Very briefly, order by works by taking a sample of the input, building a
> partitioner that will produce a balanced total ordering of the data (that
> is, each part file will be approximately the same size), and then running an
> MR job that uses the order-by key as the grouping key along with the
> just-built partitioner.  Limit works by applying the limit in each mapper and
> then running a reduce pass in a single reducer, again applying the limit.
>
> Are these questions purely academic, or are there applications where you'd
> like to use Pig's order and limit but can't do the other processing in
> Pig?  If the latter, I'd recommend checking out the new mapreduce command
> introduced in 0.8 (which we'll release here in a week or two, I hope), which
> allows you to invoke MR jobs from Pig.  You can learn more about this at
> https://issues.apache.org/jira/browse/PIG-506.  You can also see the
> documentation for this feature in
> http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup
> (search on MAPREDUCE).  Sorry, this is the Forrest version.  You can also
> see it in HTML by checking out the code and building it yourself.
>
> Alan.

Re: Implementation of ORDER and LIMIT

Posted by Alan Gates <ga...@yahoo-inc.com>.
POSort is only used for in-memory sorts of bags (such as a sort inside a
foreach), not for top-level sorts.  In both cases the physical operators
only capture part of the actual operations, since much of the work is
done by the Hadoop framework.

Very briefly, order by works by taking a sample of the input, building
a partitioner that will produce a balanced total ordering of the data
(that is, each part file will be approximately the same size), and then
running an MR job that uses the order-by key as the grouping key along
with the just-built partitioner.  Limit works by applying the limit in
each mapper and then running a reduce pass in a single reducer, again
applying the limit.
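
For readers who want the pseudo-code asked for at the top of the thread, here is a minimal Python sketch of the two strategies described above. It is a single-process simulation, not Pig's or Hadoop's actual code: the function names (sample_cut_points, partition, order_by, limit), the sample size, and the choice of evenly spaced sample quantiles as split points are all assumptions made for illustration.

```python
import bisect
import random


def sample_cut_points(records, num_reducers, sample_size=100, seed=0):
    """Pick up to num_reducers - 1 split keys from a random sample of the input."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    if not sample:
        return []
    # Evenly spaced quantiles of the sample approximate quantiles of the input,
    # which is what keeps the partitions roughly the same size.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def partition(key, cut_points):
    """Range partitioner: every key sent to reducer i sorts before reducer i+1's keys."""
    return bisect.bisect_right(cut_points, key)


def order_by(records, num_reducers=4):
    """Simulate a total-order sort: sample, range-partition, sort each partition."""
    cut_points = sample_cut_points(records, num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for key in records:  # "map" side: route each key to its range
        parts[partition(key, cut_points)].append(key)
    # "reduce" side: in real MapReduce the framework sorts keys per reducer.
    return [sorted(part) for part in parts]


def limit(input_splits, n):
    """Simulate LIMIT: cap each mapper's output at n, then cap again in one reducer."""
    mapper_outputs = [split[:n] for split in input_splits]  # per-mapper limit
    merged = [rec for out in mapper_outputs for rec in out]  # single reducer input
    return merged[:n]
```

Concatenating the lists returned by order_by in partition order gives the fully sorted data, which is the "total ordering" property. Passing those sorted partitions to limit returns the first n records of the global order while each simulated mapper forwards at most n records. One simplification to be aware of: this sketch sends all copies of a key to the same partition, so a heavily repeated key would unbalance it.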

Are these questions purely academic, or are there applications where
you'd like to use Pig's order and limit but can't do the other
processing in Pig?  If the latter, I'd recommend checking out the new
mapreduce command introduced in 0.8 (which we'll release here in a
week or two, I hope), which allows you to invoke MR jobs from Pig.  You
can learn more about this at https://issues.apache.org/jira/browse/PIG-506.
You can also see the documentation for this feature in
http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup
(search on MAPREDUCE).  Sorry, this is the Forrest version.  You can
also see it in HTML by checking out the code and building it yourself.

Alan.

On Nov 15, 2010, at 12:50 AM, Rekha Joshi wrote:

> Hi Josh,
>
> AFAIR, all relational operators reside in the PO*.java sources under
> o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
> Alternatively, check POLimit and POSort under http://pig.apache.org/docs/r0.7.0/api/
>
> PigServer is the starting point; internally it builds the logical and
> physical plans for the jobs. The execution engine then executes the
> job. Refer to the files under o.a.p.backend.hadoop.executionengine.
> More details are at http://wiki.apache.org/pig/PigExecutionModel
>
> Thanks & Regards,
> /Rekha.


IntelliJ

Posted by Dhananjay Ragade <dr...@linkedin.com>.
Does anyone know of a plugin for Pig in IntelliJ (or an alternative plugin
that, if assigned to handle ".pig" files, does a decent enough job of syntax
highlighting)?

thanks,
Dhananjay


Re: Implementation of ORDER and LIMIT

Posted by Rekha Joshi <re...@yahoo-inc.com>.
Hi Josh,

AFAIR, all relational operators reside in the PO*.java sources under o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
Alternatively, check POLimit and POSort under http://pig.apache.org/docs/r0.7.0/api/

PigServer is the starting point; internally it builds the logical and physical plans for the jobs. The execution engine then executes the job. Refer to the files under o.a.p.backend.hadoop.executionengine.
More details are at http://wiki.apache.org/pig/PigExecutionModel

Thanks & Regards,
/Rekha.

On 11/14/10 7:59 PM, "Josh Devins" <in...@joshdevins.net> wrote:

Hi all,

I'm happily using Pig to ORDER BY and LIMIT some large relations quite
effectively. However I'm curious about how these are/would be implemented in
"raw" MapReduce. Can anyone shed some light/point to some details, examples
or pseudo-code somewhere?

Cheers,

Josh