Posted to user@pig.apache.org by Josh Devins <in...@joshdevins.net> on 2010/11/14 15:29:32 UTC
Implementation of ORDER and LIMIT
Hi all,
I'm happily using Pig to ORDER BY and LIMIT some large relations quite
effectively. However I'm curious about how these are/would be implemented in
"raw" MapReduce. Can anyone shed some light/point to some details, examples
or pseudo-code somewhere?
Cheers,
Josh
Re: Implementation of ORDER and LIMIT
Posted by Josh Devins <jo...@gmail.com>.
Thanks Alan,
It was indeed a purely academic question. I've had no issues at all with the
limits or order by not working in Pig. I'm a happy Pig user ;)
Cheers,
Josh
On 15 November 2010 21:56, Alan Gates <ga...@yahoo-inc.com> wrote:
> POSort is only used for sorts of bags in memory (such as a sort inside a
> foreach), not for top-level sorts. In both cases the physical operators only
> capture part of the actual operations, since much of the work is done by the
> Hadoop framework.
>
> Very briefly, order by works by taking a sample of the input, building a
> partitioner that will produce a balanced total ordering of the data (that
> is, each part file will be approximately the same size) and then running an
> MR job that uses the order by key as the grouping key along with the
> just-built partitioner. Limit works by applying the limit in each mapper and
> then running a reduce pass in a single reducer, again applying the limit.
>
> Are these questions purely academic, or are there applications where you'd
> like to use Pig's order and limit but can't do the other processing in
> Pig? If the latter, I'd recommend checking out the new mapreduce command
> introduced in 0.8 (which we'll release here in a week or two, I hope), which
> allows you to invoke MR jobs from Pig. You can learn more about this at
> https://issues.apache.org/jira/browse/PIG-506. You can also see the
> documentation for this feature in
> http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup
> (search on MAPREDUCE). Sorry, this is the Forrest version. You can also
> see it in HTML by checking out the code and building it yourself.
>
> Alan.
Re: Implementation of ORDER and LIMIT
Posted by Alan Gates <ga...@yahoo-inc.com>.
POSort is only used for sorts of bags in memory (such as a sort inside a
foreach), not for top-level sorts. In both cases the physical operators
only capture part of the actual operations, since much of the work is
done by the Hadoop framework.
Very briefly, order by works by taking a sample of the input, building
a partitioner that will produce a balanced total ordering of the data
(that is, each part file will be approximately the same size) and then
running an MR job that uses the order by key as the grouping key along
with the just-built partitioner. Limit works by applying the limit in
each mapper and then running a reduce pass in a single reducer, again
applying the limit.
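The scheme just described can be sketched as a small in-memory simulation.
This is only an illustration under stated assumptions, not Pig's actual code:
the function names (sample_split_points, order_by, limit), the sample size
and the number of reducers are all hypothetical, and plain Python lists stand
in for mappers, reducers and part files. Hadoop's own TotalOrderPartitioner
is built on a similar sample-then-range-partition idea.

```python
import bisect
import random


def sample_split_points(records, key, num_reducers, sample_size=100, seed=0):
    """Sample the input and pick num_reducers - 1 split points (hypothetical helper)."""
    rng = random.Random(seed)
    picked = rng.sample(records, min(sample_size, len(records)))
    sample = sorted(key(r) for r in picked)
    if not sample:
        return []
    # Evenly spaced quantiles of the sorted sample become the range boundaries,
    # so each reducer should receive roughly the same number of records.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def order_by(records, key, num_reducers=4):
    """Simulate a total-order sort: range-partition by key, then sort each range."""
    splits = sample_split_points(records, key, num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for r in records:
        # "Partitioner": send the record to the reducer that owns its key range.
        partitions[bisect.bisect_right(splits, key(r))].append(r)
    # "Reducers": each one sorts only its own range (in Hadoop the shuffle does
    # this). Reading the part files in order then yields a globally sorted result.
    return [sorted(p, key=key) for p in partitions]


def limit(map_inputs, n):
    """Apply the limit in every 'mapper', then once more in a single 'reducer'."""
    per_mapper = [chunk[:n] for chunk in map_inputs]      # map side
    merged = [r for chunk in per_mapper for r in chunk]   # everything to one reducer
    return merged[:n]                                     # reduce side
```

For example, order_by(values, key=lambda v: v, num_reducers=4) returns four
sorted lists whose concatenation is the fully sorted input. Applying the
limit on the map side first means the single reducer sees at most n records
per mapper instead of the whole relation.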
Are these questions purely academic, or are there applications where
you'd like to use Pig's order and limit but can't do the other
processing in Pig? If the latter, I'd recommend checking out the new
mapreduce command introduced in 0.8 (which we'll release here in a
week or two, I hope), which allows you to invoke MR jobs from Pig. You
can learn more about this at https://issues.apache.org/jira/browse/PIG-506.
You can also see the documentation for this feature in
http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup
(search on MAPREDUCE). Sorry, this is the Forrest version. You can
also see it in HTML by checking out the code and building it yourself.
Alan.
On Nov 15, 2010, at 12:50 AM, Rekha Joshi wrote:
> Hi Josh
>
> AFAIR, all relational operators reside in PO*.java source files under
> o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
> Alternatively, check POLimit and POSort under http://pig.apache.org/docs/r0.7.0/api/
>
> PigServer is the starting point; internally it builds the logical and
> physical plans for the jobs. The execution engine executes the job.
> Refer to the files under o.a.p.backend.hadoop.executionengine.
> More details are at http://wiki.apache.org/pig/PigExecutionModel
>
> Thanks & Regards,
> /Rekha.
>
> On 11/14/10 7:59 PM, "Josh Devins" <in...@joshdevins.net> wrote:
>
> Hi all,
>
> I'm happily using Pig to ORDER BY and LIMIT some large relations quite
> effectively. However I'm curious about how these are/would be implemented in
> "raw" MapReduce. Can anyone shed some light/point to some details, examples
> or pseudo-code somewhere?
>
> Cheers,
>
> Josh
>
IntelliJ
Posted by Dhananjay Ragade <dr...@linkedin.com>.
Does anyone know of a plugin for Pig in IntelliJ (or an alternate plugin
that, if assigned to handle ".pig" files, can do a decent enough job with
syntax highlighting)?
thanks,
Dhananjay
Re: Implementation of ORDER and LIMIT
Posted by Rekha Joshi <re...@yahoo-inc.com>.
Hi Josh
AFAIR, all relational operators reside in PO*.java source files under o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
Alternatively, check POLimit and POSort under http://pig.apache.org/docs/r0.7.0/api/
PigServer is the starting point; internally it builds the logical and physical plans for the jobs. The execution engine executes the job. Refer to the files under o.a.p.backend.hadoop.executionengine.
More details are at http://wiki.apache.org/pig/PigExecutionModel
Thanks & Regards,
/Rekha.
On 11/14/10 7:59 PM, "Josh Devins" <in...@joshdevins.net> wrote:
Hi all,
I'm happily using Pig to ORDER BY and LIMIT some large relations quite
effectively. However I'm curious about how these are/would be implemented in
"raw" MapReduce. Can anyone shed some light/point to some details, examples
or pseudo-code somewhere?
Cheers,
Josh