Posted to user@crunch.apache.org by Vincent Fabro <vi...@gmail.com> on 2015/05/03 00:12:32 UTC

Access number of reducer tasks from Crunch

Dear all

Is it possible to access the number of reducer tasks from Crunch (something
equivalent to context.getNumReduceTasks() in Hadoop)?

Context: I'm porting Nutch to Crunch. One operation (in GeneratorJob.java,
GeneratorMapper.java and GeneratorReducer.java -
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java)
takes the top n URLs according to a score. If I understand correctly,
"n / number of reduce tasks" URLs are selected in each reduce task
(GeneratorReducer, line 102). With a good shuffle, the result is good enough.

Thanks in advance!

Vincent
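
The per-reducer quota described in the question can be sketched in plain
Java. This is a standalone illustration, not Nutch or Crunch code: the class
and method names are hypothetical, each inner list stands in for the records
one reduce task receives, and the use of integer (floor) division for the
quota is an assumption rather than something confirmed from the Nutch source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: simulates "n / number of reduce tasks" selection.
final class ReducerQuotaSketch {

    // Quota for each reduce task (floor division is an assumption).
    static long perReducerLimit(long topN, int numReduceTasks) {
        if (numReduceTasks <= 0) {
            throw new IllegalArgumentException("numReduceTasks must be positive");
        }
        return topN / numReduceTasks;
    }

    // Each inner list models the scores shuffled to one reduce task.
    // Every task keeps only its own highest-scoring entries, up to the quota.
    static List<Double> select(List<List<Double>> partitions, long topN) {
        long limit = perReducerLimit(topN, partitions.size());
        List<Double> selected = new ArrayList<>();
        for (List<Double> partition : partitions) {
            List<Double> sorted = new ArrayList<>(partition);
            sorted.sort(Collections.reverseOrder()); // highest score first
            for (int i = 0; i < sorted.size() && i < limit; i++) {
                selected.add(sorted.get(i));
            }
        }
        return selected;
    }
}
```

With n = 2 and two reduce tasks, partitions {0.9, 0.1} and {0.8, 0.7} give
the true top two (0.9 and 0.8). If the shuffle instead produces {0.9, 0.8}
and {0.1, 0.2}, the selection is 0.9 and 0.2, which is why the result is only
"good enough" when high scores are spread evenly across reduce tasks.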

Re: Access number of reducer tasks from Crunch

Posted by Vincent Fabro <vi...@gmail.com>.
OK, I missed Aggregate.top() (I guess my search wasn't thorough).
I'll go with the framework's built-in function, which seems cleaner than
using the Context.

Thanks a lot for your answers!

Vincent

On Sun, May 3, 2015 at 8:11 AM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Vincent,
>
> Yeah, you can get at it. Each DoFn inherits a protected getContext()
> method, and the context it returns has getNumReduceTasks() defined on it,
> just as in the Nutch code you cited. We try (with varying degrees of
> success) to make the underlying MR framework as accessible as possible.
>
> J

Re: Access number of reducer tasks from Crunch

Posted by Josh Wills <jw...@cloudera.com>.
Hey Vincent,

Yeah, you can get at it. Each DoFn inherits a protected getContext() method,
and the context it returns has getNumReduceTasks() defined on it, just as in
the Nutch code you cited. We try (with varying degrees of success) to make
the underlying MR framework as accessible as possible.

J

On Sun, May 3, 2015 at 2:16 AM, David Ortiz <dp...@gmail.com> wrote:

> Do you actually care about the number of reducers, or do you just want the
> top n from a table?  The latter is built into the framework.


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Access number of reducer tasks from Crunch

Posted by David Ortiz <dp...@gmail.com>.
Do you actually care about the number of reducers, or do you just want the
top n from a table?  The latter is built into the framework.
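
For comparison, an exact top n (the result a built-in such as the
Aggregate.top() mentioned elsewhere in this thread is meant to give) can be
sketched in a single process with a bounded min-heap. This is only an
illustration of the idea and makes no claim about how Crunch implements it;
the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: exact top n over a stream of scores.
final class TopNSketch {

    static List<Double> topN(Iterable<Double> scores, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap holding at most n values; the root is the smallest kept score.
        PriorityQueue<Double> heap = new PriorityQueue<>(n);
        for (Double score : scores) {
            if (heap.size() < n) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();     // drop the current smallest
                heap.add(score); // keep the larger score instead
            }
        }
        List<Double> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // highest score first
        return result;
    }
}
```

Unlike a per-reducer quota, this does not depend on how evenly the scores are
distributed, at the cost of funnelling the candidates through one final
selection step.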
