Posted to user@mahout.apache.org by Sebastian Briesemeister <se...@unister-gmbh.de> on 2013/03/27 17:10:29 UTC

Number of Clustering MR-Jobs

Dear all,

I am trying to run the FuzzyKMeansDriver on a Hadoop cluster so that
it starts multiple MapReduce jobs. However, it always starts just a
single MR job.

I suspect this might be caused by the fact that I wrote my input data
into a single file using SequenceFile.Writer. Or is there another way
to influence the number of mapper tasks?

Thanks in advance
Sebastian

Re: Number of Clustering MR-Jobs

Posted by Sebastian Schelter <ss...@googlemail.com>.
It would also be very hard to do automatically, as clusters are shared
and a framework cannot know how much of the shared resources (available
map slots) it can take.

On 28.03.2013 10:07, Sean Owen wrote:
> This is really a Hadoop-level thing. I am not sure I have ever
> successfully induced M/R to run multiple mappers on less than one
> block of data, even with a low max split size. Reducers you can
> control.
> 
> On Thu, Mar 28, 2013 at 9:04 AM, Sebastian Briesemeister
> <se...@unister-gmbh.de> wrote:
>> Thank you.
>>
>> Splitting the files leads to multiple MR tasks!
>>
>> Changing only the MR settings of Hadoop did not help. In the future it
>> would be nice if the drivers would scale themselves and split the
>> data according to the dataset size and the number of available MR slots.


Re: Number of Clustering MR-Jobs

Posted by Sean Owen <sr...@gmail.com>.
This is really a Hadoop-level thing. I am not sure I have ever
successfully induced M/R to run multiple mappers on less than one
block of data, even with a low max split size. Reducers you can
control.
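
A minimal sketch of those two knobs, assuming the Hadoop 1.x "new" API
(the class name, reducer count, and byte threshold are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitTuning {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "fuzzy-kmeans-iteration");

        // Reducers can be set directly, as noted above.
        job.setNumReduceTasks(8);

        // Mappers can only be hinted at: a lower max split size asks for
        // more splits, but the framework will rarely run several mappers
        // on less than one HDFS block.
        FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024); // 16 MB
      }
    }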

On Thu, Mar 28, 2013 at 9:04 AM, Sebastian Briesemeister
<se...@unister-gmbh.de> wrote:
> Thank you.
>
> Splitting the files leads to multiple MR tasks!
>
> Changing only the MR settings of Hadoop did not help. In the future it
> would be nice if the drivers would scale themselves and split the
> data according to the dataset size and the number of available MR slots.

Re: Number of Clustering MR-Jobs

Posted by Sebastian Schelter <ss...@googlemail.com>.
Sebastian,

For CPU-bound problems like matrix factorization with ALS, we have
recently seen good results with multithreaded mappers, where we had the
users specify the number of cores to use per mapper.
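
A minimal sketch of that setup, assuming Hadoop's MultithreadedMapper
from the new API (the mapper class and thread count are illustrative;
the real map logic must be thread-safe):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedSetup {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cpu-bound-job");

        // The job's mapper is the thread-pool wrapper ...
        job.setMapperClass(MultithreadedMapper.class);
        // ... which runs the actual mapper in several threads.
        MultithreadedMapper.setMapperClass(job, CpuBoundMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 4); // cores per mapper
      }

      // Placeholder for the actual, thread-safe map logic.
      static class CpuBoundMapper
          extends Mapper<Object, Object, Object, Object> {}
    }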

On 28.03.2013 10:20, Ted Dunning wrote:
> This is a longstanding Hadoop issue.
> 
> Your suggestion is interesting, but only a few cases would benefit.  The
> problem is that splitting involves reading from a very small number of
> nodes and thus is not much better than just running the program with few
> mappers.  If the data is large enough to make splitting fast, then Hadoop
> will just do it.
> 
> The only win for splitting is when the cost per chunk is very high.  I
> think that only random forest might fit into that category.
> 
> On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
> sebastian.briesemeister@unister-gmbh.de> wrote:
> 
>> Splitting the files leads to multiple MR tasks!
>>
>> Changing only the MR settings of Hadoop did not help. In the future it
>> would be nice if the drivers would scale themselves and split the
>> data according to the dataset size and the number of available MR slots.
>>
> 


Re: Number of Clustering MR-Jobs

Posted by Sebastian Briesemeister <se...@unister-gmbh.de>.
I tried to increase the heap space, but it wasn't enough.

It seems the problem is not the number of mappers. I will start another
thread for this problem with some more details.

Cheers
Sebastian


On 28.03.2013 16:41, Dan Filimon wrote:
> From what I've seen, even if the mapper does throw an out-of-memory
> exception, Hadoop will restart it with increased memory.
>
> There are ways to configure the mapper/reducer JVMs to use more memory
> by default through the Configuration, although I don't recall the exact
> options. It's probably documented in your Hadoop distribution's
> documentation.
>
>
> On Thu, Mar 28, 2013 at 2:52 PM, Sebastian Briesemeister <
> sebastian.briesemeister@unister-gmbh.de> wrote:
>
>> In my case, each map process requires a lot of memory and I would like
>> to distribute this consumption across multiple nodes.
>>
>> However, I still get out-of-memory exceptions even if I split the input
>> file into several very small input files. I thought the mapper would
>> consider only one file at a time and would therefore have no problems
>> with heap space?
>>
>>
>>
>> On 28.03.2013 10:20, Ted Dunning wrote:
>>> This is a longstanding Hadoop issue.
>>>
>>> Your suggestion is interesting, but only a few cases would benefit.  The
>>> problem is that splitting involves reading from a very small number of
>>> nodes and thus is not much better than just running the program with few
>>> mappers.  If the data is large enough to make splitting fast, then Hadoop
>>> will just do it.
>>>
>>> The only win for splitting is when the cost per chunk is very high.  I
>>> think that only random forest might fit into that category.
>>>
>>> On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
>>> sebastian.briesemeister@unister-gmbh.de> wrote:
>>>
>>>> Splitting the files leads to multiple MR tasks!
>>>>
>>>> Changing only the MR settings of Hadoop did not help. In the future it
>>>> would be nice if the drivers would scale themselves and split the
>>>> data according to the dataset size and the number of available MR slots.
>>>>
>>


Re: Number of Clustering MR-Jobs

Posted by Dan Filimon <da...@gmail.com>.
From what I've seen, even if the mapper does throw an out-of-memory
exception, Hadoop will restart it with increased memory.

There are ways to configure the mapper/reducer JVMs to use more memory
by default through the Configuration, although I don't recall the exact
options. It's probably documented in your Hadoop distribution's
documentation.
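
If memory serves, the knob in Hadoop 1.x is mapred.child.java.opts; a
minimal sketch, with an illustrative heap size:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ChildHeap {
      public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        // JVM flags passed to every spawned map/reduce child task.
        conf.set("mapred.child.java.opts", "-Xmx2048m");
        return new Job(conf, "memory-hungry-job");
      }
    }

Drivers that go through ToolRunner should also accept the same property
on the command line as -Dmapred.child.java.opts=-Xmx2048m.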


On Thu, Mar 28, 2013 at 2:52 PM, Sebastian Briesemeister <
sebastian.briesemeister@unister-gmbh.de> wrote:

> In my case, each map process requires a lot of memory and I would like
> to distribute this consumption across multiple nodes.
>
> However, I still get out-of-memory exceptions even if I split the input
> file into several very small input files. I thought the mapper would
> consider only one file at a time and would therefore have no problems
> with heap space?
>
>
>
> On 28.03.2013 10:20, Ted Dunning wrote:
> > This is a longstanding Hadoop issue.
> >
> > Your suggestion is interesting, but only a few cases would benefit.  The
> > problem is that splitting involves reading from a very small number of
> > nodes and thus is not much better than just running the program with few
> > mappers.  If the data is large enough to make splitting fast, then Hadoop
> > will just do it.
> >
> > The only win for splitting is when the cost per chunk is very high.  I
> > think that only random forest might fit into that category.
> >
> > On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
> > sebastian.briesemeister@unister-gmbh.de> wrote:
> >
> >> Splitting the files leads to multiple MR tasks!
> >>
> >> Changing only the MR settings of Hadoop did not help. In the future it
> >> would be nice if the drivers would scale themselves and split the
> >> data according to the dataset size and the number of available MR slots.
> >>
>
>

Re: Number of Clustering MR-Jobs

Posted by Sebastian Briesemeister <se...@unister-gmbh.de>.
In my case, each map process requires a lot of memory and I would like
to distribute this consumption across multiple nodes.

However, I still get out-of-memory exceptions even if I split the input
file into several very small input files. I thought the mapper would
consider only one file at a time and would therefore have no problems
with heap space?



On 28.03.2013 10:20, Ted Dunning wrote:
> This is a longstanding Hadoop issue.
>
> Your suggestion is interesting, but only a few cases would benefit.  The
> problem is that splitting involves reading from a very small number of
> nodes and thus is not much better than just running the program with few
> mappers.  If the data is large enough to make splitting fast, then Hadoop
> will just do it.
>
> The only win for splitting is when the cost per chunk is very high.  I
> think that only random forest might fit into that category.
>
> On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
> sebastian.briesemeister@unister-gmbh.de> wrote:
>
>> Splitting the files leads to multiple MR tasks!
>>
>> Changing only the MR settings of Hadoop did not help. In the future it
>> would be nice if the drivers would scale themselves and split the
>> data according to the dataset size and the number of available MR slots.
>>


Re: Number of Clustering MR-Jobs

Posted by Ted Dunning <te...@gmail.com>.
This is a longstanding Hadoop issue.

Your suggestion is interesting, but only a few cases would benefit.  The
problem is that splitting involves reading from a very small number of
nodes and thus is not much better than just running the program with few
mappers.  If the data is large enough to make splitting fast, then Hadoop
will just do it.

The only win for splitting is when the cost per chunk is very high.  I
think that only random forest might fit into that category.

On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
sebastian.briesemeister@unister-gmbh.de> wrote:

> Splitting the files leads to multiple MR tasks!
>
> Changing only the MR settings of Hadoop did not help. In the future it
> would be nice if the drivers would scale themselves and split the
> data according to the dataset size and the number of available MR slots.
>

Re: Number of Clustering MR-Jobs

Posted by Sebastian Briesemeister <se...@unister-gmbh.de>.
Thank you.

Splitting the files leads to multiple MR tasks!

Changing only the MR settings of Hadoop did not help. In the future it
would be nice if the drivers would scale themselves and split the
data according to the dataset size and the number of available MR slots.

Cheers
Sebastian

On 28.03.2013 07:25, Dan Filimon wrote:
> Yes, it does depend on the number of mappers, and what Ted suggested
> (splitting the input file) worked for me.
>
> Here's [1] the code I used to split a SequenceFile (I wrote it so that
> it re-splits m files into n files, hence the name).
>
> [1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/java/org/apache/mahout/clustering/streaming/tools/ResplitSequenceFiles.java
>
> On Thu, Mar 28, 2013 at 2:26 AM, Ted Dunning <te...@gmail.com> wrote:
>> Your idea that this is related to your single input file is the most likely
>> cause.
>>
>> If your input file is relatively small then splitting it up to force
>> multiple mappers is the easiest solution.
>>
>> If your input file is larger, then you might be able to convince the
>> map-reduce framework to use more mappers.
>>
>> On Wed, Mar 27, 2013 at 6:09 PM, Sebastian Briesemeister <
>> sebastian.briesemeister@unister.de> wrote:
>>
>>> Yes, correct. It currently starts a single Map task.
>>>
>>>
>>>
>>> Ted Dunning <te...@gmail.com> wrote:
>>>
>>>> Do you mean that it starts a single map task?
>>>>
>>>> On Wed, Mar 27, 2013 at 5:10 PM, Sebastian Briesemeister <
>>>> sebastian.briesemeister@unister-gmbh.de> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I am trying to run the FuzzyKMeansDriver on a Hadoop cluster so
>>>>> that it starts multiple MapReduce jobs. However, it always starts
>>>>> just a single MR job.
>>>>>
>>>>> I suspect this might be caused by the fact that I wrote my input
>>>>> data into a single file using SequenceFile.Writer. Or is there
>>>>> another way to influence the number of mapper tasks?
>>>>>
>>>>> Thanks in advance
>>>>> Sebastian
>>>>>
>>> --
>>> This message was sent from my Android mobile phone with K-9 Mail.


Re: Number of Clustering MR-Jobs

Posted by Dan Filimon <da...@gmail.com>.
Yes, it does depend on the number of mappers, and what Ted suggested
(splitting the input file) worked for me.

Here's [1] the code I used to split a SequenceFile (I wrote it so that
it re-splits m files into n files, hence the name).

[1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/java/org/apache/mahout/clustering/streaming/tools/ResplitSequenceFiles.java
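
The gist of it, as a stripped-down sketch against the plain SequenceFile
API (paths, the part count, and the class name are illustrative; the
linked class additionally handles m input files):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SplitSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);
        int n = Integer.parseInt(args[1]); // number of output parts

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
        Writable key = (Writable)
            ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable)
            ReflectionUtils.newInstance(reader.getValueClass(), conf);

        // One writer per output part, same key/value classes as the input.
        SequenceFile.Writer[] writers = new SequenceFile.Writer[n];
        for (int i = 0; i < n; i++) {
          writers[i] = SequenceFile.createWriter(fs, conf,
              new Path(in.getParent(), "part-" + i),
              reader.getKeyClass(), reader.getValueClass());
        }

        // Deal records out round-robin so the parts end up equally sized.
        for (long r = 0; reader.next(key, value); r++) {
          writers[(int) (r % n)].append(key, value);
        }

        reader.close();
        for (SequenceFile.Writer w : writers) {
          w.close();
        }
      }
    }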

On Thu, Mar 28, 2013 at 2:26 AM, Ted Dunning <te...@gmail.com> wrote:
> Your idea that this is related to your single input file is the most likely
> cause.
>
> If your input file is relatively small then splitting it up to force
> multiple mappers is the easiest solution.
>
> If your input file is larger, then you might be able to convince the
> map-reduce framework to use more mappers.
>
> On Wed, Mar 27, 2013 at 6:09 PM, Sebastian Briesemeister <
> sebastian.briesemeister@unister.de> wrote:
>
>> Yes, correct. It currently starts a single Map task.
>>
>>
>>
>> Ted Dunning <te...@gmail.com> wrote:
>>
>> >Do you mean that it starts a single map task?
>> >
>> >On Wed, Mar 27, 2013 at 5:10 PM, Sebastian Briesemeister <
>> >sebastian.briesemeister@unister-gmbh.de> wrote:
>> >
>> >> Dear all,
>> >>
>> >> I am trying to run the FuzzyKMeansDriver on a Hadoop cluster so
>> >> that it starts multiple MapReduce jobs. However, it always starts
>> >> just a single MR job.
>> >>
>> >> I suspect this might be caused by the fact that I wrote my input
>> >> data into a single file using SequenceFile.Writer. Or is there
>> >> another way to influence the number of mapper tasks?
>> >>
>> >> Thanks in advance
>> >> Sebastian
>> >>
>>
>> --
>> This message was sent from my Android mobile phone with K-9 Mail.

Re: Number of Clustering MR-Jobs

Posted by Ted Dunning <te...@gmail.com>.
Your idea that this is related to your single input file is the most likely
cause.

If your input file is relatively small then splitting it up to force
multiple mappers is the easiest solution.

If your input file is larger, then you might be able to convince the
map-reduce framework to use more mappers.
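
One way to try to convince it, assuming the old org.apache.hadoop.mapred
API (the count below is only a hint to the framework, not a guarantee):

    import org.apache.hadoop.mapred.JobConf;

    public class MapperHint {
      public static JobConf withMapperHint(JobConf conf) {
        // A hint only: InputFormat.getSplits() receives this number, but
        // whether it actually yields more mappers depends on the input
        // format, the minimum split size, and the block layout.
        conf.setNumMapTasks(16);
        return conf;
      }
    }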

On Wed, Mar 27, 2013 at 6:09 PM, Sebastian Briesemeister <
sebastian.briesemeister@unister.de> wrote:

> Yes, correct. It currently starts a single Map task.
>
>
>
> Ted Dunning <te...@gmail.com> wrote:
>
> >Do you mean that it starts a single map task?
> >
> >On Wed, Mar 27, 2013 at 5:10 PM, Sebastian Briesemeister <
> >sebastian.briesemeister@unister-gmbh.de> wrote:
> >
> >> Dear all,
> >>
> >> I am trying to run the FuzzyKMeansDriver on a Hadoop cluster so
> >> that it starts multiple MapReduce jobs. However, it always starts
> >> just a single MR job.
> >>
> >> I suspect this might be caused by the fact that I wrote my input
> >> data into a single file using SequenceFile.Writer. Or is there
> >> another way to influence the number of mapper tasks?
> >>
> >> Thanks in advance
> >> Sebastian
> >>
>
> --
> This message was sent from my Android mobile phone with K-9 Mail.

Re: Number of Clustering MR-Jobs

Posted by Sebastian Briesemeister <se...@unister.de>.
Yes, correct. It currently starts a single Map task. 



Ted Dunning <te...@gmail.com> wrote:

>Do you mean that it starts a single map task?
>
>On Wed, Mar 27, 2013 at 5:10 PM, Sebastian Briesemeister <
>sebastian.briesemeister@unister-gmbh.de> wrote:
>
>> Dear all,
>>
>> I am trying to run the FuzzyKMeansDriver on a Hadoop cluster so that
>> it starts multiple MapReduce jobs. However, it always starts just a
>> single MR job.
>>
>> I suspect this might be caused by the fact that I wrote my input data
>> into a single file using SequenceFile.Writer. Or is there another way
>> to influence the number of mapper tasks?
>>
>> Thanks in advance
>> Sebastian
>>

--
This message was sent from my Android mobile phone with K-9 Mail.

Re: Number of Clustering MR-Jobs

Posted by Ted Dunning <te...@gmail.com>.
Do you mean that it starts a single map task?

On Wed, Mar 27, 2013 at 5:10 PM, Sebastian Briesemeister <
sebastian.briesemeister@unister-gmbh.de> wrote:

> Dear all,
>
> I am trying to run the FuzzyKMeansDriver on a Hadoop cluster so that
> it starts multiple MapReduce jobs. However, it always starts just a
> single MR job.
>
> I suspect this might be caused by the fact that I wrote my input data
> into a single file using SequenceFile.Writer. Or is there another way
> to influence the number of mapper tasks?
>
> Thanks in advance
> Sebastian
>