Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2013/08/01 17:00:44 UTC

Re: Modify number of mappers for a mahout process?

One trick to getting more mappers on a job when running from the command
line is to pass a '-Dmapred.max.split.size=xxxx' argument. The xxxx is a
size in bytes. So if you have some hypothetical 10MB input set, but you
want to force ~100 mappers, use '-Dmapred.max.split.size=1000000'
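
For example, a hypothetical kmeans run with that argument might look like
this (the input/output paths and the -k and -x values here are just
placeholders):

    bin/mahout kmeans -Dmapred.max.split.size=<bytes> \
        -i /path/to/input/vectors -c /path/to/initial/clusters \
        -o /path/to/output -k 20 -x 10 -cl

The -D argument generally needs to come before the job's own options so
that Hadoop's generic option parser picks it up.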


On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit <ga...@ebay.com> wrote:

>
> Hi,
>
> It sounds to me like this could be related to one of the Qs I've posted
> several days ago (is it?):
> My mahout clustering processes seem to be running very slowly (several
> hours on just ~1M items), and I'm wondering if anything needs to be changed
> in my settings/configuration (and how?).
>         I'm running on a large cluster and could potentially use thousands
> of nodes (mappers/reducers). However, my mahout processes (kmeans/canopy,
> etc.) are only using at most 5 mappers (I tried this on several data sets).
>         I've tried to set the number of mappers with something like
> -Dmapred.map.tasks=100, but this didn't seem to have an effect; it still
> only uses <=5 mappers.
>         Is there a different way to set the number of mappers/reducers for
> a mahout process?
>         Or is there another configuration issue I need to consider?
>
> I'd definitely be happy to use such a parameter; does it not exist?
> (I'm running mahout as installed on the cluster)
>
> Is there currently a workaround, besides running a mahout jar as a Hadoop
> job?
> When I originally tried to run a mahout jar that uses KMeansDriver (and
> that runs great on my local machine), it did not even initiate a job on the
> Hadoop cluster. It seemed to be running in parallel, but in fact it was
> running only on the local node. Is this a known issue? Is there a fix for
> this? (I ended up dropping that approach and calling mahout step by step
> from the command line, but I'd be happy to know if there is a fix.)
>
> Thanks,
>
> Galit.
>
> -----Original Message-----
> From: Ryan Josal [mailto:rjosal@gmail.com]
> Sent: Monday, July 29, 2013 9:33 PM
> To: Adam Baron
> Cc: Ryan Josal; user@mahout.apache.org
> Subject: Re: Run more than one mapper for TestForest?
>
> If you're running mahout from the CLI, you'll have to modify the Hadoop
> config file or your env manually for each job.  This is code I put into my
> custom job executions so I didn't have to calculate and set that up every
> time.  Maybe that's your best route in that position.  You could also just
> provide your own mahout jar and run it as you would any other Hadoop job
> and ignore the installed Mahout.  I do think this could be a useful
> parameter for a number of standard mahout jobs though; I know I would use
> it.  Does anyone in the mahout community see this as a generally useful
> feature for a Mahout job?
>
> Ryan
>
> On Jul 29, 2013, at 10:25, Adam Baron <ad...@gmail.com> wrote:
>
> > Ryan,
> >
> > Thanks for the fix, the code looks reasonable to me.  Which version of
> Mahout will this be in?  0.9?
> >
> > Unfortunately, I'm using a large shared Hadoop cluster which is not
> administered by my team.  So I'm not in a position to push the latest from
> the Mahout dev trunk into our environment; the admins will only install
> official releases.
> >
> > Regards,
> >           Adam
> >
> > On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <ry...@josal.com> wrote:
> >> Late reply, but for what it's still worth, since I've seen a couple
> other threads here on the topic of too few mappers, I added a parameter to
> set a minimum number of mappers.  Some of my mahout jobs needed more
> mappers, but were not given many because of the small input file size.
> >>
> >>         addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));
> >>
> >>         int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
> >>         int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
> >>         log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
> >>         if (minMapTasks > mapTasksThatWouldRun) {
> >>             String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
> >>             log.info("Forcing mapred.max.split.size to " + splitSizeBytes + " to ensure minimum map tasks = " + minMapTasks);
> >>             hadoopConf.set("mapred.max.split.size", splitSizeBytes);
> >>         }
> >>
> >>     // there is actually a private method in hadoop to calculate this
> >>     private long getSplitSize() {
> >>         long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
> >>         long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
> >>         int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
> >>         long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
> >>         log.info(String.format("min: %,d block: %,d max: %,d split: %,d", minSize, blockSize, maxSize, splitSize));
> >>         return splitSize;
> >>     }
> >>
> >> It seems like there should be a more straightforward way to do this,
> but it works for me and I've used it on a lot of jobs to set a minimum
> number of mappers.
> >>
> >> Ryan
> >>
> >> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
> >>
> >> > I'm attempting to run
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest
> >> > on a CSV with 200,000 rows that have 500,000 features per row.
> >> > However, TestForest is running extremely slowly, likely because only
> >> > 1 mapper was assigned to the job.  This seems strange because the
> >> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
> >> > same data used 1772 mappers and took about 6 minutes.  (BTW: I know
> >> > I
> >> > *shouldn't* use the same data set for the training and the testing
> >> > steps; this is purely a technical experiment to see if Mahout's
> >> > Random Forest can handle the data sizes we typically deal with).
> >> >
> >> > Any idea on how to get
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest
> >> > to use more mappers?  Glancing at the code (and thinking about what
> >> > is happening intuitively), it should be ripe for parallelization.
> >> >
> >> > Thanks,
> >> >        Adam
> >
>

Re: Modify number of mappers for a mahout process?

Posted by Ryan Josal <rj...@gmail.com>.
Galit, yes, this does sound related, and as Matt said, you can test it by setting the max split size on the CLI.  I didn't personally find that to be a reliable and efficient method, so I wrote the -m parameter into my job to set it correctly every time.  It seems that this would be useful to have as a general parameter for Mahout jobs; is there agreement on this, and if so, can I get some guidance on how to contribute?
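
For reference, with the minMapTasks option from my snippet earlier in this
thread, an invocation of one of my custom jobs looks roughly like this (the
jar and class names here are only placeholders):

    hadoop jar my-mahout-jobs.jar com.example.MyClusteringJob \
        -i /path/to/input -o /path/to/output --minMapTasks 100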

Ryan


Re: Modify number of mappers for a mahout process?

Posted by Matt Molek <mp...@gmail.com>.
Oops, I'm sorry. I had one too many zeros there; it should be
'-Dmapred.max.split.size=100000'

Just compute (input size) / (desired number of mappers).
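
For example, for the hypothetical 10MB input above and ~100 desired mappers:
10,000,000 bytes / 100 mappers = 100,000 bytes per split, i.e.
'-Dmapred.max.split.size=100000'.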