Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2013/08/01 17:00:44 UTC
Re: Modify number of mappers for a mahout process?
One trick to getting more mappers on a job when running from the command
line is to pass a '-Dmapred.max.split.size=xxxx' argument. The xxxx is a
size in bytes. So if you have some hypothetical 10MB input set, but you
want to force ~100 mappers, use '-Dmapred.max.split.size=1000000'
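As a quick sketch of the arithmetic behind that flag (the input size, mapper count, and kmeans invocation below are hypothetical, not taken from this thread):

```shell
# Hypothetical: a 50 MB input that we want split across ~200 mappers.
INPUT_BYTES=$((50 * 1024 * 1024))
DESIRED_MAPPERS=200
SPLIT_SIZE=$((INPUT_BYTES / DESIRED_MAPPERS))
echo "$SPLIT_SIZE"   # 262144

# The -D flag goes before the job's own options, e.g. (paths illustrative):
# mahout kmeans -Dmapred.max.split.size=$SPLIT_SIZE -i <input> -c <clusters> -o <output> -k 20
```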
On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit <ga...@ebay.com> wrote:
>
> Hi,
>
> It sounds to me like this could be related to one of the Qs I posted
> several days ago (is it?):
> My mahout clustering processes seem to be running very slowly (several
> hours on just ~1M items), and I'm wondering whether anything needs to be
> changed in the settings/configuration (and how?).
> I'm running on a large cluster and could potentially use thousands
> of nodes (mappers/reducers). However, my mahout processes (kmeans/canopy,
> etc.) are only using at most 5 mappers (I tried it on several data sets).
> I've tried to set the number of mappers with something like
> -Dmapred.map.tasks=100, but this didn't seem to have any effect; it still
> only uses <=5 mappers.
> Is there a different way to set the number of mappers/reducers for
> a mahout process?
> Or is there another configuration issue I need to consider?
>
> I'd definitely be happy to use such a parameter, does it not exist?
> (I'm running mahout as installed on the cluster)
>
> Is there currently a workaround, besides running a mahout jar as a hadoop
> job?
> When I originally tried to run a mahout jar that uses KMeansDriver (and
> that runs great on my local machine), it did not even initiate a job on the
> hadoop cluster. It seemed to be running in parallel, but in fact it was
> running only on the local node. Is this a known issue? Is there a fix for
> it? (I ended up dropping it and calling mahout step by step from the
> command line, but I'd be happy to know if there is a fix.)
>
> Thanks,
>
> Galit.
>
> -----Original Message-----
> From: Ryan Josal [mailto:rjosal@gmail.com]
> Sent: Monday, July 29, 2013 9:33 PM
> To: Adam Baron
> Cc: Ryan Josal; user@mahout.apache.org
> Subject: Re: Run more than one mapper for TestForest?
>
> If you're running mahout from the CLI, you'll have to modify the Hadoop
> config file or your env manually for each job. This is code I put into my
> custom job executions so I didn't have to calculate and set that up every
> time. Maybe that's your best route in your position. You could just
> provide your own mahout jar and run it as you would any other Hadoop job
> and ignore the installed Mahout. I do think this could be a useful
> parameter for a number of standard mahout jobs though; I know I would use
> it. Does anyone in the mahout community see this as a generally useful
> feature for a Mahout job?
>
> Ryan
>
> On Jul 29, 2013, at 10:25, Adam Baron <ad...@gmail.com> wrote:
>
> > Ryan,
> >
> > Thanks for the fix, the code looks reasonable to me. Which version of
> Mahout will this be in? 0.9?
> >
> > Unfortunately, I'm using a large shared Hadoop cluster which is not
> administered by my team. So I'm not in a position to push the latest from
> the Mahout dev trunk into our environment; the admins will only install
> official releases.
> >
> > Regards,
> > Adam
> >
> > On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <ry...@josal.com> wrote:
> >> Late reply, but for what it's still worth, since I've seen a couple
> other threads here on the topic of too few mappers, I added a parameter to
> set a minimum number of mappers. Some of my mahout jobs needed more
> mappers, but were not given many because of the small input file size.
> >>
> >> addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));
> >>
> >> int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
> >> int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
> >> log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
> >> if (minMapTasks > mapTasksThatWouldRun) {
> >>     String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
> >>     log.info("Forcing mapred.max.split.size to " + splitSizeBytes
> >>             + " to ensure minimum map tasks = " + minMapTasks);
> >>     hadoopConf.set("mapred.max.split.size", splitSizeBytes);
> >> }
> >>
> >> // there is actually a private method in hadoop to calculate this
> >> private long getSplitSize() {
> >>     long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
> >>     long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
> >>     int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
> >>     long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
> >>     log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
> >>             minSize, blockSize, maxSize, splitSize));
> >>     return splitSize;
> >> }
> >>
> >> It seems like there should be a more straightforward way to do this,
> but it works for me and I've used it on a lot of jobs to set a minimum
> number of mappers.
> >>
> >> Ryan
> >>
> >> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
> >>
> >> > I'm attempting to run
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest
> >> > on a CSV with 200,000 rows that have 500,000 features per row.
> >> > However, TestForest is running extremely slowly, likely because only
> >> > 1 mapper was assigned to the job. This seems strange because the
> >> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
> >> > same data used 1772 mappers and took about 6 minutes. (BTW: I know
> >> > I
> >> > *shouldn't* use the same data set for the training and the testing
> >> > steps; this is purely a technical experiment to see if Mahout's
> >> > Random Forest can handle the data sizes we typically deal with).
> >> >
> >> > Any idea on how to get
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest
> >> > to use more mappers? Glancing at the code (and thinking about what
> >> > is happening intuitively), it should be ripe for parallelization.
> >> >
> >> > Thanks,
> >> > Adam
> >
>
Re: Modify number of mappers for a mahout process?
Posted by Ryan Josal <rj...@gmail.com>.
Galit, yes, this does sound related, and as Matt said, you can test it by setting the max split size on the CLI. I didn't personally find that to be a reliable and efficient method, so I wrote the -m parameter for my job to set it correctly every time. It seems this would be useful as a general parameter for Mahout jobs; is there agreement on that, and if so, can I get some guidance on how to contribute?
Ryan
Re: Modify number of mappers for a mahout process?
Posted by Matt Molek <mp...@gmail.com>.
Oops, I'm sorry. I had one too many zeros there; it should be
'-Dmapred.max.split.size=100000'.
Just (input size) / (desired number of mappers).
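A one-line sanity check of the corrected value (treating the 10 MB example as an even 10,000,000 bytes):

```shell
INPUT_BYTES=10000000     # ~10 MB input
DESIRED_MAPPERS=100
echo $((INPUT_BYTES / DESIRED_MAPPERS))   # 100000
```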