Posted to user@mahout.apache.org by Phoenix Bai <ba...@gmail.com> on 2012/11/13 13:01:32 UTC

Issue: Canopy processing is extremely slow, what is going wrong?

Hi All,

1) data size:
environment: company's Hadoop cluster.
Raw data: 12 MB
tf-idf vectors: 25 MB (ng is set to 2)

2) running command:
The tf-idf vectors are fed to canopy with the command below:

hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
org.apache.mahout.clustering.canopy.CanopyDriver \
-Dmapred.max.split.size=4000000 \
-i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
-o /mahout/output/tbvideo-canopy-centroids/ \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 0.70 -t2 0.3

3) canopy running status:
The MR job runs seemingly forever. The map tasks finish quickly, but the
reduce task always hangs at 66%, like below:

12/11/13 16:29:00 INFO mapred.JobClient:  map 96% reduce 0%
12/11/13 16:29:07 INFO mapred.JobClient:  map 96% reduce 30%
12/11/13 16:29:26 INFO mapred.JobClient:  map 100% reduce 30%
12/11/13 16:29:41 INFO mapred.JobClient:  map 100% reduce 66%
12/11/13 19:34:39 INFO mapred.JobClient:  map 100% reduce 0%
12/11/13 19:34:47 INFO mapred.JobClient: Task Id :
attempt_201210311519_1936030_r_000000_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 137.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:456)
12/11/13 19:35:06 INFO mapred.JobClient:  map 100% reduce 66%

or sometimes an error like this:

000000_0, Status : FAILED
Task attempt_201210311519_1900983_r_000000_0 failed to report status
for 600 seconds. Killing!
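As an aside, exit status 137 encodes a signal: a status above 128 means the process died from signal (status - 128), and 137 - 128 = 9 is SIGKILL, which on a Hadoop cluster usually means the OS OOM killer or the task tracker killed a task JVM that exceeded its memory limit. A quick check of the arithmetic:

```python
import signal

# Hadoop surfaces the raw process exit status. Values above 128 mean
# the process was terminated by a signal numbered (status - 128).
exit_status = 137
sig = exit_status - 128
print(sig)                        # 9
assert sig == int(signal.SIGKILL)  # SIGKILL: the JVM was killed externally
```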

Here is the jstack dump when it gets to 66%:

"main" prio=10 tid=0x000000005071a000 nid=0x7ab8 runnable [0x0000000040a3a000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.mahout.math.OrderedIntDoubleMapping.find(OrderedIntDoubleMapping.java:83)
        at org.apache.mahout.math.OrderedIntDoubleMapping.get(OrderedIntDoubleMapping.java:88)
        at org.apache.mahout.math.SequentialAccessSparseVector.getQuick(SequentialAccessSparseVector.java:184)
        at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:138)
        at org.apache.mahout.clustering.AbstractCluster.formatVector(AbstractCluster.java:301)
        at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:163)
        at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:44)
        at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:29)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:544)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
        at org.apache.hadoop.mapred.Child.main(Child.java:167)

4) So, my questions are:

What is wrong? Why does it always hang at 66%?
I thought canopy was a faster algorithm than k-means,
but in this case k-means runs a whole lot faster than canopy.
I have run canopy several times across two days and have never seen it
finish; it always throws errors when the reduce phase reaches 66%.

Please enlighten me, or point me toward what the problem could be
and how I could fix it.
It is only 30 MB of data, so it can't be the size, right?

Thanks all in advance!

Re: Issue: Canopy processing is extremely slow, what is going wrong?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Keep trying larger values until you get a tractable number of canopies, 
then run cluster dumper to see what they look like. You may also need to 
increase the heap memory available to your reducers. It is an iterative 
process.
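
Combining those two suggestions on the command line could look like the sketch below, based on the original command from this thread. The 0.8 threshold and 2 GB heap are only illustrative starting values, and mapred.child.java.opts is the classic MR1 setting for task JVM heap:

```shell
# Start with T1 = T2 at a large value, then shrink until the canopy
# count looks tractable; also give the task JVMs more heap.
hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
  org.apache.mahout.clustering.canopy.CanopyDriver \
  -Dmapred.max.split.size=4000000 \
  -Dmapred.child.java.opts=-Xmx2048m \
  -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
  -o /mahout/output/tbvideo-canopy-centroids/ \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 0.8 -t2 0.8
```

After each run, inspect the resulting canopies with the cluster dumper before lowering the thresholds further.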



Re: Issue: Canopy processing is extremely slow, what is going wrong?

Posted by Phoenix Bai <ba...@gmail.com>.
Hi Jeff,

It is really nice of you to reply. :)

I changed t2=0.45 and ran it again, but it still got stuck at 66%.

I am using the cosine measure, so the range of values that makes sense to me
is 0-1, and 0.45 seems to be the biggest value I could go to, but it is still
not working.

So, what is the problem here? Is it the implementation of the code, or am I
setting the parameters to values that are way off?
Is there any more info I could provide to help you help me analyze
the issue?

Would it help if I set t3 and t4?

thanks


Re: Issue: Canopy processing is extremely slow, what is going wrong?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Canopy is very sensitive to the value of T2: Too small a value will 
cause the creation of very many canopies in each mapper and these will 
swamp the reducer.  I suggest you begin with T1=T2= <a larger value> 
until you get enough canopies. With CosineDistanceMeasure, a value of 1 
ought to produce only a single canopy and you can go smaller until you 
get a reasonable number. There are also T3 and T4 arguments that allow 
you to specify the T1 and T2 values used by the reducer.
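
The sensitivity to T2 can be seen in a toy single-process sketch of the canopy pass. This is an illustration of the algorithm, not Mahout's actual code, and the point layout and thresholds are made up: each pass removes only the points within T2 of the chosen center, so a small T2 removes almost nothing and nearly every point seeds its own canopy.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity, so the distance ranges over [0, 2].
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def canopy(points, t1, t2, dist=cosine_distance):
    """One canopy pass (T1 >= T2): points within T1 join the canopy,
    points within T2 are removed from the candidate list."""
    canopies = []
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)          # loosely bound to this canopy
            if d >= t2:
                still_candidates.append(p)  # may seed or join other canopies
        remaining = still_candidates
        canopies.append((center, members))
    return canopies

# Five unit vectors spread over ~69 degrees of arc:
pts = [(math.cos(a), math.sin(a)) for a in (0.0, 0.3, 0.6, 0.9, 1.2)]
print(len(canopy(pts, 0.2, 0.01)))  # 5 canopies: tiny T2 removes almost nothing
print(len(canopy(pts, 1.0, 1.0)))   # 1 canopy: T2 = 1 sweeps every point in
```

In a Mahout job the same blow-up happens in every mapper, and the reducer then has to re-cluster all of those canopy centers, which is what swamps it.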
