Posted to user@mahout.apache.org by chyen <ch...@stpi.narl.org.tw> on 2013/01/14 11:13:08 UTC
Mahout clustering question
Hello,
I use Mahout to do text clustering.
My hardware and software are listed below.
server:
CPU: Intel Xeon E5-2620 2GHz, RAM: 64GB
software:
Ubuntu 12.04.1 on VirtualBox, hadoop-1.0.4, mahout-0.7
I use the canopy algorithm to cluster 80000 text files,
but it runs for a very long time: it would need an estimated two or three weeks to finish.
I also found that CPU utilization stays just below 20%.
I found that someone else has had this problem as well:
http://mail-archives.apache.org/mod_mbox/mahout-user/201212.mbox/%3C7959565186420075099@unknownmsgid%3E#archives
but I still don't know how to speed it up.
Is there some parameter setting that I have missed,
or is the server not powerful enough to run this job?
Can someone point me in the right direction? Thanks a lot.
Fisher
RE: Mahout clustering question
Posted by chyen <ch...@stpi.narl.org.tw>.
Hi Jonas,
Thanks for your reply.
I have already tried some sample cases:
15000 items --> k-means runs fast (less than 30 mins)
15000 items --> canopy runs fast (around 80 mins)
80000 items --> k-means runs fast (around 80 mins)
80000 items --> canopy runs slowly (estimated 2-3 weeks)
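As a rough check on the timings above, the growth rate implied by each pair of runs can be estimated. This is only a minimal sketch: the function name `scaling_exponent` is made up for illustration, and reading "2-3 weeks" as 14 to 21 days of wall-clock time is an assumption.

```python
import math

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Return p such that run time grows like n**p between two measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

MINUTES_PER_DAY = 24 * 60

# Canopy: 15000 items in ~80 min, 80000 items in an estimated 2-3 weeks
# (assumed here to mean 14 to 21 days).
canopy_low = scaling_exponent(15000, 80, 80000, 14 * MINUTES_PER_DAY)
canopy_high = scaling_exponent(15000, 80, 80000, 21 * MINUTES_PER_DAY)

# k-means: 15000 items in under 30 min (30 used as an upper bound),
# 80000 items in ~80 min.
kmeans = scaling_exponent(15000, 30, 80000, 80)

print(f"canopy exponent: {canopy_low:.2f} to {canopy_high:.2f}")
print(f"k-means exponent (lower bound): {kmeans:.2f}")
```

With these numbers the canopy exponent comes out above 3, which is worse than quadratic growth. The k-means figure is only a lower bound, because 30 minutes is an upper bound on the smaller run.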
I want to use the canopy centroids as k-means input, so I need to run
canopy first and then run k-means.
PS: my real data is about 300000-400000 items or more.
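For reference, the canopy-then-k-means pipeline described above can be sketched in a few lines. This is a simplified, single-machine illustration and not Mahout's implementation: the function names, the Euclidean distance, the thresholds `t1` and `t2`, and the toy 2-D data are all assumptions chosen for the example.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(p[i] for p in points) / len(points) for i in range(len(points[0])))

def canopy_centers(points, t1, t2):
    """Simplified canopy pass (t1 > t2); returns one centroid per canopy."""
    assert t1 > t2
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed]
        still_remaining = []
        for p in remaining:
            d = distance(seed, p)
            if d < t1:
                members.append(p)          # loosely bound: counts toward this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays available
        remaining = still_remaining
        centers.append(mean(members))
    return centers

def kmeans(points, centers, iterations=10):
    """Plain Lloyd iterations starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Toy data: two well-separated 2-D blobs (hypothetical, for illustration only).
random.seed(0)
points = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)]
points += [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(50)]

seeds = canopy_centers(points, t1=4.0, t2=2.5)  # canopy centroids ...
final = kmeans(points, seeds)                   # ... used as k-means input
print(len(seeds), "canopies ->", final)
```

In this sketch every remaining point is compared against each new canopy seed. When `t2` is small relative to the spacing of the data, almost every point starts its own canopy and the pass approaches one distance computation per pair of points, which may be one reason a canopy run slows down sharply as the data grows.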
Below are some parameter settings that I think might be relevant to this
case.
If any information is missing, please tell me and I will provide it. Thanks a lot.
----------------------------------------------------------------------------
$ more $MAHOUT_HOME/bin/mahout
...
JAVA_HEAP_MAX=-Xmx24g
...
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.dir=$MAHOUT_LOG_DIR"
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.file=$MAHOUT_LOGFILE"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.min.split.size=512MB"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.child.java.opts=-Xmx16384m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.child.java.opts=-Xmx16384m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.output.compress=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.compress.map.output=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.factor=30"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.mb=2048"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.file.buffer.size=32786"
...
----------------------------------------------------------------------------
$ more $HADOOP_HOME/conf/mapred-site.xml
...
<property>
<name>mapred.cluster.max.map.memory.mb</name>
<value>32768</value>
</property>
<property>
<name>mapred.cluster.max.reduce.memory.mb</name>
<value>32768</value>
</property>
<property>
<name>mapred.cluster.map.memory.mb</name>
<value>16384</value>
</property>
<property>
<name>mapred.cluster.reduce.memory.mb</name>
<value>16384</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx16384M</value>
</property>
...
----------------------------------------------------------------------------
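One possible reading of the low CPU figure, given `mapred.map.tasks=1` and `mapred.reduce.tasks=1` above: if the job effectively runs as a single thread, it can keep at most one CPU busy. The sketch below only does that arithmetic. The CPU counts are assumptions (the Xeon E5-2620 has 6 cores and 12 hardware threads, but the number of CPUs given to the VirtualBox guest is not stated in this thread), and `max_utilization` is a made-up helper name.

```python
def max_utilization(busy_threads, logical_cpus):
    """Upper bound on overall CPU utilization (percent) when only busy_threads can run."""
    return 100.0 * min(busy_threads, logical_cpus) / logical_cpus

# Assumed CPU counts visible to the OS; the actual VirtualBox allocation is unknown.
for cpus in (6, 12):
    print(f"{cpus} CPUs, 1 busy thread: at most {max_utilization(1, cpus):.1f}%")
```

If this reading applies, a utilization just under 20% would be consistent with one busy thread rather than with an underpowered server, but that would need to be confirmed on the actual machine.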
Re: Mahout clustering question
Posted by Jonas Grote <jf...@gmail.com>.
Hello,
Why don't you try with a sample of your data first?
Also, you should tell us more about your setup and how you executed the
program. Without that, we can hardly guess whether something is wrong.
Regards,
Jonas