Posted to user@mahout.apache.org by chyen <ch...@stpi.narl.org.tw> on 2013/01/14 11:13:08 UTC

Mahout clustering question

Hello,

I am using Mahout for text clustering. My hardware and software are
listed below.

Server:
CPU: Intel Xeon E5-2620 2 GHz, RAM: 64 GB

Software:
Ubuntu 12.04.1 on VirtualBox, hadoop-1.0.4, mahout-0.7

I am using the canopy algorithm to cluster 80000 text files, but it runs
for a very long time: it would need two or three weeks to finish.
However, CPU utilization stays below 20%.

I found that someone else has reported this problem:

http://mail-archives.apache.org/mod_mbox/mahout-user/201212.mbox/%3C7959565186420075099@unknownmsgid%3E#archives

but I still don't know how to speed it up. Am I missing some parameter
setting, or is the server not powerful enough to run this job?

Could someone point me in the right direction? Thanks a lot.

Fisher

RE: Mahout clustering question

Posted by chyen <ch...@stpi.narl.org.tw>.

Hi Jonas,

Thanks for your reply.
I have already tried some sample cases:
15000 items --> k-means runs fast (less than 30 mins)
15000 items --> canopy runs fast (around 80 mins)
80000 items --> k-means runs fast (around 80 mins)
80000 items --> canopy runs slowly (estimated 2-3 weeks)

I want to use the canopy centroids as k-means input, so I need to run
canopy first and then run k-means.
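
For illustration only, the canopy-then-k-means pipeline described above
can be sketched in plain Python. This is not Mahout's implementation:
the function names (canopy_centers, kmeans), the Euclidean distance, the
thresholds t1 and t2, and the toy data are all assumptions made for this
example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(members):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Single-pass canopy generation; returns one centroid per canopy.

    t1 is the loose threshold (canopy membership), t2 the tight one
    (points closer than t2 can no longer start a canopy); t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # p belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # p may still seed a canopy
        centers.append(centroid(members))
        remaining = still_remaining
    return centers


def kmeans(points, initial_centers, max_iter=10, distance=euclidean):
    """Plain Lloyd iterations seeded with the given centers."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        new_centers = [centroid(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    seeds = canopy_centers(data, t1=5.0, t2=3.0)
    print(len(seeds), kmeans(data, seeds))
```

In this sketch the number of canopies depends on t2: every point is
compared with each canopy center that is created before the point is
removed, so a very small t2 yields almost one canopy per point and the
work grows roughly quadratically with the number of points.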

PS: my real data is about 300000-400000 items or more.

Below are some parameter settings that I think might be related to this
case.
If any information is missing, please tell me and I will provide it.
Thanks a lot.
----------------------------------------------------------------------------
$ more $MAHOUT_HOME/bin/mahout
...
JAVA_HEAP_MAX=-Xmx24g
...
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.dir=$MAHOUT_LOG_DIR"
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.file=$MAHOUT_LOGFILE"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.min.split.size=512MB"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.child.java.opts=-Xmx16384m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.child.java.opts=-Xmx16384m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.output.compress=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.compress.map.output=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.factor=30"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.mb=2048"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.file.buffer.size=32786"
...
----------------------------------------------------------------------------
$ more $HADOOP_HOME/conf/mapred-site.xml
...
	<property>
		<name>mapred.cluster.max.map.memory.mb</name>
		<value>32768</value>
	</property>
	<property>
		<name>mapred.cluster.max.reduce.memory.mb</name>
		<value>32768</value>
	</property>
	<property>
		<name>mapred.cluster.map.memory.mb</name>
		<value>16384</value>
	</property>
	<property>
		<name>mapred.cluster.reduce.memory.mb</name>
		<value>16384</value>
	</property>
	<property>
		<name>mapred.child.java.opts</name>
		<value>-Xmx16384M</value>
	</property>	
...
----------------------------------------------------------------------------





>Hello,
>
>why don't you try with a sample of your data first?
>
>Also, you should tell more about your setup and how you executed the
>program. Without that, we can hardly guess if something is wrong.
>
>Regards,
>
>Jonas



Re: Mahout clustering question

Posted by Jonas Grote <jf...@gmail.com>.

Hello,

Why don't you try with a sample of your data first?

Also, you should tell us more about your setup and how you executed the
program. Without that, we can hardly guess whether something is wrong.

Regards,

Jonas

