Posted to user@mahout.apache.org by Valerio <va...@gmail.com> on 2010/08/27 21:06:52 UTC

mahout guide or tutorial or how to for test and run kmean on hadoop

hi all,

I need a guide that explains how to use Mahout with the k-means algorithm and,
first of all, what type of dataset does Mahout use?
I'm doing my thesis and must run k-means clustering in Weka, but Weka must
call Hadoop in the background to parallelize the job. I discovered that Mahout
runs k-means on Hadoop, so I will call it from Weka, but I don't understand what
type of files Mahout's k-means reads as input and how it works.

can someone help me?

Thanks all,
Valerio Ceraudo


Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Valerio <va...@gmail.com>.
Thanks very much!!
I will take a look at it, because I think it will be very useful for my purpose!


Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Grant Ingersoll <gs...@apache.org>.
FWIW, there is a basic ARFF converter in Mahout utils.  It may need some attention, but I believe it can handle basic conversion of ARFF to Mahout's vectors.
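To make the idea concrete, here is a rough sketch of what an ARFF-to-vector conversion does, in plain Python. This is illustration only; the actual converter in Mahout utils handles far more of the ARFF format (nominal attributes, sparse instances, quoting) and emits Mahout vectors rather than Python lists.

```python
# Minimal sketch of ARFF-to-vector conversion (illustration only; this is
# not Mahout's converter and only handles numeric attributes).

def parse_arff(text):
    """Parse numeric attributes from a tiny ARFF file into dense vectors."""
    attributes, vectors = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            attributes.append(line.split()[1])
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            vectors.append([float(v) for v in line.split(',')])
    return attributes, vectors

sample = """@relation demo
@attribute x numeric
@attribute y numeric
@data
1.0,2.0
3.5,4.5
"""
attrs, vecs = parse_arff(sample)
```

Each row of the `@data` section becomes one dense vector, which is essentially what needs to end up inside a Mahout VectorWritable.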

On Aug 27, 2010, at 3:06 PM, Valerio wrote:

> hi all,
> 
> I need a guide that explains how to use Mahout with the k-means algorithm and,
> first of all, what type of dataset does Mahout use?
> I'm doing my thesis and must run k-means clustering in Weka, but Weka must
> call Hadoop in the background to parallelize the job. I discovered that Mahout
> runs k-means on Hadoop, so I will call it from Weka, but I don't understand what
> type of files Mahout's k-means reads as input and how it works.
> 
> can someone help me?
> 
> Thanks all,
> Valerio Ceraudo
> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8


Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Valerio <va...@gmail.com>.
Thanks for your explanation about VectorWritable.
Actually I can modify the Weka folder to call the right clustering, and I tried
Hadoop with the wordcount example and it works.
I read about VectorWritable on Mahout's quickstart page,
but I didn't understand how to create them, so I tried this:

I took the file all-people-strings.lc.txt inside the Reuters dataset, which is
just a list of names, and did some new tests.
First I ran this command:

bin/mahout seqdirectory --input
/home/vuvvo/Scaricati/reuters21578/all-people-strings.lc.txt --output
/home/vuvvo/Scaricati/reuters21578/prova/prova2 --charset UTF-8

this command produced a new file called chunk-0, but I can't open it to see
what is inside, so after this command I ran another one:

bin/mahout seq2sparse --input /home/vuvvo/Scaricati/reuters21578/prova/prova2/
--norm 2 --weight TF --output /home/vuvvo/Scaricati/reuters21578/prova/prova2/
--minDF 5 --maxDFPercent 90

which produced a new folder with some sub-directories:
tokenized-documents, vectors, wordcount, dictionary.file-0
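(What seq2sparse computes can be illustrated with a tiny Python sketch: tokenize each document, build a term dictionary, and emit one term-frequency vector per document. This is only a conceptual model of the job's output; the real job also applies the --norm, --minDF, and --maxDFPercent settings and writes Hadoop sequence files, not Python lists.)

```python
# Conceptual model of seq2sparse output: a dictionary mapping each term to
# an index, plus one term-frequency (TF) vector per document.
# Illustration only; not Mahout code.

def build_tf_vectors(docs):
    dictionary = {}
    for doc in docs:
        for term in doc.lower().split():
            dictionary.setdefault(term, len(dictionary))
    vectors = []
    for doc in docs:
        vec = [0] * len(dictionary)
        for term in doc.lower().split():
            vec[dictionary[term]] += 1   # raw TF weight for this term
        vectors.append(vec)
    return dictionary, vectors

docs = ["hadoop runs the job", "mahout runs kmeans on hadoop"]
dictionary, vectors = build_tf_vectors(docs)
```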

now I try to cluster the part-00000 file inside the vectors folder with this command:

bin/mahout kmeans --input
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/part-00000 --k 3 
--output /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans 
--clusters 
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans/clusters


but I received this message:


no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
28-ago-2010 4.21.44 org.apache.hadoop.util.NativeCodeLoader <clinit>
AVVERTENZA: Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
28-ago-2010 4.21.44 org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
28-ago-2010 4.21.44 org.slf4j.impl.JCLLoggerAdapter error
GRAVE: MahoutDriver failed with args: [--input,
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/part-00000, --k, 3,
--output, /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans,
--clusters,
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans/clusters,
null]
Index: 0, Size: 0
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.rangeCheck(ArrayList.java:571)
	at java.util.ArrayList.get(ArrayList.java:349)
	at
org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(
RandomSeedGenerator.java:113)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:164)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.
invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.
java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.
java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)

Where am I wrong? I started Hadoop but I don't understand where the error is =(

If I find a solution, maybe I can finish my thesis ^^

Then I must understand how to transform a Weka .arff file
into a VectorWritable.
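(For reference, the computation the kmeans job performs can be sketched in plain Python. This is only a serial illustration of one Lloyd iteration; Mahout's KMeansDriver distributes the same assign-and-recompute step over Hadoop map/reduce tasks.)

```python
import math

def kmeans_step(points, centroids):
    """One k-means (Lloyd) iteration: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its points."""
    clusters = [[] for _ in centroids]
    for p in points:
        best = min(range(len(centroids)),
                   key=lambda i: math.dist(p, centroids[i]))
        clusters[best].append(p)
    return [
        # mean of each coordinate; keep the old centroid if a cluster is empty
        [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else c
        for cluster, c in zip(clusters, centroids)
    ]

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
step1 = kmeans_step(points, [(0.0, 0.0), (10.0, 10.0)])
```

Iterating `kmeans_step` until the centroids stop moving is the whole algorithm; `--k 3` in the command above just asks Mahout to pick three random seed centroids first.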







Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hi Valerio,

All the Mahout clustering implementations operate over Hadoop sequence 
files of Mahout type VectorWritable. These entities allow you to 
represent dense or sparse numeric information which may be further 
annotated by NamedVector wrappers to encode vector names in the data 
set. If you can run Hadoop jobs or call Java from weka then you may be 
able to use our code directly. Look at the driver class under each 
algorithm for entry points. If all else fails we also have a command 
line interface.

All the clustering jobs accept VectorWritable input files and produce 
Hadoop directories (clusters-i) containing the Clusters produced by the 
particular clustering iteration(s) plus an optional directory 
(clusteredPoints) containing sequence files of clustered points which 
are keyed by the clusterId and contain WeightedVectorWritable wrappers 
around the original input vector. These wrappers encode the pdf of the 
cluster assignment.
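The vector types described above can be modeled conceptually like this (plain Python; this is not Mahout's actual DenseVector/RandomAccessSparseVector/NamedVector API, just a sketch of the data shapes):

```python
# Conceptual sketch of Mahout-style vectors: a dense vector stores every
# cell, a sparse one stores only the non-zero cells, and a "named" vector
# just pairs a label with either kind. Illustration only, not Mahout's API.

def dense(values):
    return list(values)

def sparse(cardinality, nonzeros):
    """nonzeros: {index: value}; unspecified cells are implicitly 0.0."""
    return {'cardinality': cardinality, 'cells': dict(nonzeros)}

def sparse_get(vec, i):
    return vec['cells'].get(i, 0.0)

def dot(dense_vec, sparse_vec):
    # Iterating only the non-zero cells is the point of sparse storage.
    return sum(dense_vec[i] * v for i, v in sparse_vec['cells'].items())

d = dense([1.0, 2.0, 0.0, 4.0])
s = sparse(4, {0: 2.0, 3: 0.5})
named = ('doc-42', s)   # NamedVector analogue: a label plus a vector
```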

Hope this helps,
Jeff

On 8/27/10 12:06 PM, Valerio wrote:
> hi all,
>
> I need a guide that explains how to use Mahout with the k-means algorithm and,
> first of all, what type of dataset does Mahout use?
> I'm doing my thesis and must run k-means clustering in Weka, but Weka must
> call Hadoop in the background to parallelize the job. I discovered that Mahout
> runs k-means on Hadoop, so I will call it from Weka, but I don't understand what
> type of files Mahout's k-means reads as input and how it works.
>
> can someone help me?
>
> Thanks all,
> Valerio Ceraudo
>
>


Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 29, 2010, at 9:39 AM, Valerio wrote:

> Thanks for the help,
> I will modify the kmeans code to work with Weka's .arff files directly.
> About the bugs, it is just a feeling, also because I found an old topic talking
> about it.
> This afternoon I will try to clean and reinstall Mahout and do some
> tests on the Reuters .sgm files.
> I will let you know whether it works or I hit the
> same problems.
> 
> I have just one more question for you:
> 
> Mahout works on Hadoop, and I have already installed Hadoop, so does Mahout
> use an independent Hadoop inside it, or do I need to attach Mahout to my Hadoop?

You give your Hadoop cluster the Mahout JOB jar and then go from there.

Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Valerio <va...@gmail.com>.
Thanks for the help,
I will modify the kmeans code to work with Weka's .arff files directly.
About the bugs, it is just a feeling, also because I found an old topic talking
about it.
This afternoon I will try to clean and reinstall Mahout and do some
tests on the Reuters .sgm files.
I will let you know whether it works or I hit the
same problems.

I have just one more question for you:

Mahout works on Hadoop, and I have already installed Hadoop, so does Mahout use
an independent Hadoop inside it, or do I need to attach Mahout to my Hadoop?





Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It has been reported recently that some of our jobs fail quietly
and/or in unexpected ways when inputs are not correct. If you can
duplicate this behavior, please submit a JIRA and we will look into it.
The 0.4 release is coming up, maybe next month, so please help us improve
the user experience. To get a batch of correct files to inspect, try
running examples/bin/build-reuters.sh.


On 8/28/10 9:33 AM, Valerio wrote:
>> Jeff Eastman <jdog <at> windwardsolutions.com> writes:
>>   Try naming the input *directory* not the particular input file.
> I tried, but the result was the same.
> But I made a discovery about a bug in Mahout.
>
> When I try to convert a text file into a sequence file with the command line:
>
> bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8
>
> and then into sparse vectors with:
>
> bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles/ --norm 2 --weight
> TF --output <PATH>/content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90
>
> if the original file isn't correct, or the path is incorrect,
> Mahout creates a fake chunk-0 that is not useful for seq2sparse, and the second
> command creates other useless things because the files are empty; you can see
> this because the file part-00000 in the vectors folder is around 90 bytes.
>
> I think this was an old answer of yours to a problem similar to mine ^^
>
> Do you have a link or a site where I can download a correct text file to use as
> a dataset? Then I can try to convert it into sequence files and then into
> vectors to see what Mahout kmeans produces.
>
> Thanks in advance!
>
>
>
>
>
>


Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Valerio <va...@gmail.com>.
> Jeff Eastman <jdog <at> windwardsolutions.com> writes:
>  Try naming the input *directory* not the particular input file.

I tried, but the result was the same.
But I made a discovery about a bug in Mahout.

When I try to convert a text file into a sequence file with the command line:

bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8

and then into sparse vectors with:

bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles/ --norm 2 --weight
TF --output <PATH>/content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90

if the original file isn't correct, or the path is incorrect,
Mahout creates a fake chunk-0 that is not useful for seq2sparse, and the second
command creates other useless things because the files are empty; you can see
this because the file part-00000 in the vectors folder is around 90 bytes.

I think this was an old answer of yours to a problem similar to mine ^^

Do you have a link or a site where I can download a correct text file to use as
a dataset? Then I can try to convert it into sequence files and then into
vectors to see what Mahout kmeans produces.

Thanks in advance!






Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Try naming the input *directory* not the particular input file.
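(For what it's worth, the IndexOutOfBoundsException above is consistent with RandomSeedGenerator finding zero input vectors to sample seeds from, which is what happens when the --input path points at the wrong place. A rough Python analogue, assuming that cause; the real class is Java inside Mahout:)

```python
import random

def build_random_seeds(vectors, k):
    """Rough analogue of picking k initial cluster seeds from the input.
    With an empty input there is nothing to index, which is what the
    "Index: 0, Size: 0" error suggests happened in the real job."""
    if len(vectors) < k:
        raise ValueError(f"need at least {k} input vectors, got {len(vectors)}")
    return random.sample(vectors, k)

try:
    build_random_seeds([], 3)   # empty input, as with a wrong --input path
except ValueError as e:
    msg = str(e)
```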

On 8/27/10 7:51 PM, Valerio wrote:
>
> Thanks, but I need more information about the command to convert text into a
> VectorWritable, and then how to turn this file into the right input for
> kmeans.
> I made some attempts and now I have this result:
> Thanks for your explanation about VectorWritable.
> Actually I can modify the Weka folder to call the right clustering, and I tried
> Hadoop with the wordcount example and it works.
> I read about VectorWritable on Mahout's quickstart page,
> but I didn't understand how to create them, so I tried this:
>
> bin/mahout kmeans --input
> /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/part-00000 --k 3
> --output /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans
> --clusters
> /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans/clusters
>
>
> but I received this message:
>
>
> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
> 28-ago-2010 4.21.44 org.apache.hadoop.util.NativeCodeLoader<clinit>
> AVVERTENZA: Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> 28-ago-2010 4.21.44 org.apache.hadoop.io.compress.CodecPool getCompressor
> INFO: Got brand-new compressor
> 28-ago-2010 4.21.44 org.slf4j.impl.JCLLoggerAdapter error
> GRAVE: MahoutDriver failed with args: [--input,
> /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/part-00000, --k, 3,
> --output, /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans,
> --clusters,
> /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans/clusters,
> null]
> Index: 0, Size: 0
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size:
> 0
> 	at java.util.ArrayList.rangeCheck(ArrayList.java:571)
> 	at java.util.ArrayList.get(ArrayList.java:349)
> 	at
> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(
> RandomSeedGenerator.java:113)
> 	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:164)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.
> invoke(NativeMethodAccessorImpl.java:57)
> 	at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.
> java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:616)
> 	at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.
> java:68)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> 	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
>
> Where am I wrong?
>
>
>
>
>
>
>
>
>
>


Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Valerio <va...@gmail.com>.

Thanks, but I need more information about the command to convert text into a
VectorWritable, and then how to turn this file into the right input for
kmeans.
I made some attempts and now I have this result:
Thanks for your explanation about VectorWritable.
Actually I can modify the Weka folder to call the right clustering, and I tried
Hadoop with the wordcount example and it works.
I read about VectorWritable on Mahout's quickstart page,
but I didn't understand how to create them, so I tried this:

bin/mahout kmeans --input
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/part-00000 --k 3 
--output /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans
--clusters
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans/clusters


but I received this message:


no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
28-ago-2010 4.21.44 org.apache.hadoop.util.NativeCodeLoader <clinit>
AVVERTENZA: Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
28-ago-2010 4.21.44 org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
28-ago-2010 4.21.44 org.slf4j.impl.JCLLoggerAdapter error
GRAVE: MahoutDriver failed with args: [--input,
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/part-00000, --k, 3,
--output, /home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans,
--clusters,
/home/vuvvo/Scaricati/reuters21578/prova/prova2/vectors/output-kmeans/clusters,
null]
Index: 0, Size: 0
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size:
0
	at java.util.ArrayList.rangeCheck(ArrayList.java:571)
	at java.util.ArrayList.get(ArrayList.java:349)
	at
org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(
RandomSeedGenerator.java:113)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:164)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.
invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.
java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.
java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)

Where am I wrong?










Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Posted by Ted Dunning <te...@gmail.com>.
I don't know much about Weka lately, but I don't know of any support for
calling Mahout clustering algorithms from Weka. Typically people run Mahout
clustering from the command line.

On Fri, Aug 27, 2010 at 1:06 PM, Valerio <va...@gmail.com> wrote:

> hi all,
>
> I need some guides that explain how to use mahout with the kmeans algorithm
> and
> first of all,what type of dataset mahout uses?
> I'm doing my thesis and I must run a k means clustering on weka,but weka
> must
> call hadoop in background to parallelize the job. I discovered that mahout
> run
> the kmeans on hadoop so i will call it from weka,but I don't understand
> what
> type of files the kmeans of mahout read as input and how it works.
>
> can someone help me?
>
> Thanks all,
> Valerio Ceraudo
>
>