Posted to user@mahout.apache.org by da...@ontrenet.com on 2011/03/07 17:57:03 UTC

Complete canopy example?

Hi,
  I have a directory of text documents that I want to run canopy clustering
on (Mahout 0.4, standalone, no Hadoop).
I'm having some difficulty doing this. Is there a complete example that
shows every step?

Here is what I do:

Step 1$ ./bin/mahout seqdirectory -i INPUT_FILES/ -o FEED_SEQ -c UTF-8 -chunk 5

# My INPUT_FILES directory contains 1000 text files, yet the output FEED_SEQ
contains only 1 tiny chunk with a file in it. Is that right?

Step 2$ ./bin/mahout seq2sparse -i FEED_SEQ -o FEED_VEC --maxNGramSize 3

# This generates a bit of output, with no errors.

Step 3$ ./bin/mahout canopy -i FEED_VEC -o FEED_CENTS -t1 1500 -t2 2000

Exception in thread "main" java.io.FileNotFoundException: File
file:/home/darren/Downloads/mahout-distribution-0.4/FEED_VEC/tokenized-documents/data
does not exist.

----

The Step 1 output looks suspicious to me:

$ ./bin/mahout seqdirectory -i INPUT_FILES/ -o FEED_SEQ  -c UTF-8 -chunk 5
no HADOOP_HOME set, running locally
Mar 7, 2011 11:57:14 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 847 ms

----

Darren