Posted to user@mahout.apache.org by Christopher Laux <ct...@gmail.com> on 2012/11/18 17:37:19 UTC

Conversion of point numbers to key strings

Hi all,

I can read mahout's output in "clusteredPoints" but that only provides
point numbers. When I input the data to a sequence file I used strings as
keys. Is there any way of recovering the key strings from the point
numbers? Or do I have to keep track of that myself?

Thanks,
Chris

Re: Conversion of point numbers to key strings

Posted by Christopher Laux <ct...@gmail.com>.
>
> Christopher, can you provide details on:
> 1. What version you are running?  Is this 0.7 or build from source?
>

This happens both with the current trunk and 0.7 distro.


> 2. Can you look at the script and turn on verbose logging in Java?
>

Couldn't find out how to do that :(

Thanks,
Chris

>
> On Mon, Nov 19, 2012 at 9:12 AM, Christopher Laux <ct...@gmail.com> wrote:
>
>> Caused by: java.lang.NoSuchFieldError: LUCENE_36
>>    at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
>>    ... 11 more
>>
>> Any idea what causes this?
>>

>
> I don't believe we have updated to 4 yet, unless I missed something.

Re: Conversion of point numbers to key strings

Posted by Grant Ingersoll <gs...@apache.org>.


On Nov 19, 2012, at 12:16 PM, Ted Dunning wrote:

> This looks like it may be an artifact of switching to Lucene 4.0.
> 
> Grant?

I don't believe we have updated to 4 yet, unless I missed something.

Christopher, can you provide details on:
1. What version you are running?  Is this 0.7 or build from source?
2. Can you look at the script and turn on verbose logging in Java?

> 
> On Mon, Nov 19, 2012 at 9:12 AM, Christopher Laux <ct...@gmail.com> wrote:
> 
>> Caused by: java.lang.NoSuchFieldError: LUCENE_36
>>    at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
>>    ... 11 more
>> 
>> Any idea what causes this?
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidworks.com





Re: Conversion of point numbers to key strings

Posted by Ted Dunning <te...@gmail.com>.
This looks like it may be an artifact of switching to Lucene 4.0.

Grant?

On Mon, Nov 19, 2012 at 9:12 AM, Christopher Laux <ct...@gmail.com> wrote:

> Caused by: java.lang.NoSuchFieldError: LUCENE_36
>     at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
>     ... 11 more
>
> Any idea what causes this?
>

Re: Conversion of point numbers to key strings

Posted by Christopher Laux <ct...@gmail.com>.
Thanks for the hint. Now I get this exception:

$ mahout seq2sparse -i ~/run/posts2.seq -o ~/run/posts2-vec -seq -nv

Nov 19, 2012 6:09:22 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.IllegalStateException: java.lang.reflect.InvocationTargetException
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:70)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:28)
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:58)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:62)
    ... 6 more
Caused by: java.lang.NoSuchFieldError: LUCENE_36
    at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
    ... 11 more

Any idea what causes this?

Thanks,
Chris
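
A NoSuchFieldError generally means that a class was compiled against one version of a library and is running against a different one. One thing worth checking is whether more than one Lucene version ends up on the classpath. The following is only a minimal sketch for that check: it assumes the jars follow the usual lucene-<module>-<version>.jar naming, the function names are made up for illustration, and the paths in the usage note are hypothetical.

```python
import os
import re

# Assumption: Lucene jars are named like "lucene-core-3.6.0.jar".
# Group 1 is the module name, group 2 is the version string.
LUCENE_JAR = re.compile(r"^(lucene-[a-z\-]+?)-(\d[\w.\-]*)\.jar$")


def lucene_versions(classpath, sep=os.pathsep):
    """Return {module: set of versions} for Lucene jars found in a classpath string."""
    found = {}
    for entry in classpath.split(sep):
        # Keep only the file name, whatever the path separator style.
        name = entry.replace("\\", "/").rsplit("/", 1)[-1]
        match = LUCENE_JAR.match(name)
        if match:
            found.setdefault(match.group(1), set()).add(match.group(2))
    return found


def conflicting_lucene_jars(classpath, sep=os.pathsep):
    """Return only the modules that appear in more than one version."""
    versions = lucene_versions(classpath, sep)
    return {module: sorted(v) for module, v in versions.items() if len(v) > 1}
```

For example, with a hypothetical classpath string such as "/a/lucene-core-3.6.0.jar:/b/lucene-core-2.9.4.jar", calling conflicting_lucene_jars(cp, sep=":") would report lucene-core with both versions. An empty result does not rule out a version mismatch; it only means no duplicate Lucene jars were named on that classpath.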


On Sun, Nov 18, 2012 at 10:11 PM, DAN HELM <da...@verizon.net> wrote:

> Chris,
>
> I assume you ran the kmeans algorithm?
>
> I believe the clusteredPoints file should prefix the document vectors with
> the text version of the processed documents (assuming seq2sparse was run
> with named vector (-nv) option),
> as shown in "Cluster documents using kmeans", step 3. here:
>
> https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html
>
> But for the cluster id part (the Key), I believe one does have to map that
> numeric key with the corresponding ids from main cluster results (i.e., in
> "clusters-<n>-final" results).
> As I recall the corresponding keys in the "final" folder will be CL-<id>
> or VL-<id>, specifying the state of the final cluster (converged or not):
> http://lucene.472066.n3.nabble.com/retrieve-k-means-result-td1386091.html
> I believe you just need to parse the ids from the clusteredPoints output
> (the Key) and map them to the number following "CL-" or "VL-" in the
> "final" output to identify the corresponding clusters.
>
> Dan
>

Re: Conversion of point numbers to key strings

Posted by DAN HELM <da...@verizon.net>.
Chris,
 
I assume you ran the kmeans algorithm?
 
I believe the clusteredPoints file should prefix the document vectors with the text version of the processed documents (assuming seq2sparse was run with the named vector (-nv) option),
as shown in "Cluster documents using kmeans", step 3, here:
https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html

But for the cluster ID part (the Key), I believe one does have to map that numeric key to the corresponding IDs from the main cluster results (i.e., in the "clusters-<n>-final" results).

As I recall, the corresponding keys in the "final" folder will be CL-<id> or VL-<id>, specifying the state of the final cluster (converged or not):
http://lucene.472066.n3.nabble.com/retrieve-k-means-result-td1386091.html

I believe you just need to parse the IDs from the clusteredPoints output (the Key) and map them to the number following "CL-" or "VL-" in the "final" output to identify the corresponding clusters.
 
Dan  
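
The mapping described above can be sketched roughly as follows. This is only an illustration, not Mahout code: it assumes the keys from the "final" output and the (cluster id, point name) pairs from clusteredPoints have already been extracted into plain Python values (reading the sequence files is not shown), and the function names are hypothetical.

```python
import re

# Matches cluster keys such as "CL-12" or "VL-12". Per the thread, the
# prefix reflects the state of the final cluster (converged or not).
CLUSTER_KEY = re.compile(r"^(CL|VL)-(\d+)$")


def parse_cluster_key(key):
    """Return (prefix, numeric_id) for a key like 'VL-12', or None if it does not match."""
    match = CLUSTER_KEY.match(key.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def group_points_by_cluster(final_keys, assignments):
    """Group point names under the matching key from the "final" output.

    final_keys:  iterable of strings such as "VL-12"
    assignments: iterable of (numeric_cluster_id, point_name) pairs,
                 assumed to be already extracted from clusteredPoints
    """
    key_by_id = {}
    for key in final_keys:
        parsed = parse_cluster_key(key)
        if parsed is not None:
            key_by_id[parsed[1]] = key

    grouped = {key: [] for key in key_by_id.values()}
    for cluster_id, point_name in assignments:
        key = key_by_id.get(int(cluster_id))
        if key is not None:
            grouped[key].append(point_name)
    return grouped
```

Assignments whose cluster id has no matching "CL-" or "VL-" key are skipped here; whether to skip or report them is a design choice left open in this sketch.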
