Posted to user@mahout.apache.org by Christopher Laux <ct...@gmail.com> on 2012/11/18 17:37:19 UTC
Conversion of point numbers to key strings
Hi all,
I can read mahout's output in "clusteredPoints" but that only provides
point numbers. When I input the data to a sequence file I used strings as
keys. Is there any way of recovering the key strings from the point
numbers? Or do I have to keep track of that myself?
Thanks,
Chris
Re: Conversion of point numbers to key strings
Posted by Christopher Laux <ct...@gmail.com>.
>
> Christopher, can you provide details on:
> 1. What version you are running? Is this 0.7 or build from source?
>
This happens both with the current trunk and 0.7 distro.
> 2. Can you look at the script and turn on verbose logging in Java?
>
Couldn't find out how to do that :(
Thanks,
Chris
>
> On Mon, Nov 19, 2012 at 9:12 AM, Christopher Laux <ct...@gmail.com> wrote:
>
>> Caused by: java.lang.NoSuchFieldError: LUCENE_36
>>     at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
>>     ... 11 more
>>
>> Any idea what causes this?
>>
>
> I don't believe we have updated to 4 yet, unless I missed something.
Re: Conversion of point numbers to key strings
Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 19, 2012, at 12:16 PM, Ted Dunning wrote:
> This looks like it may be an artifact of switching to Lucene 4.0.
>
> Grant?
I don't believe we have updated to 4 yet, unless I missed something.
Christopher, can you provide details on:
1. What version you are running? Is this 0.7 or build from source?
2. Can you look at the script and turn on verbose logging in Java?
>
> On Mon, Nov 19, 2012 at 9:12 AM, Christopher Laux <ct...@gmail.com> wrote:
>
>> Caused by: java.lang.NoSuchFieldError: LUCENE_36
>>     at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
>>     ... 11 more
>>
>> Any idea what causes this?
>>
--------------------------------------------
Grant Ingersoll
http://www.lucidworks.com
Re: Conversion of point numbers to key strings
Posted by Ted Dunning <te...@gmail.com>.
This looks like it may be an artifact of switching to Lucene 4.0.
Grant?
On Mon, Nov 19, 2012 at 9:12 AM, Christopher Laux <ct...@gmail.com> wrote:
> Caused by: java.lang.NoSuchFieldError: LUCENE_36
>     at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
>     ... 11 more
>
> Any idea what causes this?
>
Re: Conversion of point numbers to key strings
Posted by Christopher Laux <ct...@gmail.com>.
Thanks for the hint. Now I get this exception:
$ mahout seq2sparse -i ~/run/posts2.seq -o ~/run/posts2-vec -seq -nv
Nov 19, 2012 6:09:22 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.IllegalStateException: java.lang.reflect.InvocationTargetException
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:70)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:28)
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:58)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
    at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:62)
    ... 6 more
Caused by: java.lang.NoSuchFieldError: LUCENE_36
    at org.apache.mahout.vectorizer.DefaultAnalyzer.<init>(DefaultAnalyzer.java:34)
    ... 11 more
Any idea what causes this?
Thanks,
Chris
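
A NoSuchFieldError like the one in this trace is generally thrown when a class was compiled against one version of a library but a different version is picked up at run time, so it may be worth checking which Lucene jar actually ends up on the classpath. The following is only a minimal sketch of such a check and is not part of Mahout or Lucene: the class name WhichJar and its helper methods are made up for illustration, and it assumes that the LUCENE_36 constant is expected in org.apache.lucene.util.Version.

```java
import java.lang.reflect.Field;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Mahout or Lucene.
class WhichJar {

    // Returns the location (usually a jar) a class was loaded from,
    // or a placeholder when the JVM does not report one.
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typical for classes loaded by the bootstrap class loader.
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    // Returns true if the class exposes a public field with the given name.
    static boolean hasField(Class<?> cls, String fieldName) {
        try {
            Field field = cls.getField(fieldName);
            return field != null;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are assumptions based on the stack trace in this thread.
        String className = args.length > 0 ? args[0] : "org.apache.lucene.util.Version";
        String fieldName = args.length > 1 ? args[1] : "LUCENE_36";
        try {
            Class<?> cls = Class.forName(className);
            System.out.println(className + " loaded from " + locationOf(cls));
            System.out.println("has field " + fieldName + ": " + hasField(cls, fieldName));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }
}
```

Running this with the same classpath that the failing job uses should show whether an unexpected Lucene jar is being loaded ahead of the one Mahout was built against.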
On Sun, Nov 18, 2012 at 10:11 PM, DAN HELM <da...@verizon.net> wrote:
> Chris,
>
> I assume you ran the kmeans algorithm?
>
> I believe the clusteredPoints file should prefix the document vectors with
> the text version of the processed documents (assuming seq2sparse was run
> with the named vector (-nv) option),
> as shown in "Cluster documents using kmeans", step 3, here:
>
> https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html
>
> But for the cluster id part (the Key), I believe one does have to map that
> numeric key to the corresponding ids in the main cluster results (i.e., in
> the "clusters-<n>-final" results).
> As I recall, the corresponding keys in the "final" folder will be CL-<id>
> or VL-<id>, where the prefix specifies the state of the final cluster
> (converged or not):
> http://lucene.472066.n3.nabble.com/retrieve-k-means-result-td1386091.html
> I believe you just need to parse the ids from the clusteredPoints output
> (the Key) and map them to the number following "CL-" or "VL-" in the
> "final" output to identify the corresponding clusters.
>
> Dan
Re: Conversion of point numbers to key strings
Posted by DAN HELM <da...@verizon.net>.
Chris,
I assume you ran the kmeans algorithm?
I believe the clusteredPoints file should prefix the document vectors with the text version of the processed documents (assuming seq2sparse was run with the named vector (-nv) option), as shown in "Cluster documents using kmeans", step 3, here:
https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html
But for the cluster id part (the Key), I believe one does have to map that numeric key to the corresponding ids in the main cluster results (i.e., in the "clusters-<n>-final" results).
As I recall, the corresponding keys in the "final" folder will be CL-<id> or VL-<id>, where the prefix specifies the state of the final cluster (converged or not):
http://lucene.472066.n3.nabble.com/retrieve-k-means-result-td1386091.html
I believe you just need to parse the ids from the clusteredPoints output (the Key) and map them to the number following "CL-" or "VL-" in the "final" output to identify the corresponding clusters.
Dan
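
The mapping described above could be sketched roughly as follows. This is an illustration only, not Mahout code: the class ClusterKeyMapper, its methods, and the sample identifiers and point names are all invented, and it assumes the numeric keys from clusteredPoints and the "CL-<id>"/"VL-<id>" identifiers from the "final" folder have already been read into plain Java collections (for example from a dump of the sequence files).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Mahout.
class ClusterKeyMapper {

    // Matches identifiers of the form "CL-<id>" or "VL-<id>".
    private static final Pattern CLUSTER_ID = Pattern.compile("^(CL|VL)-(\\d+)$");

    // Extracts the numeric id that follows "CL-" or "VL-".
    static int parseClusterId(String identifier) {
        Matcher m = CLUSTER_ID.matcher(identifier.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected cluster identifier: " + identifier);
        }
        return Integer.parseInt(m.group(2));
    }

    // Builds a lookup from numeric cluster id to the full identifier.
    static Map<Integer, String> indexFinalClusters(List<String> identifiers) {
        Map<Integer, String> byId = new TreeMap<>();
        for (String identifier : identifiers) {
            byId.put(parseClusterId(identifier), identifier.trim());
        }
        return byId;
    }

    // Groups point names by the identifier of the cluster whose numeric id
    // matches the key the point carries in clusteredPoints.
    static Map<String, List<String>> groupPoints(Map<Integer, String> finalClusters,
                                                 Map<String, Integer> pointToClusterKey) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : pointToClusterKey.entrySet()) {
            String label = finalClusters.get(entry.getValue());
            if (label == null) {
                // No matching cluster in the "final" output; keep the raw key visible.
                label = "unknown-" + entry.getValue();
            }
            grouped.computeIfAbsent(label, k -> new ArrayList<>()).add(entry.getKey());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Invented sample data standing in for the real job output.
        Map<Integer, String> finalClusters = indexFinalClusters(Arrays.asList("VL-21", "CL-7"));
        Map<String, Integer> points = new LinkedHashMap<>();
        points.put("post-a", 21);
        points.put("post-b", 7);
        points.put("post-c", 21);
        System.out.println(groupPoints(finalClusters, points));
    }
}
```

The point names here stand in for the original key strings, which should be available from the named vectors if seq2sparse was run with -nv.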