Posted to user@mahout.apache.org by Camilo Lopez <ca...@camilolopez.com> on 2011/04/20 18:58:51 UTC
Custom analyzers for seq2sparse
Hi List,
When I try to run custom analyzer classes I always get an InstantiationException. At first I suspected my own code, but when I try what is supposed to be the default value, 'org.apache.lucene.analysis.standard.StandardAnalyzer', I still get the same exception.
This is the command:
bin/mahout seq2sparse -i /htmless_articles_seq -o /htmless_articles_vectors_1 -ng 3 -x35 -wt tfidf -a org.apache.lucene.analysis.standard.StandardAnalyzer -nv
Looking a little deeper (i.e. catching the InstantiationException and throwing its getCause()), it turns out the problem is caused by a NullPointerException:
Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:211)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Am I missing something? Is there another way to create/use custom analyzers in seq2sparse?
Re: Custom analyzers for seq2sparse
Posted by Camilo Lopez <ca...@camilolopez.com>.
OK, that did work for Mahout, thanks! But now Hadoop cannot load the class, even though
the jar containing it has been added to the Hadoop classpath:
hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ echo $HADOOP_CLASSPATH
/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-core-3.0.2.jar:/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-analyzers-3.0.2.jar:/home/hadoop/my_analyzer.jar
I get:
hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ bin/mahout seq2sparse -i /htmless_articles_seq -o /htmless_articles_vectors_2 -wt tfidf -a com.my.analyzers.MyAnalyzer
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/04/21 13:39:33 WARN driver.MahoutDriver: No seq2sparse.props found on classpath, will use command-line arguments only
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 3
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
11/04/21 13:39:33 INFO common.HadoopUtil: Deleting /htmless_articles_vectors_2
11/04/21 13:39:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/04/21 13:39:33 INFO input.FileInputFormat: Total input paths to process : 1
11/04/21 13:39:33 INFO mapred.JobClient: Running job: job_201104211109_0038
11/04/21 13:39:34 INFO mapred.JobClient: map 0% reduce 0%
11/04/21 13:39:43 INFO mapred.JobClient: Task Id : attempt_201104211109_0038_m_000000_0, Status : FAILED
java.lang.IllegalStateException: java.lang.ClassNotFoundException: com.my.analyzers.MyAnalyzer
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:61)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: com.my.analyzers.MyAnalyzer
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:57)
    ... 4 more
Is there anything I'm missing there?
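The trace above is raised inside the map task (org.apache.hadoop.mapred.Child), which runs in a different JVM from the shell where HADOOP_CLASSPATH was echoed, so a class can be visible to the client and still be missing in the task. One way to check what a given JVM can actually see is a small lookup helper like the sketch below. It uses only the standard library; the class name ClassLookupDemo and the method describe are made up for illustration and are not part of Mahout or Hadoop.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Mahout or Hadoop.
class ClassLookupDemo {

    // Returns a description of where the class was loaded from,
    // or null when the class is not visible to this JVM's class loader.
    static String describe(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // Classes from the Java runtime itself usually have no code source.
            return src == null ? "java runtime" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the log above; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "com.my.analyzers.MyAnalyzer";
        String where = describe(name);
        System.out.println(where == null
                ? name + " is NOT visible to this JVM"
                : name + " loaded from " + where);
    }
}
```

Running it with the same classpath as the client only tells you about the client side; the same check would have to run inside a task to say anything about the task JVM.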
On 2011-04-20, at 1:32 PM, Ian Helmke wrote:
> Yes, if you make a subclass of StandardAnalyzer or your own Analyzer
> that has a constructor with no arguments (which presumably calls a
> superclass constructor with the arguments you want), that should work
> nicely. (You could also just add a zero-argument constructor to your
> own custom analyzer.)
Re: Custom analyzers for seq2sparse
Posted by Ian Helmke <ih...@gmail.com>.
Yes, if you make a subclass of StandardAnalyzer or your own Analyzer
that has a constructor with no arguments (which presumably calls a
superclass constructor with the arguments you want), that should work
nicely. (You could also just add a zero-argument constructor to your
own custom analyzer.)
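As a rough sketch of that pattern in plain Java: BaseAnalyzer below is a hypothetical stand-in for a library analyzer whose only constructor takes an argument (it is not the Lucene class), MyAnalyzer is the wrapper with a nullary constructor, and the "3.0" argument is just a placeholder value. The create method instantiates by class name, which is assumed here to resemble what a tool does when it is handed a class name on the command line.

```java
// Hypothetical stand-in for a library analyzer that requires a constructor argument.
class BaseAnalyzer {
    private final String version;

    BaseAnalyzer(String version) {
        this.version = version;
    }

    String version() {
        return version;
    }
}

// Wrapper whose nullary constructor forwards the argument we want to the superclass.
class MyAnalyzer extends BaseAnalyzer {
    MyAnalyzer() {
        super("3.0"); // placeholder value chosen for this sketch
    }
}

class NullaryWrapperDemo {

    // Instantiates a BaseAnalyzer subclass by name, using its no-argument constructor.
    static BaseAnalyzer create(String className) {
        try {
            return (BaseAnalyzer) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Works: MyAnalyzer has a nullary constructor.
        System.out.println(create(MyAnalyzer.class.getName()).version());
        // Fails: BaseAnalyzer only has a one-argument constructor.
        try {
            create(BaseAnalyzer.class.getName());
        } catch (IllegalStateException e) {
            System.out.println("cannot instantiate BaseAnalyzer: " + e.getCause());
        }
    }
}
```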
On Wed, Apr 20, 2011 at 1:25 PM, Camilo Lopez <ca...@camilolopez.com> wrote:
> Ian,
>
> I'm using 3.0.x (the one that comes by default in Mahout's trunk now).
> By nullary constructor, do you mean I should overload the constructor to take
> no args in my own custom class?
Re: Custom analyzers for seq2sparse
Posted by Camilo Lopez <ca...@camilolopez.com>.
Ian,
I'm using 3.0.x (the one that comes by default in Mahout's trunk now).
By nullary constructor, do you mean I should overload the constructor to take
no args in my own custom class?
On 2011-04-20, at 1:23 PM, Ian Helmke wrote:
> What version of Lucene are you using? If you use Lucene 3.0 or later,
> you can't use StandardAnalyzer as-is because it has no no-args
> constructor. You could try the Mahout DefaultAnalyzer (which wraps the
> Lucene analyzer in a no-argument constructor). I have gotten custom
> analyzers to work, but they need to have a nullary constructor.
Re: Custom analyzers for seq2sparse
Posted by Ian Helmke <ih...@gmail.com>.
What version of Lucene are you using? If you use Lucene 3.0 or later,
you can't use StandardAnalyzer as-is because it has no no-args
constructor. You could try the Mahout DefaultAnalyzer (which wraps the
Lucene analyzer in a no-argument constructor). I have gotten custom
analyzers to work, but they need to have a nullary constructor.
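To make the nullary-constructor requirement concrete, here is a minimal sketch of reflective instantiation in plain Java. VersionedAnalyzer and NullaryAnalyzer are hypothetical stand-ins, not Lucene or Mahout classes. The sketch uses getDeclaredConstructor().newInstance(), which reports a missing no-args constructor as NoSuchMethodException; the older Class.newInstance() reports the same condition as InstantiationException, which matches the exception named in the original question.

```java
// Hypothetical stand-in for an analyzer whose only constructor takes an argument.
class VersionedAnalyzer {
    final String version;

    VersionedAnalyzer(String version) {
        this.version = version;
    }
}

// Hypothetical stand-in for an analyzer that has a nullary constructor.
class NullaryAnalyzer {
    NullaryAnalyzer() {
    }
}

class ReflectiveInstantiationDemo {

    // Returns true if the class can be created reflectively with no arguments.
    static boolean canInstantiate(Class<?> cls) {
        try {
            cls.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // Covers NoSuchMethodException, InstantiationException and related failures.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("VersionedAnalyzer: " + canInstantiate(VersionedAnalyzer.class));
        System.out.println("NullaryAnalyzer: " + canInstantiate(NullaryAnalyzer.class));
    }
}
```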