Posted to user@mahout.apache.org by Camilo Lopez <ca...@camilolopez.com> on 2011/04/20 18:58:51 UTC

Custom analyzers for seq2sparse

Hi List,

When I try to run custom analyzer classes, I always get an InstantiationException. At first I suspected my own code, but when I try what is supposed to be the default value, 'org.apache.lucene.analysis.standard.StandardAnalyzer', I still get the same exception.

This is the command:
 
bin/mahout seq2sparse  -i /htmless_articles_seq -o /htmless_articles_vectors_1 -ng 3 -x35 -wt tfidf -a org.apache.lucene.analysis.standard.StandardAnalyzer  -nv


Looking a little deeper (i.e. catching the InstantiationException and throwing getCause()), it turns out the problem is caused by a NullPointerException:

Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:211)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:52)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Am I missing something? Is there another way to create/use custom analyzers in seq2sparse?
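
One possible reading of the NullPointerException in the trace above, which nothing in this thread confirms: if the caught InstantiationException carried no cause, then getCause() returned null, and in Java throwing a null reference raises a NullPointerException at the throw statement itself. The sketch below reproduces only that language behaviour; the class name NullCauseDemo, the helper rethrowCause, and the message "SomeAnalyzer" are made up for illustration and are not Mahout code.

```java
public class NullCauseDemo {
    // Rethrows the cause of an exception, the way an ad-hoc debugging patch might.
    static void rethrowCause(Exception e) throws Throwable {
        // If getCause() returns null, this statement raises NullPointerException.
        throw e.getCause();
    }

    // Returns the simple name of whatever is actually thrown.
    static String demo() {
        // Constructed by hand with a message only, so its cause is null.
        Exception noCause = new InstantiationException("SomeAnalyzer");
        try {
            rethrowCause(noCause);
            return "no exception";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "NullPointerException"
    }
}
```

If that is what happened here, the NullPointerException would be a side effect of the debugging change rather than the underlying problem.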



Re: Custom analyzers for seq2sparse

Posted by Camilo Lopez <ca...@camilolopez.com>.
OK, that did work for Mahout, thanks! But now Hadoop cannot load the class, even though
the jar containing it has been added to the Hadoop classpath:

hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ echo $HADOOP_CLASSPATH 
/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-core-3.0.2.jar:/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-analyzers-3.0.2.jar:/home/hadoop/my_analyzer.jar


I get:

hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ bin/mahout seq2sparse  -i /htmless_articles_seq -o /htmless_articles_vectors_2 -wt tfidf -a com.my.analyzers.MyAnalyzer
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/04/21 13:39:33 WARN driver.MahoutDriver: No seq2sparse.props found on classpath, will use command-line arguments only
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 3
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
11/04/21 13:39:33 INFO common.HadoopUtil: Deleting /htmless_articles_vectors_2
11/04/21 13:39:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/04/21 13:39:33 INFO input.FileInputFormat: Total input paths to process : 1
11/04/21 13:39:33 INFO mapred.JobClient: Running job: job_201104211109_0038
11/04/21 13:39:34 INFO mapred.JobClient:  map 0% reduce 0%
11/04/21 13:39:43 INFO mapred.JobClient: Task Id : attempt_201104211109_0038_m_000000_0, Status : FAILED
java.lang.IllegalStateException: java.lang.ClassNotFoundException: com.my.analyzers.MyAnalyzer
        at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:61)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: com.my.analyzers.MyAnalyzer
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:57)
        ... 4 more


Is there anything I'm missing there?
 
On 2011-04-20, at 1:32 PM, Ian Helmke wrote:

> Yes, if you make a subclass of StandardAnalyzer or your own Analyzer
> that has a constructor with no arguments (presumably which calls a
> superclass constructor with the arguments you want), that should work
> nicely. (You could also just add a zero-argument constructor to your
> own custom analyzer.)


Re: Custom analyzers for seq2sparse

Posted by Ian Helmke <ih...@gmail.com>.
Yes, if you make a subclass of StandardAnalyzer or your own Analyzer
that has a constructor with no arguments (presumably which calls a
superclass constructor with the arguments you want), that should work
nicely. (You could also just add a zero-argument constructor to your
own custom analyzer.)
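
The suggestion above can be sketched roughly as follows. To keep the example runnable without Lucene on the classpath, VersionedAnalyzer is a hypothetical stand-in for a library analyzer whose only constructor takes an argument; it is not the real StandardAnalyzer. MyAnalyzer, instantiate, and the "3.0" argument are likewise assumptions made for illustration. With Lucene, the no-argument constructor would presumably pass the library's own version constant to the superclass instead.

```java
public class NullaryConstructorDemo {
    // Hypothetical stand-in for a library analyzer that requires a constructor argument.
    static class VersionedAnalyzer {
        private final String version;

        VersionedAnalyzer(String version) {
            this.version = version;
        }

        String describe() {
            return "analyzer for version " + version;
        }
    }

    // Subclass that adds a public no-argument constructor and supplies the argument itself.
    static class MyAnalyzer extends VersionedAnalyzer {
        public MyAnalyzer() {
            super("3.0");
        }
    }

    // Creates an instance from a class name alone, as a tool given only a class name might.
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object analyzer = instantiate(MyAnalyzer.class.getName());
        System.out.println(((VersionedAnalyzer) analyzer).describe());
        // prints "analyzer for version 3.0"
    }
}
```

The point of the design is that the caller only knows the class name, so every value the superclass needs has to be fixed inside the no-argument constructor.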

On Wed, Apr 20, 2011 at 1:25 PM, Camilo Lopez <ca...@camilolopez.com> wrote:
> Ian,
>
> I'm using 3.0.x (the one that comes by default in Mahout's trunk now).
> By nullary constructor, do you mean I should overload the constructor to receive
> no args in my own custom class?

Re: Custom analyzers for seq2sparse

Posted by Camilo Lopez <ca...@camilolopez.com>.
Ian,

I'm using 3.0.x (the one that comes by default in Mahout's trunk now).
By nullary constructor, do you mean I should overload the constructor to receive
no args in my own custom class?


On 2011-04-20, at 1:23 PM, Ian Helmke wrote:

> What version of Lucene are you using? If you use Lucene 3.0 or later,
> you can't use StandardAnalyzer as-is because it has no no-args
> constructor. You could try the Mahout DefaultAnalyzer (which wraps the
> Lucene analyzer in a no-argument constructor). I have gotten custom
> analyzers to work, but they need to have a nullary constructor.


Re: Custom analyzers for seq2sparse

Posted by Ian Helmke <ih...@gmail.com>.
What version of Lucene are you using? If you use Lucene 3.0 or later,
you can't use StandardAnalyzer as-is because it has no no-args
constructor. You could try the Mahout DefaultAnalyzer (which wraps the
Lucene analyzer in a no-argument constructor). I have gotten custom
analyzers to work, but they need to have a nullary constructor.
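
To illustrate why a class without a no-args constructor fails here, the following minimal sketch assumes the analyzer is created reflectively from its class name with Class.newInstance(); the thread does not show Mahout's actual code. VersionedAnalyzer is a hypothetical stand-in for a class whose only constructor takes an argument, and tryInstantiate is a made-up helper.

```java
public class MissingNullaryConstructorDemo {
    // Hypothetical stand-in: the only constructor requires an argument.
    static class VersionedAnalyzer {
        VersionedAnalyzer(String version) {
        }
    }

    // Attempts no-argument reflective instantiation and reports the outcome.
    @SuppressWarnings("deprecation") // Class.newInstance() is deprecated in newer JDKs
    static String tryInstantiate(Class<?> cls) {
        try {
            cls.newInstance();
            return "ok";
        } catch (InstantiationException | IllegalAccessException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // No nullary constructor, so reflection cannot build an instance.
        System.out.println(tryInstantiate(VersionedAnalyzer.class)); // prints "InstantiationException"
        // StringBuilder has a public no-argument constructor.
        System.out.println(tryInstantiate(StringBuilder.class));     // prints "ok"
    }
}
```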

