Posted to user@mahout.apache.org by Delroy Cameron <de...@gmail.com> on 2010/04/19 18:01:16 UTC

SnowballAnalyzer

I wanted to use the SnowballAnalyzer to filter my documents for k-means
clustering. It appears to be part of Lucene 2.9.1 but is not in the Mahout job
or jar file. Could anyone suggest a solution apart from downloading
lucene-2.9.1.jar separately and adding it to the Mahout job file, which I'm
not sure would work anyway?

--cheers
Delroy
-- 
View this message in context: http://n3.nabble.com/SnowballAnalyzer-tp729983p729983.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: SnowballAnalyzer

Posted by Delroy Cameron <de...@gmail.com>.
Grant, Sean and Robin,
thanks, guys. I got the lucene-snowball-2.9.1.jar and extended
SnowballAnalyzer with a GenericSnowballAnalyzer class. Its constructor takes
no arguments and calls the SnowballAnalyzer constructor:

public GenericSnowballAnalyzer() {
    super(Version.LUCENE_23, "English", StopWordFilter.getStopWords());
}

I created an inner class with a map of stopwords (StopWordFilter) for the
constructor, to avoid having to add another jar to the project.
To add the snowball jar to the project, I had to add it to ~/.m2/repository
and update MAHOUT_ROOT/pom.xml and MAHOUT_CORE/pom.xml to capture the
dependency.
This worked out; I was able to generate the vectors. Thanks again, guys.
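For readers following along, the pattern described above can be sketched with plain-JDK stand-ins. All class names here (ParamAnalyzer, GenericAnalyzer, StopWordHolder) are hypothetical illustrations, not Lucene's actual API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-in for an analyzer that, like SnowballAnalyzer, has no no-arg constructor.
class ParamAnalyzer {
    final String language;
    final Set<String> stopWords;
    ParamAnalyzer(String language, Set<String> stopWords) {
        this.language = language;
        this.stopWords = stopWords;
    }
}

// No-arg wrapper, analogous to GenericSnowballAnalyzer above.
class GenericAnalyzer extends ParamAnalyzer {
    // Inner stop-word holder, so no extra jar is needed for the word list.
    static final class StopWordHolder {
        static Set<String> getStopWords() {
            return new HashSet<>(Arrays.asList("a", "an", "the", "of"));
        }
    }
    public GenericAnalyzer() {
        super("English", StopWordHolder.getStopWords());
    }
}

public class NoArgWrapperDemo {
    public static void main(String[] args) throws Exception {
        // Reflective instantiation (the path seq2sparse takes with the -a
        // class name) now succeeds, because a no-arg constructor exists.
        ParamAnalyzer a = GenericAnalyzer.class.getDeclaredConstructor().newInstance();
        System.out.println(a.language + ", " + a.stopWords.size() + " stop words");
    }
}
```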

-----
--cheers
Delroy
-- 
View this message in context: http://lucene.472066.n3.nabble.com/SnowballAnalyzer-tp729983p747250.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: SnowballAnalyzer

Posted by Sean Owen <sr...@gmail.com>.
Yes, you can discover the available constructors and their parameters.

But I don't think it makes sense in general to just pass null / 0
to parameters or guess at dummy values. That would be as likely to cause
even subtler errors.

I think what you have to do here is extend SnowballAnalyzer, where the
subclass has a no-arg constructor which calls super(String) with the
right argument.
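The discovery step Sean describes can be sketched as follows; TextAnalyzer is a hypothetical stand-in for SnowballAnalyzer, not a real Lucene type:

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in: one constructor, no no-arg form.
class TextAnalyzer {
    final String name;
    TextAnalyzer(String name) { this.name = name; }
}

public class ConstructorProbe {
    public static void main(String[] args) {
        // Reflection does expose each constructor and its parameter types...
        for (Constructor<?> c : TextAnalyzer.class.getDeclaredConstructors()) {
            System.out.println(c);
            for (Class<?> p : c.getParameterTypes()) {
                System.out.println("  needs: " + p.getName());
            }
        }
        // ...but nothing says WHICH String is valid, so auto-filling null or ""
        // would only move the failure somewhere subtler, as argued above.
    }
}
```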

On Tue, Apr 20, 2010 at 7:57 PM, Robin Anil <ro...@gmail.com> wrote:
> +dev
> @Delroy: Well even if you did correct the spelling. I believe
> SnowballAnalyzer cannot be instantiated without a parameter like
> StandardAnalyzer.
>
> Constructor signature is: SnowballAnalyzer(String name);
> @dev: I am not a java reflection expert. But is there a way we can find the
> parameters of the constructor and automatically put some dummy values in it?
>
> Robin
>

Re: SnowballAnalyzer

Posted by Robin Anil <ro...@gmail.com>.
+dev
@Delroy: Even if you did correct the spelling, I believe
SnowballAnalyzer cannot be instantiated without a parameter, unlike
StandardAnalyzer.

Constructor signature is: SnowballAnalyzer(String name);
@dev: I am not a Java reflection expert, but is there a way we can find the
parameters of the constructor and automatically put some dummy values in it?

Robin

On Wed, Apr 21, 2010 at 12:23 AM, Robin Anil <ro...@gmail.com> wrote:

> org.apache.lucene.analysis.snowball.SnowballAnalyzer
>
> Check spelling
>
> On Tue, Apr 20, 2010 at 10:23 PM, Delroy Cameron <delroy.cameron@gmail.com
> > wrote:
>
>>
>> Grant,
>>
>> i'm trying to generate the Sequence Vectors using the SnowballAnlyzer as
>> opposed to the StandardAnlyzer. I've already gone through this process
>> using
>> the StandardAnlyzer and plotted the output clusters using the k-means dump
>> file, so i'm familiar with clustering in Mahout. i'd like to repeat this
>> exercise with the SnowballAnlyzer, running the following command.
>>
>> ./mahout seq2sparse -s 2 -a
>> org.apache.lucene.anlysis.snowball.SnowballAnlyzer -chunk 100 -i
>> /home/hadoop/tmp/trecdata-seqfiles/chunk-0 -o
>> /home/hadoop/tmp/trecdata-vectors -md 1 -x 75 -wt TFIDF -n 0
>>
>> 1) i've placed the lucene-snowball jar in the  m2 repository
>> /home/delroy/.m2/repository/org/apache/lucene/lucene-snowball/2.9.1
>>
>> 2) and i also updated the Mahout_CORE/pom xml to reflect the dependency
>> <!-- updated by Delroy to use Snowball Anlyzer -->
>>    <dependency>
>>      <groupId>org.apache.lucene</groupId>
>>      <artifactId>lucene-snowball</artifactId>
>>      <version>2.9.1</version>
>>    </dependency>
>>
>> 3) then i did a mvn install on the Mahout_CORE and on Mahout_ROOT, which
>> downloaded the lucene-snowball pom and lucene-snowball pom sha1 to the m2
>> repository
>>
>> this error seems to stem from developer code, which incidentally notes
>> that
>> you should not instantiate the anlyzer at
>> SparseVectorsFromSequenceFiles.java:176 any suggestions here?
>>
>> Output:
>> Exception in thread "main" java.lang.InstantiationException:
>> org.apache.lucene.anlysis.snowball.SnowballAnlyzer
>>        at java.lang.Class.newInstance0(Class.java:357)
>>        at java.lang.Class.newInstance(Class.java:325)
>>        at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main()
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>        at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>        at
>>
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>        at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>        at
>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
>>
>> PS: I just love the spam filter..won't let me write too many variants of
>> the
>> word Analyzer because it contains the word anal.
>>
>>
>> -----
>> --cheers
>> Delroy
>> --
>> View this message in context:
>> http://n3.nabble.com/SnowballAnalyzer-tp729983p732912.html
>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>
>
>

Re: SnowballAnalyzer

Posted by Robin Anil <ro...@gmail.com>.
org.apache.lucene.analysis.snowball.SnowballAnalyzer

Check spelling

On Tue, Apr 20, 2010 at 10:23 PM, Delroy Cameron
<de...@gmail.com>wrote:

>
> Grant,
>
> i'm trying to generate the Sequence Vectors using the SnowballAnlyzer as
> opposed to the StandardAnlyzer. I've already gone through this process
> using
> the StandardAnlyzer and plotted the output clusters using the k-means dump
> file, so i'm familiar with clustering in Mahout. i'd like to repeat this
> exercise with the SnowballAnlyzer, running the following command.
>
> ./mahout seq2sparse -s 2 -a
> org.apache.lucene.anlysis.snowball.SnowballAnlyzer -chunk 100 -i
> /home/hadoop/tmp/trecdata-seqfiles/chunk-0 -o
> /home/hadoop/tmp/trecdata-vectors -md 1 -x 75 -wt TFIDF -n 0
>
> 1) i've placed the lucene-snowball jar in the  m2 repository
> /home/delroy/.m2/repository/org/apache/lucene/lucene-snowball/2.9.1
>
> 2) and i also updated the Mahout_CORE/pom xml to reflect the dependency
> <!-- updated by Delroy to use Snowball Anlyzer -->
>    <dependency>
>      <groupId>org.apache.lucene</groupId>
>      <artifactId>lucene-snowball</artifactId>
>      <version>2.9.1</version>
>    </dependency>
>
> 3) then i did a mvn install on the Mahout_CORE and on Mahout_ROOT, which
> downloaded the lucene-snowball pom and lucene-snowball pom sha1 to the m2
> repository
>
> this error seems to stem from developer code, which incidentally notes that
> you should not instantiate the anlyzer at
> SparseVectorsFromSequenceFiles.java:176 any suggestions here?
>
> Output:
> Exception in thread "main" java.lang.InstantiationException:
> org.apache.lucene.anlysis.snowball.SnowballAnlyzer
>        at java.lang.Class.newInstance0(Class.java:357)
>        at java.lang.Class.newInstance(Class.java:325)
>        at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main()
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>        at java.lang.reflect.Method.invoke(Method.java:616)
>        at
>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
>
> PS: I just love the spam filter..won't let me write too many variants of
> the
> word Analyzer because it contains the word anal.
>
>
> -----
> --cheers
> Delroy
> --
> View this message in context:
> http://n3.nabble.com/SnowballAnalyzer-tp729983p732912.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>

Re: SnowballAnalyzer

Posted by Delroy Cameron <de...@gmail.com>.
Grant, 

I'm trying to generate the sequence vectors using the SnowballAnlyzer as
opposed to the StandardAnlyzer. I've already gone through this process using
the StandardAnlyzer and plotted the output clusters using the k-means dump
file, so I'm familiar with clustering in Mahout. I'd like to repeat this
exercise with the SnowballAnlyzer, running the following command.

./mahout seq2sparse -s 2 -a
org.apache.lucene.anlysis.snowball.SnowballAnlyzer -chunk 100 -i
/home/hadoop/tmp/trecdata-seqfiles/chunk-0 -o
/home/hadoop/tmp/trecdata-vectors -md 1 -x 75 -wt TFIDF -n 0

1) I've placed the lucene-snowball jar in the m2 repository:
/home/delroy/.m2/repository/org/apache/lucene/lucene-snowball/2.9.1

2) I also updated MAHOUT_CORE/pom.xml to reflect the dependency:
<!-- updated by Delroy to use Snowball Anlyzer -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-snowball</artifactId>
      <version>2.9.1</version>
    </dependency>

3) Then I ran mvn install on MAHOUT_CORE and on MAHOUT_ROOT, which
downloaded the lucene-snowball pom and its sha1 to the m2
repository.

This error seems to stem from developer code, which incidentally notes that
you should not instantiate the anlyzer, at
SparseVectorsFromSequenceFiles.java:176. Any suggestions here?

Output:
Exception in thread "main" java.lang.InstantiationException:
org.apache.lucene.anlysis.snowball.SnowballAnlyzer
	at java.lang.Class.newInstance0(Class.java:357)
	at java.lang.Class.newInstance(Class.java:325)
	at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main()
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
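For context, java.lang.InstantiationException is what Class.newInstance() throws when the class it resolves has no accessible no-arg constructor. A minimal JDK-only reproduction, with NeedsArgs as a hypothetical stand-in for the analyzer class:

```java
// Minimal reproduction of the exception in the output above:
// Class.newInstance() fails when the target class lacks a no-arg constructor.
class NeedsArgs {
    NeedsArgs(String name) { }
}

public class InstantiationDemo {
    public static void main(String[] args) {
        try {
            NeedsArgs.class.newInstance(); // the call pattern the driver uses
        } catch (InstantiationException e) {
            System.out.println("InstantiationException: " + e.getMessage());
        } catch (IllegalAccessException e) {
            System.out.println("IllegalAccessException: " + e.getMessage());
        }
    }
}
```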

PS: I just love the spam filter; it won't let me write too many variants of the
word Analyzer because it contains the word anal.


-----
--cheers
Delroy
-- 
View this message in context: http://n3.nabble.com/SnowballAnalyzer-tp729983p732912.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: SnowballAnalyzer

Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 19, 2010, at 12:01 PM, Delroy Cameron wrote:

> 
> i wanted to use the SnowballAnalyzer to filter my documents for k-means
> clustering. It appears to be part of Lucene 2.9.1 but not in the Mahout job
> or jar file...
> could anyone suggest a solution apart from downloading lucene 2.9.1.jar
> separately and adding it to the Mahout job file...which i'm not sure will
> work anyway..

That would be the route I would go.  The JOB file is just a JAR file, so it should be straightforward to add.
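As a sketch of that route: a .job file is an ordinary JAR/ZIP archive, so the snowball classes can be merged into it with nothing but java.util.zip. File names in the comment are illustrative; on a name collision (e.g. META-INF/MANIFEST.MF) the entry from the first archive wins.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class JobFileMerge {
    // Copy every entry of the given archives into 'out', skipping duplicates,
    // e.g. merge(Path.of("merged.job"),
    //            Path.of("mahout-examples.job"),
    //            Path.of("lucene-snowball-2.9.1.jar"));
    public static void merge(Path out, Path... archives) throws IOException {
        Set<String> seen = new HashSet<>();
        try (ZipOutputStream zout = new ZipOutputStream(Files.newOutputStream(out))) {
            for (Path src : archives) {
                try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(src))) {
                    for (ZipEntry e; (e = zin.getNextEntry()) != null; ) {
                        if (!seen.add(e.getName())) continue; // first archive wins
                        zout.putNextEntry(new ZipEntry(e.getName()));
                        zin.transferTo(zout);
                        zout.closeEntry();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: JobFileMerge out.job in1.jar in2.jar ...");
            return;
        }
        Path[] ins = new Path[args.length - 1];
        for (int i = 1; i < args.length; i++) ins[i - 1] = Path.of(args[i]);
        merge(Path.of(args[0]), ins);
    }
}
```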