Posted to pylucene-dev@lucene.apache.org by technology inspired <te...@gmail.com> on 2010/09/03 11:07:48 UTC

Stopwords in StandardAnalyzer; Constructor

Hi,

How can one define the list of stop words used by StandardAnalyzer?
According to the Lucene Java API docs, a set of stop words can be passed to
the constructor. I want to prevent a few words, such as "The", "on", and
"off", from being skipped during indexing when using StandardAnalyzer.

How would one call such a constructor in PyLucene?


Regards,
Vin

Re: Stopwords in StandardAnalyzer; Constructor

Posted by Andi Vajda <va...@apache.org>.
On Fri, 3 Sep 2010, technology inspired wrote:

> How can one define the list of stop words used by StandardAnalyzer?
> According to the Lucene Java API docs, a set of stop words can be passed to
> the constructor. I want to prevent a few words, such as "The", "on", and
> "off", from being skipped during indexing when using StandardAnalyzer.
>
> How would one call such a constructor in PyLucene?

Stop words can be passed to StandardAnalyzer via a Set instance.
To do this you can either:

   - add java.util.HashSet to PyLucene's jcc invocation in the Makefile,
     rebuild PyLucene and then use a HashSet instance (in the Makefile, look
     for java.util.Arrays and add java.util.HashSet below it).

   - use the JavaSet class in the collections.py module that is installed
     with PyLucene. The JavaSet class is a Python class that extends
     PythonSet, a Java class that implements the java.util.Set interface.
     JavaSet takes a set instance, wraps it and makes its elements
     accessible to Java via the java.util.Set interface.
     For example:
         >>> from lucene import *
         >>> from lucene.collections import JavaSet
         >>> initVM()
         <jcc.JCCEnv object at 0x10040a0d8>
         >>> a=set(['foo', 'bar', 'baz'])
         >>> b=JavaSet(a)
         >>> b
         <JavaSet: org.apache.pylucene.util.PythonSet@424ecfdd>
         >>> StandardAnalyzer(Version.LUCENE_CURRENT, b)
         <StandardAnalyzer: org.apache.lucene.analysis.standard.StandardAnalyzer@4430d82d>
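
To keep words such as "the", "on" and "off" indexed, the set handed to
JavaSet simply has to leave them out. The following is a minimal sketch of
that step, not a definitive recipe. BASE_STOP_WORDS is an illustrative
placeholder list, and custom_stop_words and build_analyzer are hypothetical
helper names; the PyLucene calls inside build_analyzer mirror the session
above and assume the same flat-namespace PyLucene release.

```python
# Words that should remain searchable, taken from the question.
KEEP = {"The", "on", "off"}

# Placeholder starting list (an assumption for illustration only);
# substitute the stop words your application actually wants to drop.
BASE_STOP_WORDS = {"a", "an", "and", "the", "of", "on", "or", "to"}


def custom_stop_words(base, keep):
    """Return the words in base that are not in keep, ignoring case."""
    keep_lower = {word.lower() for word in keep}
    return {word for word in base if word.lower() not in keep_lower}


stop_words = custom_stop_words(BASE_STOP_WORDS, KEEP)


def build_analyzer(words):
    """Pass a Python set of stop words to StandardAnalyzer via JavaSet.

    Requires PyLucene. Follows the interactive session above; module
    layout differs in other PyLucene releases, so treat this as a sketch.
    """
    import lucene
    from lucene.collections import JavaSet

    lucene.initVM()  # start the Java VM; call once per process
    return lucene.StandardAnalyzer(lucene.Version.LUCENE_CURRENT,
                                   JavaSet(words))
```

Calling build_analyzer(stop_words) would then give an analyzer that still
filters the remaining words but no longer drops "the" or "on". Words that
were never in the starting list, such as "off" here, need no special
handling.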

Andi..