Posted to dev@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2006/11/24 09:41:56 UTC

Analyzer thread safety; Stop words

Two points about Analyzers:

Does anyone have any experience with the thread safety of Analyzer implementations?
Apart from PerFieldAnalyzerWrapper, the analyzers seem to be thread safe, but
is there a requirement that analyzers should be thread safe?

Secondly, has anyone thought that it would be a good idea to extend the Analyzer 
interface (Abstract class) to allow a standard way to set stop words?  There 
seem to be two 'families' of stop word configuration via constructors.

There are the Set, File and String[] constructors in Analyzers such as
StandardAnalyzer and StopAnalyzer, and then there are the Russian/Greek variants,
which do not have the same constructor signatures to configure stopwords.

It makes it messy to make analyzers pluggable in a generic way so that stopwords 
can be configurable for any plugged analyzer.

Antony


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Analyzer thread safety; Stop words

Posted by Chris Hostetter <ho...@fucit.org>.
: Something seems confused to me.  Although stop words are used by Filters, they
: are currently exposed via Analyzers which is the granularity used at the
: IndexWriter/Parser levels.  This is what contributors are writing, not Filters.

that's not really true .. if you look at the various contrib packages that
include an Analyzer, you'll find that in most of them the real
"interesting" work that makes the contrib useful is being done in either
a Filter or a Tokenizer living in the same package ... the Analyzer is
typically just provided as a convenience -- a representation of what the
contributor felt the best usage was of that Filter/Tokenizer with other
stock Filters/Tokenizers.


-Hoss




Re: Analyzer thread safety; Stop words

Posted by Antony Bowesman <ad...@teamware.com>.
Yonik Seeley wrote:
> On 11/29/06, Antony Bowesman <ad...@teamware.com> wrote:
>> Yonik Seeley wrote:
> 
> The GreekAnalyzer is just an example of how you can use existing
> Analyzers (as long as they have a default constructor), but it's not
> the recommended approach.
> 
> TokenFilters are preferred over Analyzers.... you can plug them
> together in any way you see fit to solve your analysis problem.  For
> Solr, an added bonus of using chains of filters is that Solr can
> "know" about the results after each filter and show you the results on
> an analysis web page (very useful for debugging).
> 
> If I were to analyze Greek text, I might do something like this:
> 
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="Greek"/>
>     </analyzer>
> </fieldtype>
> 
> If you try to put everything in Analyzer constructors, you get
> combinatorial explosion.

I guess you would use methods rather than, as you say, getting into constructor
hell.  Anyway, I'll have a deeper look at the Solr stuff when I get to phase 2.
Right now, I've gone as far with analysis as I need to, but I would like to
get better configuration than I've currently got.  I know it will come back to
bite...

Thanks for your comments Yonik
Antony





Re: Analyzer thread safety; Stop words

Posted by Yonik Seeley <yo...@apache.org>.
On 11/29/06, Yonik Seeley <yo...@apache.org> wrote:
> If I were to analyze Greek text, I might do something like this:
>
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="Greek"/>
>     </analyzer>
> </fieldtype>

Hmm, I just discovered that the Porter2 snowball stemmers don't support Greek.
Here is the relevant code of the GreekAnalyzer; to duplicate it, I'd make a
FilterFactory for GreekLowerCaseFilter and reuse existing factories for the
rest.

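// Excerpt from GreekAnalyzer; charset and stopSet are members of the analyzer.
public TokenStream tokenStream(String fieldName, Reader reader)
{
  // Tokenize the raw text, then wrap the stream in one filter per step.
  TokenStream result = new StandardTokenizer(reader);
  result = new GreekLowerCaseFilter(result, charset);  // Greek-aware lowercasing
  result = new StopFilter(result, stopSet);            // drop stop words
  return result;
}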

At some point I'd like to get to automatic generation of
FilterFactories if none existed so new Lucene filters could be used
without any extra coding.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: Analyzer thread safety; Stop words

Posted by Yonik Seeley <yo...@apache.org>.
On 11/29/06, Antony Bowesman <ad...@teamware.com> wrote:
> Yonik Seeley wrote:
> > On 11/29/06, Antony Bowesman <ad...@teamware.com> wrote:
> >>
> >> That's true, but all the existing Analyzers allow the stop set to be
> >> configured via the analyzer constructors, but in different ways.
> >
> > But you can duplicate most Analyzers (all the ones in Lucene?) with a
> > chain of Tokenizers and TokenFilters (since that is how almost all of
> > them are implemented).  Most Analyzers are simply shortcuts to putting
> > together your own.
>
> Something seems confused to me.  Although stop words are used by Filters, they
> are currently exposed via Analyzers which is the granularity used at the
> IndexWriter/Parser levels.  This is what contributors are writing, not Filters.
>
> There are lots of analysis contributions which deal with stop words that are
> perfectly usable as is.  They shouldn't need to be duplicated to be re-used and
> if that's needed, it points to a deficiency in the design.  If we all have to
> put together our own, again, doesn't this argue that there should be a standard
> way of doing it at the higher Analyzer level?
>
> Sure, the Solr way of using the configurable filters gives great flexibility.
> Your solrconfig.xml example shows how the GreekAnalyzer can be deployed, but it
> also highlights the problem that it does not seem to be possible to make use of
> the stopword Hashtable available to the GreekAnalyzer constructor.

The GreekAnalyzer is just an example of how you can use existing
Analyzers (as long as they have a default constructor), but it's not
the recommended approach.

TokenFilters are preferred over Analyzers.... you can plug them
together in any way you see fit to solve your analysis problem.  For
Solr, an added bonus of using chains of filters is that Solr can
"know" about the results after each filter and show you the results on
an analysis web page (very useful for debugging).

If I were to analyze Greek text, I might do something like this:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Greek"/>
    </analyzer>
</fieldtype>

If you try to put everything in Analyzer constructors, you get
combinatorial explosion.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: Analyzer thread safety; Stop words

Posted by Antony Bowesman <ad...@teamware.com>.
Yonik Seeley wrote:
> On 11/29/06, Antony Bowesman <ad...@teamware.com> wrote:
>>
>> That's true, but all the existing Analyzers allow the stop set to be
>> configured via the analyzer constructors, but in different ways.
> 
> But you can duplicate most Analyzers (all the ones in Lucene?) with a
> chain of Tokenizers and TokenFilters (since that is how almost all of
> them are implemented).  Most Analyzers are simply shortcuts to putting
> together your own.

Something seems confused to me.  Although stop words are used by Filters, they
are currently exposed via Analyzers which is the granularity used at the
IndexWriter/Parser levels.  This is what contributors are writing, not Filters.

There are lots of analysis contributions which deal with stop words that are
perfectly usable as is.  They shouldn't need to be duplicated to be re-used and
if that's needed, it points to a deficiency in the design.  If we all have to
put together our own, again, doesn't this argue that there should be a standard
way of doing it at the higher Analyzer level?

Sure, the Solr way of using the configurable filters gives great flexibility.
Your solrconfig.xml example shows how the GreekAnalyzer can be deployed, but it
also highlights the problem that it does not seem to be possible to make use of
the stopword Hashtable available to the GreekAnalyzer constructor.

It seems to me that Lucene would benefit if there was an Analyzer Interface.  On 
the other hand, maybe your TokenFilterFactory stuff would be useful as part of 
Lucene.

Anyway, just my penny's worth.
Antony




Re: Analyzer thread safety; Stop words

Posted by Yonik Seeley <yo...@apache.org>.
On 11/29/06, Antony Bowesman <ad...@teamware.com> wrote:
> >> seem to be two 'families' of stop word configuration via constructors.
> >
> > That belongs at the TokenFilter level (where it currently is).
>
> That's true, but all the existing Analyzers allow the stop set to be configured
> via the analyzer constructors, but in different ways.

But you can duplicate most Analyzers (all the ones in Lucene?) with a
chain of Tokenizers and TokenFilters (since that is how almost all of
them are implemented).  Most Analyzers are simply shortcuts to putting
together your own.
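
To make the "chain" idea concrete, here is a rough, self-contained sketch in
plain Java. It is a toy model and does not use the Lucene classes: ToyChain,
whitespaceTokenizer(), lowerCase(), stop() and analyzer() are all made-up
names for illustration. The point is only that an "analyzer" ends up being
one tokenizer followed by filters applied in order.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy model, NOT the Lucene API: a tokenizer turns text into a stream of
// tokens, and each filter wraps the stream produced by the previous stage.
final class ToyChain {
    static Function<String, Stream<String>> whitespaceTokenizer() {
        Pattern ws = Pattern.compile("\\s+");
        return text -> ws.splitAsStream(text.trim()).filter(t -> !t.isEmpty());
    }

    static UnaryOperator<Stream<String>> lowerCase() {
        return tokens -> tokens.map(t -> t.toLowerCase(Locale.ROOT));
    }

    static UnaryOperator<Stream<String>> stop(Set<String> stopWords) {
        return tokens -> tokens.filter(t -> !stopWords.contains(t));
    }

    // An "analyzer" is just a tokenizer followed by filters, applied in order.
    @SafeVarargs
    static Function<String, List<String>> analyzer(
            Function<String, Stream<String>> tokenizer,
            UnaryOperator<Stream<String>>... filters) {
        return text -> {
            Stream<String> stream = tokenizer.apply(text);
            for (UnaryOperator<Stream<String>> f : filters) {
                stream = f.apply(stream);
            }
            return stream.collect(Collectors.toList());
        };
    }

    public static void main(String[] args) {
        Function<String, List<String>> a = analyzer(
                whitespaceTokenizer(), lowerCase(), stop(Set.of("the", "of")));
        // prints [quick, brown, fox, lucene]
        System.out.println(a.apply("The Quick Brown Fox of Lucene"));
    }
}
```

Swapping the stop set or adding another stage only changes the argument list
passed to analyzer(), which is roughly why configuration at the filter level
composes more easily than a growing set of Analyzer constructors.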

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: Analyzer thread safety; Stop words

Posted by Antony Bowesman <ad...@teamware.com>.
Hi Yonik,

Thanks for your comments.

>> Secondly, has anyone thought that it would be a good idea to extend the
>> Analyzer interface (Abstract class) to allow a standard way to set stop
>> words?  There seem to be two 'families' of stop word configuration via
>> constructors.
> 
> That belongs at the TokenFilter level (where it currently is).

That's true, but all the existing Analyzers allow the stop set to be configured 
via the analyzer constructors, but in different ways.

For example StandardAnalyzer has:

public StandardAnalyzer(String[] stopWords)
public StandardAnalyzer(Set stopWords)
public StandardAnalyzer(File stopwords)

whereas RussianAnalyzer has:

public RussianAnalyzer(char[] charset, Hashtable stopwords)
public RussianAnalyzer(char[] charset, String[] stopwords)

so, this does not make common stop word configuration possible without some 
messy code to look at constructor signatures and make some guesses.
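
As an illustration of the kind of guessing I mean, here is a small
self-contained sketch. The classes ToyStandardAnalyzer and ToyRussianAnalyzer
are hypothetical stand-ins that only mimic the two constructor families above
(they are not the Lucene classes), and StopWordPlugin.create() is a made-up
helper. It tries one signature, then falls back to the other.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins that only mimic the two constructor "families";
// they are NOT the Lucene classes.
class ToyStandardAnalyzer {
    final Set<String> stops;
    public ToyStandardAnalyzer(String[] stopWords) {
        this.stops = new HashSet<>(Arrays.asList(stopWords));
    }
}

class ToyRussianAnalyzer {
    final char[] charset;
    final Set<String> stops;
    public ToyRussianAnalyzer(char[] charset, String[] stopwords) {
        this.charset = charset;
        this.stops = new HashSet<>(Arrays.asList(stopwords));
    }
}

final class StopWordPlugin {
    // Guess which constructor family a class belongs to and call it.
    static Object create(Class<?> type, String[] stopWords, char[] defaultCharset) {
        try {
            try {
                // First guess: the (String[]) family. The cast to Object makes
                // the array ONE argument instead of a varargs list.
                return type.getConstructor(String[].class)
                           .newInstance((Object) stopWords);
            } catch (NoSuchMethodException e) {
                // Second guess: the (charset, stopwords) family.
                return type.getConstructor(char[].class, String[].class)
                           .newInstance(defaultCharset, stopWords);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "No known stop word constructor on " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        String[] stops = {"the", "a"};
        char[] charset = {'a', 'b'}; // placeholder charset table
        System.out.println(create(ToyStandardAnalyzer.class, stops, charset)
                .getClass().getSimpleName());
        System.out.println(create(ToyRussianAnalyzer.class, stops, charset)
                .getClass().getSimpleName());
    }
}
```

Every new signature family needs another fallback branch, and the caller has
to invent a default for arguments such as the charset, which is the messy part.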

Perhaps the Analyzer class could have some default methods, e.g.

public void setStopWords(File stopWordFile);
public void setStopWords(Set stopWordSet);
public void setStopWords(String[] stopWords);
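
A minimal sketch of what I have in mind, assuming a hypothetical base class
(StopWordAwareAnalyzer is a made-up name, not Lucene's Analyzer) and an
assumed stop word file format of one word per line in UTF-8. All three
setters funnel into a single Set that subclasses read.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical base class, NOT Lucene's Analyzer: all three setters funnel
// into one Set so subclasses only ever read getStopWords().
abstract class StopWordAwareAnalyzer {
    private volatile Set<String> stopWords = Collections.emptySet();

    public void setStopWords(Set<String> stopWordSet) {
        // Publish an unmodifiable copy so later caller changes have no effect.
        this.stopWords = Collections.unmodifiableSet(new LinkedHashSet<>(stopWordSet));
    }

    public void setStopWords(String[] stopWords) {
        Set<String> set = new LinkedHashSet<>(Arrays.asList(stopWords));
        setStopWords(set);
    }

    // Assumed file format: one stop word per line, UTF-8, blank lines ignored.
    public void setStopWords(File stopWordFile) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        for (String line : Files.readAllLines(stopWordFile.toPath(), StandardCharsets.UTF_8)) {
            String w = line.trim();
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        setStopWords(words);
    }

    protected Set<String> getStopWords() {
        return stopWords;
    }
}

// Tiny concrete subclass so the sketch can be exercised.
final class ToyConfigurableAnalyzer extends StopWordAwareAnalyzer {
    boolean isStopWord(String token) {
        return getStopWords().contains(token);
    }

    public static void main(String[] args) {
        ToyConfigurableAnalyzer a = new ToyConfigurableAnalyzer();
        a.setStopWords(new String[] {"the", "a"});
        System.out.println(a.isStopWord("the")); // prints true
        System.out.println(a.isStopWord("fox")); // prints false
    }
}
```

One caveat with this shape: setters make the analyzer mutable, so changing
the stop words while other threads are analyzing would change their results
part-way through, even though the volatile field keeps the swap itself safe.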

> Things currently are pluggable: one makes new Analyzers by plugging
> together a Tokenizer followed by several TokenFilters.
> 
> If you are talking about some sort of external configuration, take a
> look at Solr.

Yes, you've done some nice stuff there with Solr.  Unfortunately, I only came 
across it some time after I'd already done a lot of the work for our system.

Antony





Re: Analyzer thread safety; Stop words

Posted by Yonik Seeley <yo...@apache.org>.
On 11/24/06, Antony Bowesman <ad...@teamware.com> wrote:
> Two points about Analyzers:
>
> Does anyone have any experience with the thread safety of Analyzer implementations?
> Apart from PerFieldAnalyzerWrapper, the analyzers seem to be thread safe, but
> is there a requirement that analyzers should be thread safe?

Yes, and they normally are thread safe as they create new Tokenizers
and TokenFilters for each field value analyzed.
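
A rough sketch of that pattern in plain Java, as a toy model rather than the
Lucene API (ToyStopAnalyzer, tokenStream() and analyze() are made-up names):
the only state shared between threads is an immutable stop set, and every
call builds a fresh stream object with its own cursor.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy stand-in for an Analyzer; NOT the Lucene API. The only shared state is
// an immutable stop set; everything mutable is created per call.
final class ToyStopAnalyzer {
    private final Set<String> stopWords;

    ToyStopAnalyzer(Set<String> stopWords) {
        this.stopWords = Set.copyOf(stopWords); // defensive, immutable copy
    }

    // Each call builds a fresh iterator ("token stream") with its own cursor.
    Iterator<String> tokenStream(String text) {
        final String[] raw = text.toLowerCase(Locale.ROOT).split("\\s+");
        return new Iterator<String>() {
            private int pos = 0; // per-stream state, never shared

            private void skipStopWords() {
                while (pos < raw.length
                        && (raw[pos].isEmpty() || stopWords.contains(raw[pos]))) {
                    pos++;
                }
            }

            @Override public boolean hasNext() {
                skipStopWords();
                return pos < raw.length;
            }

            @Override public String next() {
                skipStopWords();
                if (pos >= raw.length) {
                    throw new NoSuchElementException();
                }
                return raw[pos++];
            }
        };
    }

    static List<String> analyze(ToyStopAnalyzer analyzer, String text) {
        List<String> out = new ArrayList<>();
        analyzer.tokenStream(text).forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) throws Exception {
        ToyStopAnalyzer shared = new ToyStopAnalyzer(Set.of("the", "a", "of"));
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<List<String>>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            Callable<List<String>> task =
                    () -> analyze(shared, "The safety of a shared Analyzer");
            results.add(pool.submit(task));
        }
        for (Future<List<String>> f : results) {
            System.out.println(f.get()); // each line: [safety, shared, analyzer]
        }
        pool.shutdown();
    }
}
```

An analyzer that kept the cursor (or a reused stream) in a field instead
would lose this property, since concurrent callers would then share it.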

> Secondly, has anyone thought that it would be a good idea to extend the Analyzer
> interface (Abstract class) to allow a standard way to set stop words?  There
> seem to be two 'families' of stop word configuration via constructors.

That belongs at the TokenFilter level (where it currently is).

> There are the Set, File and String[] constructors in Analyzers such as
> StandardAnalyzer and StopAnalyzer, and then there are the Russian/Greek variants,
> which do not have the same constructor signatures to configure stopwords.
>
> It makes it messy to make analyzers pluggable in a generic way so that stopwords
> can be configurable for any plugged analyzer.

Things currently are pluggable: one makes new Analyzers by plugging
together a Tokenizer followed by several TokenFilters.

If you are talking about some sort of external configuration, take a
look at Solr.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
