Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/03/09 05:02:46 UTC

[lucy-dev] Other Analyzer renames

Greets,

In addition to the renaming of Tokenizer to RegexTokenizer, there are a couple
of other Analyzer classes I think we should consider moving.

"Stemmer" should be renamed for the same reason as Tokenizer -- the generic
name should be reserved for the interface, as there are other stemmers out
there besides Snowball's.

    Lucy::Analysis::Stemmer => Lucy::Analysis::SnowballStemmer

Similarly, Lucy::Analysis::Stopalizer depends on materials that originate with
the Snowball project and should probably incorporate "Snowball" into its name.
However, unlike "tokenizer" and "stemmer", the word "stopalizer" isn't
standard terminology.  We don't have to keep it.

Lucene supplies "StopFilter" (which subclasses "TokenFilter") and
"StopAnalyzer" (which subclasses Analyzer).  Those suggest either
"SnowballStopFilter" or "SnowballStopAnalyzer", of which I think
"SnowballStopFilter" is better.

    Lucy::Analysis::Stopalizer => Lucy::Analysis::SnowballStopFilter

Lastly, I'm inclined towards breaking up PolyAnalyzer.  IMO, it should keep
its current behavior when you supply an array of analyzers...

    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => \@analyzers,
    );

... but I think the PolyAnalyzer's language-specific pre-fab sets
incorporating a regex tokenizer, a Snowball stopalizer, and a Snowball stemmer
shouldn't be core Lucy.  In other words, it should be possible to compile Lucy
under the C API and use PolyAnalyzer's analyzers-in-series capabilities
without requiring linking in a regex engine and the Snowball libraries as
prerequisites.
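
To illustrate the point, here is a minimal sketch of analyzers-in-series with
no regex engine or Snowball dependency.  Everything under Toy:: is
hypothetical and invented for illustration, including the transform() method
operating on a plain array of tokens; it is not Lucy's actual API or
implementation.

```perl
use strict;
use warnings;

# Hypothetical toy classes for illustration only; not Lucy's real API.

package Toy::LowerCaser;
sub new { return bless {}, shift }

# Lowercase every token.
sub transform {
    my ( $self, $tokens ) = @_;
    return [ map { lc $_ } @$tokens ];
}

package Toy::ShortTokenFilter;
sub new {
    my ( $class, %args ) = @_;
    return bless { min_length => $args{min_length} }, $class;
}

# Drop tokens shorter than min_length.
sub transform {
    my ( $self, $tokens ) = @_;
    return [ grep { length($_) >= $self->{min_length} } @$tokens ];
}

package Toy::PolyAnalyzer;
sub new {
    my ( $class, %args ) = @_;
    return bless { analyzers => $args{analyzers} }, $class;
}

# Feed the output of each analyzer into the next, in order.
sub transform {
    my ( $self, $tokens ) = @_;
    $tokens = $_->transform($tokens) for @{ $self->{analyzers} };
    return $tokens;
}

package main;

my @analyzers = (
    Toy::LowerCaser->new,
    Toy::ShortTokenFilter->new( min_length => 3 ),
);
my $polyanalyzer = Toy::PolyAnalyzer->new( analyzers => \@analyzers );
my $result = $polyanalyzer->transform( [qw( An Apple IS Red )] );
print join( ' ', @$result ), "\n";    # prints "apple red"
```

The chaining class only needs each member to respond to transform(), so
nothing in it pulls in a tokenizer, stoplist, or stemmer.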

My impulse is to factor an "EasyAnalyzer" class out of PolyAnalyzer.

    my $analyzer = Lucy::Analysis::EasyAnalyzer->new(
        language => 'en',
    );

However, I don't consider simplifying PolyAnalyzer as important as vacating
the namespaces for Tokenizer and Stemmer prior to release 0.1.0.  There's
likely to be an Analyzer overhaul focusing on speed, opening up the API, and
modularization sometime in the reasonably near future, and there's no
guarantee that "PolyAnalyzer" will survive that overhaul in a recognizable
form, if it survives at all.

Marvin Humphrey


Re: [lucy-dev] Other Analyzer renames

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/10/11 6:33 PM:

> I think we more-or-less have consensus on moving Tokenizer, Stemmer, and
> Stopalizer.  We can take care of those, leaving PolyAnalyzer for another day,
> and mark the checklist item as done.

+1

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Other Analyzer renames

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Mar 09, 2011 at 10:45:03PM -0600, Peter Karman wrote:
> I think moving these class names is probably good. But does it need to happen
> before 0.1?

For simple renaming, as with Tokenizer, Stemmer, and Stopalizer, the consensus
building takes a lot longer than the coding. :)  

I think it's a good idea to get these classes moved now, because we minimize
disruptions to the Lucy userbase hivemind down the road.  Also, if we were to
release as is then move them later, I'd advocate installing compatibility stubs
so that apps don't crash on update -- but we don't need to install those stubs
if we make the moves now.
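
As a rough sketch of what such a stub could look like in Perl, the old class
name can be kept alive as an empty subclass of the new one.  The Toy::
packages below are hypothetical stand-ins so the example runs without Lucy
installed; the real classes would need their own treatment.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the class under its new name.
package Toy::Analysis::SnowballStemmer;
sub new {
    my ( $class, %args ) = @_;
    return bless { language => $args{language} }, $class;
}
sub language { return $_[0]->{language} }

# Compatibility stub: the old name survives as an empty subclass, so
# existing code calling Toy::Analysis::Stemmer->new keeps working.
package Toy::Analysis::Stemmer;
our @ISA = ('Toy::Analysis::SnowballStemmer');

package main;

my $stemmer = Toy::Analysis::Stemmer->new( language => 'en' );
print ref($stemmer), "\n";    # prints "Toy::Analysis::Stemmer"
print $stemmer->isa('Toy::Analysis::SnowballStemmer') ? "ok\n" : "not ok\n";
```

Objects built through the old name are still blessed into the old class but
inherit all behavior from the new one.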

For PolyAnalyzer, it's a tad trickier because we're contemplating a refactor
rather than a simple move.  Furthermore, there's another refactoring pass on
the horizon, and I'm not sure that we've nailed the API design with the
proposed breakout of EasyAnalyzer.

This is the checklist item from
<http://wiki.apache.org/lucy/Release-0.1-incubating> that I'm trying to clear
out:

  * Move some classes around (all Analyzers underneath LucyX? Nothing under
    LucyX? Factor SnowballStopalizer out of Stopalizer?) 

LucyX has been addressed in another post.

I think we more-or-less have consensus on moving Tokenizer, Stemmer, and
Stopalizer.  We can take care of those, leaving PolyAnalyzer for another day,
and mark the checklist item as done.

Marvin Humphrey


Re: [lucy-dev] Other Analyzer renames

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/8/11 10:02 PM:
> Greets,
> 
> In addition to the renaming of Tokenizer to RegexTokenizer, there are a couple
> other Analyzer classes I think we should consider moving.

I think moving these class names is probably good. But does it need to happen
before 0.1?

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com