Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2011/12/06 13:51:05 UTC
Re: [lucy-dev] Re: [lucy-commits] svn commit: r1210630 - in /incubator/lucy/branches/LUCY-196-uax-tokenizer:
core/Lucy/Analysis/PolyAnalyzer.c core/Lucy/Analysis/PolyAnalyzer.cfh perl/lib/Lucy/Analysis/PolyAnalyzer.pm
On 05/12/2011 22:38, Marvin Humphrey wrote:
> Hi, Nick,
>
> Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!
>
> On Mon, Dec 05, 2011 at 09:02:42PM -0000, nwellnhof@apache.org wrote:
>> PolyAnalyzer*
>> PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
>> @@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
>>      else if (language) {
>>          self->analyzers = VA_new(3);
>>          VA_Push(self->analyzers, (Obj*)CaseFolder_new());
>> -        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
>> +        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
>>          VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
>>      }
>
> This will cause a backwards compatibility break. I really want to make your
> StandardTokenizer the default, but I think we might want to go about it
> differently.
I made that change mainly to see whether the test suite would break (it
didn't). I plan to revert it before committing StandardTokenizer to trunk.
> How about we leave PolyAnalyzer alone, but add a new class called
> "EasyAnalyzer", with the following default stack:
>
> 1. StandardTokenizer
> 2. Normalizer
> 3. SnowballStemmer
>
> This integrates both your recent contributions, plus changes the order to
> avoid the Highlighter problems you identified and be more in line with the
> potential refactoring you talked about.
>
> It would be nice to benchmark this just to see what sort of performance impact
> changing the order has before we finalize it.
>
> If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
> throughout the tutorial and other high-level documentation.
Sounds like a good idea.
Nick
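
The proposed EasyAnalyzer ordering (tokenize, then normalize, then stem) can be sketched in plain C as below. This is only a toy model, not Lucy code: the Token and TokenList types and the toy_* and easy_analyze functions are hypothetical names invented for illustration, the tokenizer splits on ASCII non-alphanumerics rather than following Unicode word-break rules, and the "stemmer" merely strips a trailing "s" instead of running Snowball. It also assumes that the Highlighter concern is about token offsets, which the thread does not spell out: because tokenization runs first, each token's start and end still index into the original, unmodified input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS    32
#define MAX_TOKEN_LEN 64

/* Hypothetical token type: text plus offsets into the ORIGINAL input. */
typedef struct {
    char   text[MAX_TOKEN_LEN];
    size_t start;  /* inclusive byte offset */
    size_t end;    /* exclusive byte offset */
} Token;

typedef struct {
    Token  tokens[MAX_TOKENS];
    size_t count;
} TokenList;

/* Step 1: toy tokenizer. Splits on non-alphanumeric ASCII and records
 * offsets before any other step has touched the text. */
static void
toy_tokenize(const char *input, TokenList *out) {
    size_t len = strlen(input);
    size_t i   = 0;
    out->count = 0;
    while (i < len && out->count < MAX_TOKENS) {
        while (i < len && !isalnum((unsigned char)input[i])) { i++; }
        if (i >= len) { break; }
        size_t start = i;
        while (i < len && isalnum((unsigned char)input[i])) { i++; }
        size_t n = i - start;
        if (n >= MAX_TOKEN_LEN) { n = MAX_TOKEN_LEN - 1; } /* truncate text only */
        Token *t = &out->tokens[out->count++];
        memcpy(t->text, input + start, n);
        t->text[n] = '\0';
        t->start   = start;
        t->end     = i;
    }
}

/* Step 2: toy normalizer. Lowercases token text; offsets are untouched. */
static void
toy_normalize(TokenList *list) {
    for (size_t i = 0; i < list->count; i++) {
        for (char *p = list->tokens[i].text; *p; p++) {
            *p = (char)tolower((unsigned char)*p);
        }
    }
}

/* Step 3: toy stemmer. Strips one trailing 's' from tokens longer than
 * three characters. A stand-in for Snowball, not an implementation of it. */
static void
toy_stem(TokenList *list) {
    for (size_t i = 0; i < list->count; i++) {
        char  *text = list->tokens[i].text;
        size_t n    = strlen(text);
        if (n > 3 && text[n - 1] == 's') { text[n - 1] = '\0'; }
    }
}

/* The stack in the proposed order: tokenizer, normalizer, stemmer. */
static void
easy_analyze(const char *input, TokenList *out) {
    assert(input != NULL && out != NULL);
    toy_tokenize(input, out);
    toy_normalize(out);
    toy_stem(out);
}
```

Calling easy_analyze on "Fast Tokenizers, please" yields three tokens in this model; the second has text "tokenizer" while its offsets still cover the ten bytes of "Tokenizers" in the input. Swapping the first two steps would require the tokenizer to run over already-modified text, which is where offsets could drift if normalization ever changed string length.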