
Posted to java-user@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2013/03/15 16:29:47 UTC

Re: Migrating SnowballAnalyzer to 4.1

2013/2/28 Steve Rowe <sa...@gmail.com>:

> EnglishAnalyzer has used PorterStemmer instead of the English Snowball stemmer since it was created in 2010 as part of LUCENE-2055[2].  I think this is an oversight: EnglishAnalyzer should incorporate the best English stemmer we've got, and Martin Porter says the Porter2 stemmer is better[1].  Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do you think?

This was intentional, actually. The default is a tradeoff between the
"benefits" of Porter2 (which affect less than 5% of English vocabulary,
if you read around the Snowball site) and a much more significant
performance difference.

For example, see the tests I ran indexing both short and long texts:

http://find.searchhub.org/document/c1d3301b71dab5ca#46a8351089a98aec

That's overall indexing speed, not just text analysis.

The Snowball stemmer might also be faster these days (we've made some
improvements).

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Migrating SnowballAnalyzer to 4.1

Posted by Robert Muir <rc...@gmail.com>.

On Sat, Mar 16, 2013 at 12:57 AM, Steve Rowe <sa...@gmail.com> wrote:
>
> Thanks for the explanation.
>
> I ran a lucene/benchmark alg comparing the two stemmers on trunk on my MacBook Pro with Oracle Java 1.7.0_13, and it looks like the situation hasn't changed much.
>
> The original-algorithm Porter stemmer is 4 times faster than the Porter2/English Snowball stemmer, resulting in 40% higher throughput in a full English analysis pipeline.
>
> So the default English stemmer choice is still valid IMO.
>

Thanks a lot for running this!


Re: Migrating SnowballAnalyzer to 4.1

Posted by Steve Rowe <sa...@gmail.com>.

Hi Robert,

On Mar 15, 2013, at 11:29 AM, Robert Muir <rc...@gmail.com> wrote:
> 2013/2/28 Steve Rowe <sa...@gmail.com>:
>> EnglishAnalyzer has used PorterStemmer instead of the English Snowball stemmer since it was created in 2010 as part of LUCENE-2055[2].  I think this is an oversight: EnglishAnalyzer should incorporate the best English stemmer we've got, and Martin Porter says the Porter2 stemmer is better[1].  Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do you think?
> 
> This was intentional, actually. The default is a tradeoff between the
> "benefits" of Porter2 (which affect less than 5% of English vocabulary,
> if you read around the Snowball site) and a much more significant
> performance difference.
> 
> For example, see the tests I ran indexing both short and long texts:
> 
> http://find.searchhub.org/document/c1d3301b71dab5ca#46a8351089a98aec
> 
> That's overall indexing speed, not just text analysis.
> 
> The Snowball stemmer might also be faster these days (we've made some
> improvements).


Thanks for the explanation.

I ran a lucene/benchmark alg comparing the two stemmers on trunk on my MacBook Pro with Oracle Java 1.7.0_13, and it looks like the situation hasn't changed much.

The original-algorithm Porter stemmer is 4 times faster than the Porter2/English Snowball stemmer, resulting in 40% higher throughput in a full English analysis pipeline.

So the default English stemmer choice is still valid IMO.

Here's porter-comparison.alg:

-----
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
doc.tokenized=false
doc.body.tokenized=true
docs.dir=reuters-out

-AnalyzerFactory(name:original-porter-stemmer,StandardTokenizer,
  StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
  PorterStemFilter)

-AnalyzerFactory(name:porter2-stemmer,StandardTokenizer,
  StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
  SnowballPorterFilter(language:English))

-AnalyzerFactory(name:no-stemmer,StandardTokenizer,
  StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter)

{ "Rounds"
    -NewAnalyzer(original-porter-stemmer)
    -ResetInputs 
    { "Original Porter Stemmer" { ReadTokens > : 20000 }

    -NewAnalyzer(porter2-stemmer)
    -ResetInputs 
    { "Porter2/English Stemmer" { ReadTokens > : 20000 }

    -NewAnalyzer(no-stemmer)
    -ResetInputs 
    { "No Stemmer" { ReadTokens > : 20000 }

    NewRound
} : 5
RepSumByNameRound
-----

And the results (regrouped; ordered by elapsedSec) - a "rec" is a token:

-----
Operation               round  recsPerRun         rec/s  elapsedSec

No Stemmer                  2     1814029  1,234,873.38        1.47
No Stemmer                  4     1814029  1,234,873.38        1.47
No Stemmer                  1     1814029  1,230,684.50        1.47
No Stemmer                  0     1814029  1,227,353.88        1.48
No Stemmer                  3     1814029  1,226,524.00        1.48

Original Porter Stemmer     1     1814029  1,074,025.50        1.69
Original Porter Stemmer     4     1814029  1,065,196.12        1.70
Original Porter Stemmer     2     1814029  1,056,510.75        1.72
Original Porter Stemmer     3     1814029  1,030,698.31        1.76
Original Porter Stemmer     0     1814029    685,833.25        2.64

Porter2/English Stemmer     4     1814029    768,656.38        2.36
Porter2/English Stemmer     2     1814029    764,123.44        2.37
Porter2/English Stemmer     1     1814029    758,056.44        2.39
Porter2/English Stemmer     3     1814029    758,056.44        2.39
Porter2/English Stemmer     0     1814029    716,158.31        2.53
-----

Best of 5 results (elapsed time; subtracting the no-stemmer baseline gives the cost of the stemmer alone):

             No Stemmer: 1.47s
Original Porter Stemmer: 1.69s - 1.47s = 0.22s
Porter2/English Stemmer: 2.36s - 1.47s = 0.89s

Stemmer cost ratio: 0.89s / 0.22s = 4x

Throughput increase of the full pipeline with the original Porter stemmer: (2.36s-1.69s)/1.69s * 100 = 40%
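
The arithmetic above can be reproduced with a short Java sketch. The class name StemmerOverhead and its method names are hypothetical, chosen only for illustration; the elapsed times are copied from the results table. Because every run processes the same number of tokens, the relative throughput gain equals the relative difference in elapsed time.

```java
class StemmerOverhead {
    // Elapsed seconds per round, copied from the results table.
    static final double[] NO_STEMMER = {1.47, 1.47, 1.47, 1.48, 1.48};
    static final double[] ORIGINAL_PORTER = {1.69, 1.70, 1.72, 1.76, 2.64};
    static final double[] PORTER2 = {2.36, 2.37, 2.39, 2.39, 2.53};

    /** Returns the smallest (fastest) elapsed time across all rounds. */
    static double bestOf(double[] elapsedSec) {
        if (elapsedSec.length == 0) {
            throw new IllegalArgumentException("no rounds recorded");
        }
        double best = elapsedSec[0];
        for (double t : elapsedSec) {
            best = Math.min(best, t);
        }
        return best;
    }

    /** Time spent in the stemmer alone: pipeline time minus the no-stemmer baseline. */
    static double stemmerCost(double pipelineSec, double baselineSec) {
        return pipelineSec - baselineSec;
    }

    /** Percent throughput gain of the faster pipeline, assuming equal token counts. */
    static double throughputIncreasePercent(double slowSec, double fastSec) {
        return (slowSec - fastSec) / fastSec * 100.0;
    }

    public static void main(String[] args) {
        double baseline = bestOf(NO_STEMMER);
        double porter = bestOf(ORIGINAL_PORTER);
        double porter2 = bestOf(PORTER2);

        double porterCost = stemmerCost(porter, baseline);
        double porter2Cost = stemmerCost(porter2, baseline);

        System.out.printf("Original Porter stemmer cost: %.2fs%n", porterCost);
        System.out.printf("Porter2 stemmer cost:         %.2fs%n", porter2Cost);
        System.out.printf("Stemmer cost ratio:           %.1fx%n", porter2Cost / porterCost);
        System.out.printf("Pipeline throughput increase: %.0f%%%n",
                throughputIncreasePercent(porter2, porter));
    }
}
```

Taking the best of five rounds rather than the mean discounts the slow first round of the original Porter stemmer (2.64s), which is probably JVM warm-up rather than stemmer cost.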

Steve

