Posted to java-user@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2013/03/15 16:29:47 UTC
Re: Migrating SnowballAnalyzer to 4.1
2013/2/28 Steve Rowe <sa...@gmail.com>:
> EnglishAnalyzer has used PorterStemmer instead of the English Snowball stemmer since it was created in 2010 as part of LUCENE-2055[2]. I think this is an oversight: EnglishAnalyzer should incorporate the best English stemmer we've got, and Martin Porter says the Porter2 stemmer is better[1]. Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do you think?
This was actually intentional. The default is a tradeoff between the
"benefits" of Porter2 (which affect less than 5% of English vocabulary,
if you read around the Snowball site) and a much more significant
performance difference.
For example, when I tested indexing both short and long texts:
http://find.searchhub.org/document/c1d3301b71dab5ca#46a8351089a98aec
That's overall indexing speed, not just text analysis.
The Snowball stemmer might be faster these days, too (we've made some
improvements).
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Migrating SnowballAnalyzer to 4.1
Posted by Robert Muir <rc...@gmail.com>.
On Sat, Mar 16, 2013 at 12:57 AM, Steve Rowe <sa...@gmail.com> wrote:
>
> Thanks for the explanation.
>
> I ran a lucene/benchmark alg comparing the two stemmers on trunk on my MacBook Pro with Oracle Java 1.7.0_13, and it looks like the situation hasn't changed much.
>
> The original-algorithm Porter stemmer is 4 times faster than the Porter2/English Snowball stemmer, resulting in 40% higher throughput in a full English analysis pipeline.
>
> So the default English stemmer choice is still valid IMO.
>
Thanks a lot for running this!
Re: Migrating SnowballAnalyzer to 4.1
Posted by Steve Rowe <sa...@gmail.com>.
Hi Robert,
On Mar 15, 2013, at 11:29 AM, Robert Muir <rc...@gmail.com> wrote:
> 2013/2/28 Steve Rowe <sa...@gmail.com>:
>> EnglishAnalyzer has used PorterStemmer instead of the English Snowball stemmer since it was created in 2010 as part of LUCENE-2055[2]. I think this is an oversight: EnglishAnalyzer should incorporate the best English stemmer we've got, and Martin Porter says the Porter2 stemmer is better[1]. Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do you think?
>
> This was actually intentional. The default is a tradeoff between the
> "benefits" of Porter2 (which affect less than 5% of English vocabulary,
> if you read around the Snowball site) and a much more significant
> performance difference.
>
> For example, when I tested indexing both short and long texts:
>
> http://find.searchhub.org/document/c1d3301b71dab5ca#46a8351089a98aec
>
> That's overall indexing speed, not just text analysis.
>
> The Snowball stemmer might be faster these days, too (we've made some
> improvements).
Thanks for the explanation.
I ran a lucene/benchmark alg comparing the two stemmers on trunk on my MacBook Pro with Oracle Java 1.7.0_13, and it looks like the situation hasn't changed much.
The original-algorithm Porter stemmer is 4 times faster than the Porter2/English Snowball stemmer, resulting in 40% higher throughput in a full English analysis pipeline.
So the default English stemmer choice is still valid IMO.
Here's porter-comparison.alg:
-----
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
doc.tokenized=false
doc.body.tokenized=true
docs.dir=reuters-out

-AnalyzerFactory(name:original-porter-stemmer,StandardTokenizer,
                 StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
                 PorterStemFilter)

-AnalyzerFactory(name:porter2-stemmer,StandardTokenizer,
                 StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
                 SnowballPorterFilter(language:English))

-AnalyzerFactory(name:no-stemmer,StandardTokenizer,
                 StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter)

{ "Rounds"
    -NewAnalyzer(original-porter-stemmer)
    -ResetInputs
    { "Original Porter Stemmer" { ReadTokens > : 20000 }

    -NewAnalyzer(porter2-stemmer)
    -ResetInputs
    { "Porter2/English Stemmer" { ReadTokens > : 20000 }

    -NewAnalyzer(no-stemmer)
    -ResetInputs
    { "No Stemmer" { ReadTokens > : 20000 }

    NewRound
} : 5

RepSumByNameRound
-----
And the results (regrouped by operation, then ordered by elapsedSec within each group); a "rec" is a token:
-----
Operation                round  recsPerRun         rec/s  elapsedSec
No Stemmer                   2     1814029  1,234,873.38        1.47
No Stemmer                   4     1814029  1,234,873.38        1.47
No Stemmer                   1     1814029  1,230,684.50        1.47
No Stemmer                   0     1814029  1,227,353.88        1.48
No Stemmer                   3     1814029  1,226,524.00        1.48
Original Porter Stemmer      1     1814029  1,074,025.50        1.69
Original Porter Stemmer      4     1814029  1,065,196.12        1.70
Original Porter Stemmer      2     1814029  1,056,510.75        1.72
Original Porter Stemmer      3     1814029  1,030,698.31        1.76
Original Porter Stemmer      0     1814029    685,833.25        2.64
Porter2/English Stemmer      4     1814029    768,656.38        2.36
Porter2/English Stemmer      2     1814029    764,123.44        2.37
Porter2/English Stemmer      1     1814029    758,056.44        2.39
Porter2/English Stemmer      3     1814029    758,056.44        2.39
Porter2/English Stemmer      0     1814029    716,158.31        2.53
-----
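Best-of-5 elapsed times, with the stemming cost taken as the difference from the no-stemmer baseline:
No Stemmer: 1.47s
Original Porter Stemmer: 1.69s - 1.47s = 0.22s
Porter2/English Stemmer: 2.36s - 1.47s = 0.89s (about 4 times the original Porter cost)
Throughput increase from using the original Porter stemmer: (2.36s - 1.69s) / 1.69s * 100 = 40%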
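For anyone who wants to re-check the arithmetic, here is a minimal Java sketch that recomputes the figures from the elapsedSec column of the table above. The class and method names are made up for illustration and are not part of lucene/benchmark; the per-round numbers are copied from the table, indexed by round 0 to 4.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of lucene/benchmark.
class StemmerBenchmarkMath {
    // elapsedSec per round (rounds 0..4), copied from the results table.
    static final double[] NO_STEMMER = {1.48, 1.47, 1.47, 1.48, 1.47};
    static final double[] PORTER     = {2.64, 1.69, 1.72, 1.76, 1.70};
    static final double[] PORTER2    = {2.53, 2.39, 2.37, 2.39, 2.36};

    // Best-of-N: the smallest elapsed time across rounds.
    static double best(double[] elapsedSec) {
        return Arrays.stream(elapsedSec).min().getAsDouble();
    }

    // Time attributed to stemming: best time with the stemmer minus best baseline time.
    static double overhead(double[] withStemmer, double[] baseline) {
        return best(withStemmer) - best(baseline);
    }

    // How many times more stemming time Porter2 needs than the original Porter stemmer.
    static double overheadRatio() {
        return overhead(PORTER2, NO_STEMMER) / overhead(PORTER, NO_STEMMER);
    }

    // Every run processes the same number of tokens, so the throughput gain of the
    // Porter pipeline over the Porter2 pipeline follows from the elapsed times alone.
    static double throughputGainPercent() {
        return (best(PORTER2) - best(PORTER)) / best(PORTER) * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "Porter overhead:  %.2fs%n", overhead(PORTER, NO_STEMMER));
        System.out.printf(Locale.ROOT, "Porter2 overhead: %.2fs%n", overhead(PORTER2, NO_STEMMER));
        System.out.printf(Locale.ROOT, "Overhead ratio:   %.1fx%n", overheadRatio());
        System.out.printf(Locale.ROOT, "Throughput gain:  %.0f%%%n", throughputGainPercent());
    }
}
```

Taking the minimum over rounds discards the slow warm-up round (round 0 for the original Porter stemmer), which is presumably why best-of-5 was used rather than the mean.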
Steve