Posted to solr-dev@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2010/01/08 17:17:58 UTC

idea to speed up indexing defaults

Hello,

I have been running some tests with english and I noticed that Solr
uses the very slow Porter2 snowball stemmer by default.
In LUCENE-2194 i have proposed a patch to speed this up, of course it
will never be picked up by solr due to the way snowball is
reimplemented here.
This would increase indexing speed for the default type text, etc. by about 10%, not much.

But actually i would like to propose instead that the PorterStemFilter
(Porter 1) from lucene core be defined as the default instead.
This is significantly faster (my indexing speed was about 2x as fast!)
than this Porter2 snowball stemmer.
I did some relevance tests on a test collection and it actually came
out on top as far as relevance, too.

I suppose the thing blocking the use of PorterStemFilter is protWords
functionality, but in LUCENE-1515 i proposed adding this to all lucene
stemmers, so maybe we could remove the snowball duplication and
possibly change the default stemmer to the faster PorterStemFilter in
lucene core.
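
A minimal sketch of the protected-words idea discussed above: wrap any
stemmer so that terms in a protected set pass through unstemmed. This is an
illustration only, not the LUCENE-1515 patch and not Lucene or Solr API. The
names ProtectedStemming, toyStem, withProtectedWords and stemAll are
hypothetical, and toyStem is a toy suffix stripper rather than the Porter
algorithm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class ProtectedStemming {
    // Toy stand-in for a real stemmer (assumption): strips a trailing "ing" or "s".
    // This is NOT the Porter algorithm; it only makes the sketch runnable.
    public static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Wraps any stemmer so that terms in the protected set are returned unchanged.
    public static UnaryOperator<String> withProtectedWords(
            UnaryOperator<String> stemmer, Set<String> protectedWords) {
        return term -> protectedWords.contains(term) ? term : stemmer.apply(term);
    }

    // Applies the given stemmer to every token, preserving order.
    public static List<String> stemAll(List<String> tokens, UnaryOperator<String> stemmer) {
        return tokens.stream().map(stemmer).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> protectedWords = new HashSet<>(Arrays.asList("windows"));
        UnaryOperator<String> stemmer =
                withProtectedWords(ProtectedStemming::toyStem, protectedWords);
        // "windows" is protected, so only "doors" and "indexing" are stemmed.
        System.out.println(stemAll(Arrays.asList("windows", "doors", "indexing"), stemmer));
    }
}
```

Because the wrapper only needs a term-to-term function, the same protection
logic could in principle sit in front of any stemmer, which is the point of
not duplicating it per implementation.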

so basically, i am asking: is there a specific reason this slower
Snowball("English") Porter2 filter is defined as a default?

If there isn't, i'd like to suggest we move in these directions,
although it will take some time and not really work until solr and
lucene are synced up again.

thanks in advance for any ideas.

-- 
Robert Muir
rcmuir@gmail.com

Re: idea to speed up indexing defaults

Posted by Robert Muir <rc...@gmail.com>.

Grant, thanks for the feedback. yeah i meant example schema when i
said default... :)

i honestly found that the snowball stuff has a lot of overhead: for
example you can compare Snowball("Porter") with the java
PorterStemmer, and it's significantly slower... these do exactly the same
thing though!

i semi-seriously tried to look for additional speedups but i don't see any
more easy wins, it appears to me from the profiler that the remaining
difference, once you get rid of this string junk, is just what you
would expect from hand-coded impl versus generated code.
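
A rough sketch of the kind of overhead described above, assuming the "string
junk" means converting each token's char buffer to a String and back. Both
methods apply the same toy rule (drop a trailing "s"), which is a
hypothetical stand-in and not a real stemming algorithm; the class and
method names are made up for illustration and are not Lucene or Snowball
API.

```java
public class StemOverheadSketch {
    // String-based path: builds a String per token, possibly a second one for
    // the result, then copies the characters back into the buffer.
    public static int stemViaString(char[] buffer, int length) {
        String term = new String(buffer, 0, length);
        String stemmed = (term.length() > 3 && term.endsWith("s"))
                ? term.substring(0, term.length() - 1)
                : term;
        stemmed.getChars(0, stemmed.length(), buffer, 0);
        return stemmed.length();
    }

    // In-place path: inspects the buffer directly and returns the new length,
    // without allocating any objects.
    public static int stemInPlace(char[] buffer, int length) {
        if (length > 3 && buffer[length - 1] == 's') {
            return length - 1;
        }
        return length;
    }

    public static void main(String[] args) {
        char[] a = "stemmers".toCharArray();
        char[] b = "stemmers".toCharArray();
        int lenA = stemViaString(a, a.length);
        int lenB = stemInPlace(b, b.length);
        // Both paths produce the same stem; only the per-token work differs.
        System.out.println(new String(a, 0, lenA) + " " + new String(b, 0, lenB));
    }
}
```

The two methods return the same result for every input, so any speed
difference between them comes purely from allocation and copying, not from
the stemming rule itself.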

On Fri, Jan 8, 2010 at 12:09 PM, Grant Ingersoll <gs...@apache.org> wrote:
> It's a bit odd, but Solr doesn't really have a "default".  What it has is an example schema.  Unfortunately, everyone treats the example as the default, so...
>
> Yes, it would make sense to speed up the "default" schema as much as possible.  There are probably other token filters in there that could be removed, too.
>
> It's very good that you are doing this, as I've been wondering lately if it doesn't make sense to seriously evaluate speeding up all the snowball stuff.
>
> It shouldn't be that far off, right?  I think there is movement underway to put Solr on 3.x.



-- 
Robert Muir
rcmuir@gmail.com

Re: idea to speed up indexing defaults

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 8, 2010, at 11:17 AM, Robert Muir wrote:

> Hello,
> 
> I have been running some tests with english and I noticed that Solr
> uses the very slow Porter2 snowball stemmer by default.
> In LUCENE-2194 i have proposed a patch to speed this up, of course it
> will never be picked up by solr due to the way snowball is
> reimplemented here.
> This would increase indexing speed for the default type text, etc. by about 10%, not much.
> 
> But actually i would like to propose instead that the PorterStemFilter
> (Porter 1) from lucene core be defined as the default instead.
> This is significantly faster (my indexing speed was about 2x as fast!)
> than this Porter2 snowball stemmer.
> I did some relevance tests on a test collection and it actually came
> out on top as far as relevance, too.
> 
> I suppose the thing blocking the use of PorterStemFilter is protWords
> functionality, but in LUCENE-1515 i proposed adding this to all lucene
> stemmers, so maybe we could remove the snowball duplication and
> possibly change the default stemmer to the faster PorterStemFilter in
> lucene core.
> 
> so basically, i am asking: is there a specific reason this slower
> Snowball("English") Porter2 filter is defined as a default?

It's a bit odd, but Solr doesn't really have a "default".  What it has is an example schema.  Unfortunately, everyone treats the example as the default, so...

Yes, it would make sense to speed up the "default" schema as much as possible.  There are probably other token filters in there that could be removed, too.

It's very good that you are doing this, as I've been wondering lately if it doesn't make sense to seriously evaluate speeding up all the snowball stuff.

> 
> If there isn't, i'd like to suggest we move in these directions,
> although it will take some time and not really work until solr and
> lucene are synced up again.

It shouldn't be that far off, right?  I think there is movement underway to put Solr on 3.x.