Posted to solr-dev@lucene.apache.org by Yonik Seeley <yo...@lucidimagination.com> on 2009/08/09 16:29:05 UTC

indexing slowdown with latest lucene update

I did some quick indexing performance tests right before and right
after the last lucene jar update - the results are not good... about
30% slower.
The test was an 80 MB text field, 100K documents, 6 short text fields
per document, with the solrconfig/schema from trunk copied to both
environments.

I imagine this has to do with the new TokenStream stuff in Lucene, and
how back compatibility is implemented (which I haven't followed, but
which may now involve reflection).  We've never cached tokenstreams,
with everything that involves, but it may be that we will be forced to
do so to recover the performance loss.

-Yonik
http://www.lucidimagination.com

Re: indexing slowdown with latest lucene update

Posted by Mark Miller <ma...@gmail.com>.

Mark Miller wrote:
> Looks like there are a couple spots to blame, but mostly, 
> TokenStream$isMethodOverloaded takes most of the blame. Appears very 
> slow.
>
> - Mark
Or it's just called too often, the way Solr does things, for how fast it is.

Here are the profiling results:

Before
r801845
http://myhardshadow.com/images/before.png

After
r802556
http://myhardshadow.com/images/after.png

-- 
- Mark

http://www.lucidimagination.com

Re: indexing slowdown with latest lucene update

Posted by Mark Miller <ma...@gmail.com>.

Looks like there are a couple of spots to blame, but
TokenStream$isMethodOverloaded takes most of it. It appears very slow.

- Mark

Re: indexing slowdown with latest lucene update

Posted by Robert Muir <rc...@gmail.com>.

I am concerned about this one as well, especially since the majority
of the language analyzers in lucene-contrib do not implement
reusableTokenStream.

-- 
Robert Muir
rcmuir@gmail.com

Re: indexing slowdown with latest lucene update

Posted by Grant Ingersoll <gs...@apache.org>.

Additionally, I think we in Solr land need to set up a Hudson build
that installs the Lucene jars nightly, runs our tests (including
performance tests), and reports any problems to solr-dev; those can
then be pushed to java-dev if needed.  This way, we have a
forward-looking view of what's coming down the pike in Lucene.  I
really think Solr committers need to be a bit more proactive in
Lucene land, as it can have a significant impact on us.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: indexing slowdown with latest lucene update

Posted by Grant Ingersoll <gs...@apache.org>.

FWIW, seems like these issues should be brought up on java-dev.  Even  
if the changes in Lucene are back compatible, that's not much help if  
the large majority of users are going to take a similar hit to what  
Solr is taking.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: indexing slowdown with latest lucene update

Posted by Mark Miller <ma...@gmail.com>.

isMethodOverriden is just nasty - copying Methods, security checks,
walking the type hierarchy, this, that, some more. I bet cglib has a
really fast version - too bad there is no built-in equivalent.

It's not nearly as clean, but what if a new TokenStream simply identified
itself as supporting increment, and the default impl returned false? The
developer knows at compile time, right? There is almost no reason to keep
asking the code over and over again, especially since it's so expensive.
Then reusable doubles the cost.
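
A minimal sketch of the idea above, contrasting a self-declared capability
flag with a reflective walk up the type hierarchy. Everything here is
hypothetical: SketchStream, supportsIncrement() and the two subclasses are
illustrative names, not the Lucene TokenStream API, and the reflective check
is only a rough stand-in for what isMethodOverriden does.

```java
// Hypothetical stand-in for a token stream base class; NOT the Lucene API.
abstract class SketchStream {
    /** Proposed opt-in flag: the default impl says "old API only". */
    boolean supportsIncrement() { return false; }

    boolean incrementToken() { return false; }

    /** Reflection-based check for comparison: walks up the class hierarchy. */
    static boolean overridesIncrementToken(Class<?> clazz) {
        for (Class<?> c = clazz; c != null && c != SketchStream.class; c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod("incrementToken");
                return true; // declared somewhere below the base class
            } catch (NoSuchMethodException e) {
                // not declared at this level, keep walking up
            }
        }
        return false;
    }
}

/** A stream written against the new API: it overrides both methods. */
final class NewStyleStream extends SketchStream {
    @Override boolean supportsIncrement() { return true; }
    @Override boolean incrementToken() { return false; }
}

/** A stream that only knows the old API: it inherits the default flag. */
final class OldStyleStream extends SketchStream { }

final class FlagVsReflectionDemo {
    public static void main(String[] args) {
        // The flag is a plain virtual call; the reflective check does a lookup per level.
        System.out.println(new NewStyleStream().supportsIncrement());
        System.out.println(SketchStream.overridesIncrementToken(OldStyleStream.class));
    }
}
```

The trade-off is the one already mentioned: the flag is cheap but relies on
the developer keeping it in sync with what the class actually implements.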

-- 
- Mark

http://www.lucidimagination.com

Re: indexing slowdown with latest lucene update

Posted by Mark Miller <ma...@gmail.com>.

Michael Busch wrote:
> Are you sure that the initialization costs of the 
> TokenStream/AttributeSource cause the slowdown? With the bw-comp. code 
> now every call of a Token method goes through a delegation layer. I'm 
> afraid that might cause a slowdown?
It's isMethodOverriden and TokenStream<init>(AttributeSource).
>
> The code that figures out what Attributes to put into the map uses 
> reflection, but only if the impl wasn't seen before; otherwise the 
> attributes are looked up in a cache.
>
> The culprit could also be the reflection code that checks which 
> TokenStream methods are implemented.
>
> I can't look at the code right now (writing on my cell).
> Even if this is "fixable", I don't really like the fact that users who 
> upgrade to 2.9 will potentially see such a performance hit unless they 
> implement incrementToken() and reusableTokenStream.
Looks like you take a good hit, but keep in mind that this test is almost
a worst-case scenario as well - the Document text is extremely short.

-- 
- Mark

http://www.lucidimagination.com

Re: indexing slowdown with latest lucene update

Posted by Michael Busch <bu...@gmail.com>.

Are you sure that the initialization costs of the
TokenStream/AttributeSource cause the slowdown? With the bw-comp. code
now every call of a Token method goes through a delegation layer. I'm
afraid that might cause a slowdown?

The code that figures out what Attributes to put into the map uses  
reflection, but only if the impl wasn't seen before; otherwise the  
attributes are looked up in a cache.
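
To illustrate the "reflect once per implementation class, then hit a cache"
pattern described above, here is a rough sketch. OverrideCache and the two
sample classes are made-up names for illustration, not Lucene code, and
computeIfAbsent needs Java 8 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical cache: one reflective lookup per class, then plain map reads. */
final class OverrideCache {
    private static final String METHOD = "incrementToken";
    private static final Map<Class<?>, Boolean> CACHE = new ConcurrentHashMap<>();

    /** Counts how often reflection actually ran; for illustration only. */
    static final AtomicInteger reflectiveLookups = new AtomicInteger();

    static boolean declaresIncrementToken(Class<?> clazz) {
        return CACHE.computeIfAbsent(clazz, c -> {
            reflectiveLookups.incrementAndGet();
            try {
                c.getDeclaredMethod(METHOD);
                return true;
            } catch (NoSuchMethodException e) {
                return false;
            }
        });
    }
}

/** Sample class that declares the method. */
final class WithIncrement {
    boolean incrementToken() { return false; }
}

/** Sample class that does not declare it. */
final class WithoutIncrement { }

final class OverrideCacheDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            OverrideCache.declaresIncrementToken(WithIncrement.class);
            OverrideCache.declaresIncrementToken(WithoutIncrement.class);
        }
        // Reflection ran once per distinct class, not once per call.
        System.out.println(OverrideCache.reflectiveLookups.get());
    }
}
```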

The culprit could also be the reflection code that checks which  
TokenStream methods are implemented.

I can't look at the code right now (writing on my cell).
Even if this is "fixable", I don't really like the fact that users who  
upgrade to 2.9 will potentially see such a performance hit unless they  
implement incrementToken() and reusableTokenStream.

  Michael

Re: indexing slowdown with latest lucene update

Posted by Yonik Seeley <yo...@lucidimagination.com>.

FYI
https://issues.apache.org/jira/browse/SOLR-1353

Re: indexing slowdown with latest lucene update

Posted by Yonik Seeley <yo...@lucidimagination.com>.

It looks like implementing the new attribute stuff will not be enough
- the token architecture has changed enough that it looks like we must
cache tokenstreams to get back to good performance.
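
As a rough sketch of what caching tokenstreams could look like: keep one
instance per thread and reset it for each new field value instead of
constructing a fresh one. SketchTokenizer and PerThreadTokenizers are
hypothetical names; this is not how Solr or Lucene implement reuse, just
the general shape of the technique.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive-to-construct token stream; NOT the Lucene API. */
final class SketchTokenizer {
    /** Counts constructions so the effect of reuse is visible. */
    static final AtomicInteger constructed = new AtomicInteger();

    private String[] parts = new String[0];
    private int pos = 0;

    SketchTokenizer() { constructed.incrementAndGet(); }

    /** Points the existing instance at new text instead of allocating a new one. */
    void reset(String text) {
        String trimmed = text.trim();
        parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        pos = 0;
    }

    /** Returns the next whitespace-separated token, or null when exhausted. */
    String next() {
        return pos < parts.length ? parts[pos++] : null;
    }
}

/** One cached tokenizer per thread, reused across documents. */
final class PerThreadTokenizers {
    private static final ThreadLocal<SketchTokenizer> CACHE =
        ThreadLocal.withInitial(SketchTokenizer::new);

    static List<String> tokenize(String text) {
        SketchTokenizer t = CACHE.get(); // constructed only on first use per thread
        t.reset(text);
        List<String> out = new ArrayList<>();
        for (String tok = t.next(); tok != null; tok = t.next()) {
            out.add(tok);
        }
        return out;
    }
}

final class ReuseDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            PerThreadTokenizers.tokenize("short field " + i);
        }
        System.out.println(SketchTokenizer.constructed.get());
    }
}
```

Any per-instance setup cost (such as reflective checks in a constructor) is
then paid once per thread rather than once per field.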

-Yonik
http://www.lucidimagination.com


Re: indexing slowdown with latest lucene update

Posted by Yonik Seeley <yo...@lucidimagination.com>.

OK, I've isolated (magnified) the effect with a test I just checked in.
Indexing documents directly at the UpdateHandler was 85% faster before
the latest lucene update.

Run the test like this:

ant test -Dtestcase=TestIndexingPerformance -Dargs="-server
-Diter=100000"; grep throughput
build/test-results/*TestIndexingPerformance.xml

To run on an older trunk version, just copy over
src/test/org/apache/solr/update/TestIndexingPerformance.java
src/test/test-files/solr/conf/solrconfig_perf.xml

I had a throughput of 10946 docs/sec before the lucene update, and 5849 after.
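
For reference, the two throughput figures relate to the percentages as
follows: 10946 vs 5849 docs/sec works out to roughly 87% faster before
(close to the 85% quoted above), or roughly 47% slower after. A quick
sketch of the arithmetic; the class and method names are illustrative only.

```java
/** Relates the two throughput figures from the message to relative percentages. */
final class ThroughputDelta {
    /** How much faster the old build was, as a percentage of the new throughput. */
    static double percentFasterBefore(double before, double after) {
        return (before / after - 1.0) * 100.0;
    }

    /** How much throughput was lost, as a percentage of the old throughput. */
    static double percentSlowerAfter(double before, double after) {
        return (1.0 - after / before) * 100.0;
    }

    public static void main(String[] args) {
        double before = 10946; // docs/sec before the lucene update (from the message)
        double after = 5849;   // docs/sec after the lucene update (from the message)
        System.out.printf("faster before: %.1f%%%n", percentFasterBefore(before, after));
        System.out.printf("slower after:  %.1f%%%n", percentSlowerAfter(before, after));
    }
}
```

Which of the two numbers gets quoted depends on the baseline, which is worth
stating whenever the results are compared.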

-Yonik
http://www.lucidimagination.com


Re: indexing slowdown with latest lucene update

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Sun, Aug 9, 2009 at 12:01 PM, Grant Ingersoll<gs...@apache.org> wrote:
> Or bite the bullet and upgrade to the incrementToken() method.

Right - I'm not sure if that would fix it or not - I haven't been
involved in the new Token attribute stuff...
I'm currently writing a basic indexing unit test that we can use to
measure this (the standard solrconfig does stuff that slows down
indexing a lot, but helps in catching bugs on edge cases by creating
many segments).

-Yonik
http://www.lucidimagination.com

Re: indexing slowdown with latest lucene update

Posted by Grant Ingersoll <gs...@apache.org>.

On Aug 9, 2009, at 10:29 AM, Yonik Seeley wrote:

> I did some quick indexing performance tests right before and right
> after the last lucene jar update - the results are not good... about
> 30% slower.
> The test was an 80 MB text field, 100K documents, 6 short text fields
> per document, with the solrconfig/schema from trunk copied to both
> environments.
>
> I imagine this has to do with the new TokenStream stuff in Lucene, and
> how back compatibility is implemented (which I haven't followed, but
> which may now involve reflection).  We've never cached tokenstreams,
> with everything that involves, but it may be that we will be forced to
> do so to recover the performance loss.

Or bite the bullet and upgrade to the incrementToken() method.  It
likely isn't that bad, maybe a few hours of work.

Still, we should try to isolate exactly where it is happening.