Posted to dev@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2009/08/10 16:35:58 UTC

RE: indexing_slowdown_with_latest_lucene_udpate

Discussion on speed of new TokenStream API in Solr.

see: 
http://search.lucidimagination.com/search/document/d0040ebe6addad4b/indexing_slowdown_with_latest_lucene_udpate

-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: indexing_slowdown_with_latest_lucene_udpate

Posted by Uwe Schindler <uw...@thetaphi.de>.
AttributeSource.addAttributeImpl() also has such a cache, which helped very
much. The isMethodOverridden check is the only place where no cache is used.
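
For illustration only, here is a rough sketch of what such a per-class cache
could look like (a hypothetical helper, not the actual Lucene code or the
proposed patch): the reflection walk runs once per concrete class, and the
answer is remembered in a synchronized IdentityHashMap.

import java.lang.reflect.Method;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch: remember per concrete class whether a given method
// was overridden, so the reflection lookup only happens the first time a
// class is seen.
final class MethodOverrideCache {

  // Class objects are canonical per class loader, so identity comparison works.
  private static final Map<Class<?>, Boolean> cache =
      Collections.synchronizedMap(new IdentityHashMap<Class<?>, Boolean>());

  // Returns true if clazz overrides the named method declared in baseClass.
  // The result is cached per class; this sketch assumes the cache is only
  // ever used for one method.
  static boolean isMethodOverridden(Class<?> clazz, Class<?> baseClass,
                                    String methodName, Class<?>... parameterTypes) {
    Boolean cached = cache.get(clazz);
    if (cached != null) {
      return cached.booleanValue();
    }
    boolean overridden;
    try {
      // getMethod() resolves the most-derived public declaration, so the
      // declaring class tells us whether the base implementation was replaced.
      Method m = clazz.getMethod(methodName, parameterTypes);
      overridden = (m.getDeclaringClass() != baseClass);
    } catch (NoSuchMethodException nsme) {
      overridden = false;
    }
    cache.put(clazz, Boolean.valueOf(overridden));
    return overridden;
  }
}

The uncontended synchronized lookup should be far cheaper than repeating
getMethod() for every new TokenStream instance, though only benchmarks can
confirm that.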

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Monday, August 10, 2009 5:02 PM
> To: java-dev@lucene.apache.org
> Subject: RE: indexing_slowdown_with_latest_lucene_udpate
> 
> The question is whether that would get better if the reflection calls were
> only done once per class, using an IdentityHashMap<Class,Boolean>. The other
> reflection code in AttributeSource uses a static cache for this type of
> thing (e.g. the Attribute -> AttributeImpl mappings in
> AttributeSource.DefaultAttributeFactory.getClassForInterface()).
>
> I could run some tests on that and supply a patch. I was thinking about it
> but threw the idea away (as it needs some synchronization on the cache Map,
> whose cost may also outweigh the benefit).
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> > -----Original Message-----
> > From: Mark Miller [mailto:markrmiller@gmail.com]
> > Sent: Monday, August 10, 2009 4:48 PM
> > To: java-dev@lucene.apache.org
> > Subject: Re: indexing_slowdown_with_latest_lucene_udpate
> >
> > Robert Muir wrote:
> > > This is real and not just for very short docs.
> > Yes, you still pay the cost for longer docs, but it becomes less important
> > the longer the docs are, as it plays a smaller role. Load a ton of one-term
> > docs and it might be 50-60% slower; add a bunch of articles and it might be
> > closer to 15-20% (I don't know the exact numbers, but the longer I made the
> > docs, the smaller the percentage slowdown, obviously). Still a good hit,
> > but a short-doc test magnifies the problem.
> >
> > It affects things no matter what, but when you don't do much tokenizing or
> > normalizing, the cost of the reflection/tokenstream init dominates.
> >
> > - Mark
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: indexing_slowdown_with_latest_lucene_udpate

Posted by Earwin Burrfoot <ea...@gmail.com>.
Or we can just throw that detection out the window, in exchange for a less
smooth back-compat experience, less hacky code, and no slowdown.

On Mon, Aug 10, 2009 at 19:02, Uwe Schindler<uw...@thetaphi.de> wrote:
> The question is whether that would get better if the reflection calls were
> only done once per class, using an IdentityHashMap<Class,Boolean>. The other
> reflection code in AttributeSource uses a static cache for this type of
> thing (e.g. the Attribute -> AttributeImpl mappings in
> AttributeSource.DefaultAttributeFactory.getClassForInterface()).
>
> I could run some tests on that and supply a patch. I was thinking about it
> but threw the idea away (as it needs some synchronization on the cache Map,
> whose cost may also outweigh the benefit).
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: Mark Miller [mailto:markrmiller@gmail.com]
>> Sent: Monday, August 10, 2009 4:48 PM
>> To: java-dev@lucene.apache.org
>> Subject: Re: indexing_slowdown_with_latest_lucene_udpate
>>
>> Robert Muir wrote:
>> > This is real and not just for very short docs.
>> Yes, you still pay the cost for longer docs, but it becomes less important
>> the longer the docs are, as it plays a smaller role. Load a ton of one-term
>> docs and it might be 50-60% slower; add a bunch of articles and it might be
>> closer to 15-20% (I don't know the exact numbers, but the longer I made the
>> docs, the smaller the percentage slowdown, obviously). Still a good hit,
>> but a short-doc test magnifies the problem.
>>
>> It affects things no matter what, but when you don't do much tokenizing or
>> normalizing, the cost of the reflection/tokenstream init dominates.
>>
>> - Mark
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: indexing_slowdown_with_latest_lucene_udpate

Posted by Uwe Schindler <uw...@thetaphi.de>.
I already started to prepare a patch... Let's open an issue! You could try
it out with your corpus and post numbers.

There is some additional slowdown with the new API if you do not reuse
TokenStreams, as setting up the Attribute maps is a small extra cost.
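
To make that cost concrete, here is a minimal sketch of a new-API (2.9-style)
stream - a hypothetical class, not one from Lucene. The addAttribute() call in
the field initializer is the Attribute-map setup in question: it runs once per
stream instance, so building a fresh stream per document repeats it, while a
reused stream pays it only once.

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical single-token stream using the new attribute-based API.
final class SingleTokenStream extends TokenStream {
  // addAttribute() builds/extends the attribute map - once per instance.
  private final TermAttribute termAtt =
      (TermAttribute) addAttribute(TermAttribute.class);
  private String value;
  private boolean done = false;

  SingleTokenStream(String value) {
    this.value = value;
  }

  public boolean incrementToken() throws IOException {
    if (done) return false;
    clearAttributes();
    termAtt.setTermBuffer(value);  // expose the single token
    done = true;
    return true;
  }

  // Point the same instance at new text; the attribute map is not rebuilt.
  void reuse(String newValue) {
    this.value = newValue;
    this.done = false;
  }
}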

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Monday, August 10, 2009 5:08 PM
> To: java-dev@lucene.apache.org
> Subject: Re: indexing_slowdown_with_latest_lucene_udpate
> 
> My bet is that that would still be much faster - uncontended synchronization
> is generally very fast, and the reflection-based check is extremely slow.
> 
> - Mark
> 
> Uwe Schindler wrote:
> > The question is whether that would get better if the reflection calls were
> > only done once per class, using an IdentityHashMap<Class,Boolean>. The
> > other reflection code in AttributeSource uses a static cache for this type
> > of thing (e.g. the Attribute -> AttributeImpl mappings in
> > AttributeSource.DefaultAttributeFactory.getClassForInterface()).
> >
> > I could run some tests on that and supply a patch. I was thinking about it
> > but threw the idea away (as it needs some synchronization on the cache
> > Map, whose cost may also outweigh the benefit).
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> >> -----Original Message-----
> >> From: Mark Miller [mailto:markrmiller@gmail.com]
> >> Sent: Monday, August 10, 2009 4:48 PM
> >> To: java-dev@lucene.apache.org
> >> Subject: Re: indexing_slowdown_with_latest_lucene_udpate
> >>
> >> Robert Muir wrote:
> >>
> >>> This is real and not just for very short docs.
> >>>
> >> Yes, you still pay the cost for longer docs, but it becomes less
> >> important the longer the docs are, as it plays a smaller role. Load a
> >> ton of one-term docs and it might be 50-60% slower; add a bunch of
> >> articles and it might be closer to 15-20% (I don't know the exact
> >> numbers, but the longer I made the docs, the smaller the percentage
> >> slowdown, obviously). Still a good hit, but a short-doc test magnifies
> >> the problem.
> >>
> >> It affects things no matter what, but when you don't do much tokenizing
> >> or normalizing, the cost of the reflection/tokenstream init dominates.
> >>
> >> - Mark
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> 
> 
> --
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: indexing_slowdown_with_latest_lucene_udpate

Posted by Mark Miller <ma...@gmail.com>.
My bet is that that would still be much faster - uncontended synchronization
is generally very fast, and the reflection-based check is extremely slow.

- Mark

Uwe Schindler wrote:
> The question is whether that would get better if the reflection calls were
> only done once per class, using an IdentityHashMap<Class,Boolean>. The other
> reflection code in AttributeSource uses a static cache for this type of
> thing (e.g. the Attribute -> AttributeImpl mappings in
> AttributeSource.DefaultAttributeFactory.getClassForInterface()).
>
> I could run some tests on that and supply a patch. I was thinking about it
> but threw the idea away (as it needs some synchronization on the cache Map,
> whose cost may also outweigh the benefit).
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>   
>> -----Original Message-----
>> From: Mark Miller [mailto:markrmiller@gmail.com]
>> Sent: Monday, August 10, 2009 4:48 PM
>> To: java-dev@lucene.apache.org
>> Subject: Re: indexing_slowdown_with_latest_lucene_udpate
>>
>> Robert Muir wrote:
>>     
>>> This is real and not just for very short docs.
>>>       
>> Yes, you still pay the cost for longer docs, but it becomes less important
>> the longer the docs are, as it plays a smaller role. Load a ton of one-term
>> docs and it might be 50-60% slower; add a bunch of articles and it might be
>> closer to 15-20% (I don't know the exact numbers, but the longer I made the
>> docs, the smaller the percentage slowdown, obviously). Still a good hit,
>> but a short-doc test magnifies the problem.
>>
>> It affects things no matter what, but when you don't do much tokenizing or
>> normalizing, the cost of the reflection/tokenstream init dominates.
>>
>> - Mark
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>     
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: indexing_slowdown_with_latest_lucene_udpate

Posted by Uwe Schindler <uw...@thetaphi.de>.
The question is whether that would get better if the reflection calls were
only done once per class, using an IdentityHashMap<Class,Boolean>. The other
reflection code in AttributeSource uses a static cache for this type of
thing (e.g. the Attribute -> AttributeImpl mappings in
AttributeSource.DefaultAttributeFactory.getClassForInterface()).

I could run some tests on that and supply a patch. I was thinking about it
but threw the idea away (as it needs some synchronization on the cache Map,
whose cost may also outweigh the benefit).
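
For context, the kind of static cache referred to above (the Attribute ->
AttributeImpl mapping) follows roughly this shape - a simplified stand-in,
not the actual DefaultAttributeFactory source:

import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Simplified stand-in for an interface -> implementation class cache:
// resolve e.g. TermAttribute to TermAttributeImpl once, then answer from
// the map on every later call.
final class AttributeImplLookup {

  private static final Map<Class<?>, Class<?>> cache =
      Collections.synchronizedMap(new WeakHashMap<Class<?>, Class<?>>());

  static Class<?> getClassForInterface(Class<?> attInterface) {
    Class<?> impl = cache.get(attInterface);
    if (impl == null) {
      try {
        // Convention: the implementation lives next to the interface and is
        // named "<InterfaceName>Impl".
        impl = Class.forName(attInterface.getName() + "Impl", true,
                             attInterface.getClassLoader());
      } catch (ClassNotFoundException cnfe) {
        throw new IllegalArgumentException(
            "Cannot find implementation class for attribute "
                + attInterface.getName(), cnfe);
      }
      cache.put(attInterface, impl);
    }
    return impl;
  }
}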

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Monday, August 10, 2009 4:48 PM
> To: java-dev@lucene.apache.org
> Subject: Re: indexing_slowdown_with_latest_lucene_udpate
> 
> Robert Muir wrote:
> > This is real and not just for very short docs.
> Yes, you still pay the cost for longer docs, but it becomes less important
> the longer the docs are, as it plays a smaller role. Load a ton of one-term
> docs and it might be 50-60% slower; add a bunch of articles and it might be
> closer to 15-20% (I don't know the exact numbers, but the longer I made the
> docs, the smaller the percentage slowdown, obviously). Still a good hit,
> but a short-doc test magnifies the problem.
>
> It affects things no matter what, but when you don't do much tokenizing or
> normalizing, the cost of the reflection/tokenstream init dominates.
> 
> - Mark
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: indexing_slowdown_with_latest_lucene_udpate

Posted by Robert Muir <rc...@gmail.com>.
BTW, my Lucene 2.4 numbers for this corpus (running many times) average
around 41s versus 44s, so it's still a small hit even for reasonably large
docs, using simple analyzers with reuse and all that.

So reusableTokenStream takes care of a lot of it, but not all of it.
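
For reference, the usual reuse pattern looks something like the sketch below -
modeled on how the core analyzers do it, not the PersianAnalyzer itself. The
tokenizer (and its attribute setup) is created once per thread and only
re-pointed at the new Reader for each document.

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Sketch of an analyzer supporting reuse via reusableTokenStream().
public final class ReusingAnalyzer extends Analyzer {

  // Non-reusing path: a brand-new stream (and attribute map) per call.
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseTokenizer(reader);
  }

  // Reusing path: hand back the per-thread stream created on the first call
  // and just reset it with the new Reader.
  public TokenStream reusableTokenStream(String fieldName, Reader reader)
      throws IOException {
    Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
    if (tokenizer == null) {
      tokenizer = new LowerCaseTokenizer(reader);
      setPreviousTokenStream(tokenizer);  // stored per thread by the base class
    } else {
      tokenizer.reset(reader);
    }
    return tokenizer;
  }
}
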
On Mon, Aug 10, 2009 at 10:48 AM, Mark Miller<ma...@gmail.com> wrote:
> Robert Muir wrote:
>>
>> This is real and not just for very short docs.
>
> Yes, you still pay the cost for longer docs, but it becomes less important
> the longer the docs are, as it plays a smaller role. Load a ton of one-term
> docs and it might be 50-60% slower; add a bunch of articles and it might be
> closer to 15-20% (I don't know the exact numbers, but the longer I made the
> docs, the smaller the percentage slowdown, obviously). Still a good hit, but
> a short-doc test magnifies the problem.
>
> It affects things no matter what, but when you don't do much tokenizing or
> normalizing, the cost of the reflection/tokenstream init dominates.
>
> - Mark
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: indexing_slowdown_with_latest_lucene_udpate

Posted by Mark Miller <ma...@gmail.com>.
Robert Muir wrote:
> This is real and not just for very short docs. 
Yes, you still pay the cost for longer docs, but it becomes less important the
longer the docs are, as it plays a smaller role. Load a ton of one-term docs
and it might be 50-60% slower; add a bunch of articles and it might be closer
to 15-20% (I don't know the exact numbers, but the longer I made the docs, the
smaller the percentage slowdown, obviously). Still a good hit, but a short-doc
test magnifies the problem.

It affects things no matter what, but when you don't do much tokenizing or
normalizing, the cost of the reflection/tokenstream init dominates.

- Mark



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: indexing_slowdown_with_latest_lucene_udpate

Posted by Robert Muir <rc...@gmail.com>.
This is real and not just for very short docs. The reflection overhead is
pretty expensive, I think.
Here are some stats from the Hamshahri corpus (I have been TREC-testing
Persian just to ensure everything is OK):

SimpleAnalyzer: (has reusableTokenStream)
Total time: 47816 ms
Unique tokens: 441660

PersianAnalyzer (no reuse):
Total time: 53928 ms
Unique tokens: 438286

PersianAnalyzer (with reusableTokenStream)
Total time: 47704 ms
Unique tokens: 438286

On Mon, Aug 10, 2009 at 10:35 AM, Mark Miller<ma...@gmail.com> wrote:
> Discussion on speed of new TokenStream API in Solr.
>
> see:
> http://search.lucidimagination.com/search/document/d0040ebe6addad4b/indexing_slowdown_with_latest_lucene_udpate
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org