Posted to dev@lucene.apache.org by "Richard Marr (JIRA)" <ji...@apache.org> on 2009/06/13 07:37:07 UTC

[jira] Created: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Morelikethis queries are very slow compared to other search types
-----------------------------------------------------------------

                 Key: LUCENE-1690
                 URL: https://issues.apache.org/jira/browse/LUCENE-1690
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/*
    Affects Versions: 2.4.1
            Reporter: Richard Marr
            Priority: Minor


The MoreLikeThis object performs term frequency lookups for every query.  From my testing that's what seems to take up the majority of time for MoreLikeThis searches.  

For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them.

I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10.
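
For illustration, a minimal sketch of what such an opt-in cache could look like. This is not the attached patch: DocFreqSource, CachingDocFreqSource, setCacheEnabled and clearCache are hypothetical names, and DocFreqSource merely stands in for whatever index lookup MoreLikeThis performs per term. It assumes Java 8+ and single-threaded use.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the per-term index lookup done by MoreLikeThis. */
interface DocFreqSource {
    int docFreq(String field, String text);
}

/** Opt-in cache of document frequencies; it lives exactly as long as this object. */
class CachingDocFreqSource implements DocFreqSource {
    private final DocFreqSource delegate;
    private final Map<String, Integer> cache = new HashMap<>();
    private boolean enabled; // off by default, so callers that do not opt in see no change

    CachingDocFreqSource(DocFreqSource delegate) {
        this.delegate = delegate;
    }

    void setCacheEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    /** Lets the application decide how long cached statistics stay valid. */
    void clearCache() {
        cache.clear();
    }

    @Override
    public int docFreq(String field, String text) {
        if (!enabled) {
            return delegate.docFreq(field, text);
        }
        String key = field + '\0' + text; // NUL separator keeps field and text distinct
        Integer cached = cache.get(key);
        if (cached == null) {
            cached = delegate.docFreq(field, text); // miss: ask the index once
            cache.put(key, cached);
        }
        return cached;
    }
}
```

The map here is unbounded and not thread-safe, so it only suits a small vocabulary and a MoreLikeThis object that is not shared between threads.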

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by Richard Marr <ri...@gmail.com>.
2009/7/30 Michael McCandless <lu...@mikemccandless.com>:
> Good question...

Good answer. Thanks.

I guess the next step then is to understand why the TermInfo cache
isn't getting the performance to where it could be. It'll take me a
while to get to the point where I can answer that question. If
anyone's in a hurry it'd probably be worth someone looking at it.

Rich



Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by Michael Busch <bu...@gmail.com>.
On 7/30/09 4:10 AM, Michael McCandless wrote:
> Plus, the original motivation for this (LUCENE-1195) was because
> queries in general look up the same term at least 2 times during their
> execution (weight (idf computation), get postings), and so I think we
> wanted to ensure that a single thread doing its query would not see
> its terms evicted (due to many other threads coming through) by the
> 2nd time it needed to use them.  But if we made the central cache
> "large enough", perhaps growing if it detects many threads, then this
> (other threads evicted my entries before I finished my query)
> shouldn't be a problem in practice.
>
> Mike

Yes this was part of the motivation. Especially wildcard or range 
queries could wipe out the entire cache before another thread does its 
second term lookup.

If we had a lock-less cache then I agree simply making it larger would 
probably be better than having separate caches per thread.
Also we should probably optimize the most common cases... if in rare 
situations certain queries wipe out the cache it might not be such a big 
deal.

  Michael



Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Jul 30, 2009 at 6:28 AM, Richard Marr<ri...@gmail.com> wrote:
> Yeah, having this stuff stored centrally behind the IndexReader seems
> like a better idea than having it in client classes. My shallow
> knowledge of the code isn't helping me explain why it's not performing
> though.
>
> Out of interest, how come it's a per-thread cache? I don't understand
> all the issues involved but that surprised me.

Good question... making it thread private seems rather wasteful since
at heart this information (Term -> TermInfo) is constant across
threads and so we're wasting RAM.

Also, it's a non-trivial amount of RAM that we're tying up once the
cache is full: 1024 times maybe ~120 bytes per TermInfo on a 64-bit JRE
= ~120 KB, and it's somewhat devilish/unexpected ("principle of least
surprise") for Lucene to "do this" to any threads that come through
it.
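
To make the arithmetic above concrete, here is a small sketch. The 1024 entries and ~120 bytes per entry are the figures from this message; the class name, the method name and any thread count you plug in are hypothetical.

```java
/** Back-of-the-envelope sizing for thread-private term caches. */
class TermCacheFootprint {
    /** Total bytes tied up when every thread holds its own full cache. */
    static long totalBytes(int entriesPerCache, int bytesPerEntry, int threads) {
        // Widen to long before multiplying so large thread counts cannot overflow.
        return (long) entriesPerCache * bytesPerEntry * threads;
    }
}
```

totalBytes(1024, 120, 1) gives 122880 bytes, i.e. 120 KB for one thread. With, say, 200 threads passing through, the same call gives 24576000 bytes, roughly 23 MB, which is why a thread-private cache can add up.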

I think one reason was to avoid having to synchronize on the lookups,
though with magic similar to LUCENE-1607 we could presumably make it
lockless.

Plus, the original motivation for this (LUCENE-1195) was because
queries in general look up the same term at least 2 times during their
execution (weight (idf computation), get postings), and so I think we
wanted to ensure that a single thread doing its query would not see
its terms evicted (due to many other threads coming through) by the
2nd time it needed to use them.  But if we made the central cache
"large enough", perhaps growing if it detects many threads, then this
(other threads evicted my entries before I finished my query)
shouldn't be a problem in practice.

Mike



Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by Richard Marr <ri...@gmail.com>.
Yeah, having this stuff stored centrally behind the IndexReader seems
like a better idea than having it in client classes. My shallow
knowledge of the code isn't helping me explain why it's not performing
though.

Out of interest, how come it's a per-thread cache? I don't understand
all the issues involved but that surprised me.









[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Richard Marr (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Marr updated LUCENE-1690:
---------------------------------

    Attachment: LruCache.patch

Attached is a draft of an implementation that uses a WeakHashMap to bind the cache to the IndexReader instance, and a LinkedHashMap to provide LRU functionality.
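
The attached draft is not reproduced in this message. As a rough illustration of the approach only, the sketch below binds a small LRU map to each reader through a WeakHashMap. PerReaderTermCache and its methods are hypothetical names, a plain Object stands in for the IndexReader, and the code assumes the reader type does not override equals/hashCode, so keys are effectively compared by identity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** One LRU map of term statistics per reader, dropped when the reader is collected. */
class PerReaderTermCache {
    private final int maxEntriesPerReader;

    // Weak keys: once nothing else references a reader, its whole cache becomes collectable.
    private final Map<Object, Map<String, Integer>> byReader = new WeakHashMap<>();

    PerReaderTermCache(int maxEntriesPerReader) {
        this.maxEntriesPerReader = maxEntriesPerReader;
    }

    private Map<String, Integer> cacheFor(Object reader) {
        Map<String, Integer> lru = byReader.get(reader);
        if (lru == null) {
            // accessOrder=true makes iteration order least-recently-used first.
            lru = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
                private static final long serialVersionUID = 1L;

                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                    return size() > maxEntriesPerReader;
                }
            };
            byReader.put(reader, lru);
        }
        return lru;
    }

    /** Returns the cached value, or null on a miss. */
    synchronized Integer get(Object reader, String term) {
        return cacheFor(reader).get(term);
    }

    synchronized void put(Object reader, String term, int docFreq) {
        cacheFor(reader).put(term, docFreq);
    }
}
```

The cached values must never hold a strong reference back to the reader, otherwise the weak key can never be cleared.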

Disclaimer: I'm not fluent in Java or OSS contribution so there may be holes or bad style in this implementation. I also need to check it meets the project coding standards.

Anybody up for giving me some feedback in the meantime?



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Richard Marr (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736525#action_12736525 ] 

Richard Marr commented on LUCENE-1690:
--------------------------------------

There's also another problem I've just noticed. Please ignore the latest patch.



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737059#action_12737059 ] 

Michael McCandless commented on LUCENE-1690:
--------------------------------------------

OK now I feel silly -- this cache is in fact very similar to the caching that Lucene already does, internally!  Sorry I didn't catch this overlap sooner.

In oal.index.TermInfosReader.java there's an LRU cache, default size 1024, that holds recently retrieved terms and their TermInfo.  It uses oal.util.cache.SimpleLRUCache.

There are some important differences from this new cache in MLT.  EG, it holds the entire TermInfo, not just the docFreq.  Plus, it's a central cache for any & all term lookups that go through the SegmentReader.  Also, it's stored in thread-private storage, so each thread has its own cache.

But, now I'm confused: how come you are not already seeing the benefits of this cache?  You ought to see MLT queries going faster.  This core cache was first added in 2.4.x; it looks like you were testing against 2.4.1 (from the "Affects Version" on this issue).



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Carl Austin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737107#action_12737107 ] 

Carl Austin commented on LUCENE-1690:
-------------------------------------

The cache in TermInfosReader is for everything, as you say. I do a lot of stuff with terms, and those terms will get pushed out of this LRU cache very quickly.
I have a separate cache on my version of the MLT. This has the advantage of those terms only being pushed out by other MLT queries, and not by everything else I am doing that is not MLT related. 
A lot of MLTs use the same terms, and I have a good size cache for it, meaning most terms I use in MLT can be retrieved from there. Seeing as MLT in my circumstance is one of the slower bits, this can give me a good advantage.



[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Richard Marr (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Marr updated LUCENE-1690:
---------------------------------

    Attachment: LUCENE-1690.patch

This is the latest version. I wasn't working on it at quite such a ridiculous hour this time so it should be better.

It includes fixed cache logic, a few comments, the LRU object applied in the right place, and some test cases demonstrating that things behave as expected. I'll do some more testing when I have a free evening.

I have some questions:

 a) org.apache.lucene.search.similar doesn't seem like the right place for a generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead?

 b) Having the cache dependent on both the MLT object and the IndexReader object seems a bit... odd. I suspect the right place for this cache is in the IndexReader, but suspect that would be a can of worms. Comments?





[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736149#action_12736149 ] 

Michael McCandless commented on LUCENE-1690:
--------------------------------------------

The getTermFrequency method looks like it'll incorrectly put 0 into the cache, when the field was in the top-level cache but the term text wasn't in the 2nd level cache?
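
The patch code is not shown in this thread, so the following is only a sketch of how a two-level lookup (field, then term text) can avoid that problem: a miss at the second level goes to the index instead of being recorded as 0. FreqLookup and TwoLevelFreqCache are hypothetical names; only getTermFrequency is taken from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the real index lookup. */
interface FreqLookup {
    int docFreq(String field, String text);
}

/** Two-level cache: field name -> (term text -> document frequency). */
class TwoLevelFreqCache {
    private final Map<String, Map<String, Integer>> byField = new HashMap<>();
    private final FreqLookup index;

    TwoLevelFreqCache(FreqLookup index) {
        this.index = index;
    }

    int getTermFrequency(String field, String text) {
        Map<String, Integer> terms = byField.get(field);
        if (terms == null) {
            // First time this field is seen: create its second-level map.
            terms = new HashMap<>();
            byField.put(field, terms);
        }
        Integer freq = terms.get(text);
        if (freq == null) {
            // Field known but term unknown: this is still a miss, so ask the
            // index rather than caching a default of 0.
            freq = index.docFreq(field, text);
            terms.put(text, freq);
        }
        return freq;
    }
}
```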



[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Richard Marr (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Marr updated LUCENE-1690:
---------------------------------

    Attachment: LUCENE-1690.patch

This patch implements a basic hashmap term frequency cache. It shouldn't affect any applications that don't opt-in to using it, and applications that do should see an order of magnitude performance improvement for MLT queries.

This cache implementation is tied to the MLT object but can be cleared on demand.



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Carl Austin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733234#action_12733234 ] 

Carl Austin commented on LUCENE-1690:
-------------------------------------

The cache used for this is a HashMap, and it is unbounded.  Perhaps this should be an LRU cache with a settable maximum number of entries, to stop it growing forever if you do a lot of MoreLikeThis queries on large indexes with many unique terms.
Otherwise a nice addition; it has sped up my MoreLikeThis queries a bit.
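
As a sketch of the kind of bounded cache meant here (not code from any patch on this issue), java.util.LinkedHashMap in access order can provide LRU eviction with a maximum that can be changed at runtime. BoundedLruCache and setMaxEntries are hypothetical names.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** LRU map with a settable maximum number of entries. Not thread-safe. */
class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, so the eldest entry is the LRU one
        this.maxEntries = maxEntries;
    }

    /** Changes the limit and immediately evicts least-recently-used entries if needed. */
    void setMaxEntries(int maxEntries) {
        this.maxEntries = maxEntries;
        Iterator<K> it = keySet().iterator(); // iterates least recently used first
        while (size() > maxEntries && it.hasNext()) {
            it.next();
            it.remove();
        }
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put(): returning true drops the least recently used entry.
        return size() > maxEntries;
    }
}
```

With a limit of 3, putting a, b, c, reading a, then putting d evicts b, since b is the entry that was touched longest ago.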



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719103#action_12719103 ] 

Michael McCandless commented on LUCENE-1690:
--------------------------------------------

This sounds good!

Could we include the IndexReader in the cache key?  Then it'd be functionally equivalent, and we could enable it by default?
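
One way to read this suggestion, sketched with hypothetical names (ReaderTermKey, ReaderScopedFreqCache) and a plain Object standing in for the IndexReader: if the reader is part of the key, a value cached for one reader can never be returned for a different or reopened reader, so cached and uncached lookups should give the same answers.

```java
import java.util.HashMap;
import java.util.Map;

/** Cache key combining a reader with a field and term text. */
final class ReaderTermKey {
    private final Object reader; // stand-in for an IndexReader instance
    private final String field;
    private final String text;

    ReaderTermKey(Object reader, String field, String text) {
        this.reader = reader;
        this.field = field;
        this.text = text;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ReaderTermKey)) return false;
        ReaderTermKey k = (ReaderTermKey) o;
        // Readers are compared by identity: a reopened reader is a different key.
        return reader == k.reader && field.equals(k.field) && text.equals(k.text);
    }

    @Override
    public int hashCode() {
        int h = System.identityHashCode(reader);
        h = 31 * h + field.hashCode();
        h = 31 * h + text.hashCode();
        return h;
    }
}

/** Term-frequency cache whose entries are scoped to the reader they came from. */
class ReaderScopedFreqCache {
    private final Map<ReaderTermKey, Integer> cache = new HashMap<>();

    /** Returns the cached value, or null on a miss. */
    Integer get(Object reader, String field, String text) {
        return cache.get(new ReaderTermKey(reader, field, text));
    }

    void put(Object reader, String field, String text, int docFreq) {
        cache.put(new ReaderTermKey(reader, field, text), docFreq);
    }
}
```

Note that a key like this holds a strong reference to the reader, so entries for closed readers would need to be cleared explicitly.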





[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Richard Marr (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719653#action_12719653 ] 

Richard Marr commented on LUCENE-1690:
--------------------------------------

Sounds reasonable although that'll take a little longer for me to do. I'll have a think about it.



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Richard Marr (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733237#action_12733237 ] 

Richard Marr commented on LUCENE-1690:
--------------------------------------

Okay, so the ideal solution is an LRU cache bound to a specific IndexReader instance. I think I can handle that.
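
For illustration only, here is a minimal sketch of what an LRU cache bound to a reader instance could look like. It is not the attached patch. The class name DocFreqCache, the type parameter R (standing in for IndexReader), the String term key, and the lookup function (standing in for a call such as reader.docFreq(term)) are all assumptions made for the example, and it uses newer Java syntax than Lucene 2.4.1 targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.ToIntFunction;

/**
 * Hypothetical sketch: a bounded LRU cache of document frequencies,
 * kept separately for each reader. R stands in for IndexReader.
 */
final class DocFreqCache<R> {
    private final int maxEntriesPerReader;

    // Weak keys: once a reader is no longer referenced elsewhere, its
    // cache can be garbage collected. WeakHashMap compares keys with
    // equals(), which is identity for classes that do not override it.
    private final Map<R, Map<String, Integer>> perReader = new WeakHashMap<>();

    DocFreqCache(int maxEntriesPerReader) {
        this.maxEntriesPerReader = maxEntriesPerReader;
    }

    /** Returns the cached value, or calls lookup once and caches the result. */
    synchronized int docFreq(R reader, String term, ToIntFunction<String> lookup) {
        Map<String, Integer> lru = perReader.computeIfAbsent(reader, r -> newLru());
        Integer cached = lru.get(term); // in access order, get() marks the entry as recently used
        if (cached != null) {
            return cached;
        }
        int df = lookup.applyAsInt(term); // the expensive term statistics lookup
        lru.put(term, df);
        return df;
    }

    // LinkedHashMap in access order evicts the least recently used entry
    // once the per-reader limit is exceeded.
    private Map<String, Integer> newLru() {
        return new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > maxEntriesPerReader;
            }
        };
    }
}
```

Keying the outer map on the reader means that statistics from a reopened index are never mixed with stale ones, and the size limit keeps memory bounded for large vocabularies. A caller would pass its reader, the term text, and a function that performs the real lookup on a cache miss.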

Carl, do you have any data on how this has changed performance in your system? My use case has a limited vocabulary, so the performance gain was large.


[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

Posted by "Carl Austin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733238#action_12733238 ] 

Carl Austin commented on LUCENE-1690:
-------------------------------------

I wasn't all that scientific, I'm afraid; I just noted that it improved performance enough once warmed up to keep on using it. Sorry.
However, after just 3 or 4 MoreLikeThis queries I am seeing a definite improvement, as the majority of the free text is standard vocabulary, and the unique terms make up only a small part of the rest of the text.

