Posted to dev@lucene.apache.org by Richard Marr <ri...@gmail.com> on 2009/04/07 08:28:28 UTC

MoreLikeThisQuery term frequency caching

Hi all,

I've been exploring MoreLikeThisQuery as part of a recent project and
something that came out of that might be useful to others here.

I found that MoreLikeThisQuery could be quite slow for my use
case, and that most of the time was spent looking up term
frequencies to calculate weightings. Since those term frequencies
usually don't need to be anywhere near real-time, I found that caching
them in a HashMap had a very good cost/benefit ratio for my
application, speeding up MLT queries by an order of magnitude.
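
A minimal sketch of the caching idea described above, not the actual
patch. TermFreqCache and its lookup function are hypothetical names
invented for illustration: the lookup stands in for whatever slow index
call produces the frequency. It assumes a modern JDK (8 or later).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntBiFunction;

/** Hypothetical helper, not part of Lucene: memoizes per-term frequency lookups. */
final class TermFreqCache {
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();
    // Stands in for the slow index lookup; arguments are (field, term).
    private final ToIntBiFunction<String, String> lookup;

    TermFreqCache(ToIntBiFunction<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached frequency, calling the slow lookup only on a miss. */
    int freq(String field, String term) {
        // Field and term together identify an entry; NUL keeps the two parts apart.
        String key = field + '\u0000' + term;
        return cache.computeIfAbsent(key, k -> lookup.applyAsInt(field, term));
    }

    /** Drops every entry, e.g. when the cached values are considered too old. */
    void clear() {
        cache.clear();
    }

    int size() {
        return cache.size();
    }
}
```

With a limited vocabulary the map stays small and nearly every lookup
after warm-up is a hit, which is consistent with the speed-up reported
here. A large vocabulary would need some bound on the map's size.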

My use case was possibly unusual in that I was looking at a limited
vocabulary rather than full English, but in theory other applications
that make use of the MLT class could benefit.

So at this point I have some questions: (1) Have others experienced
similar performance characteristics for MLT code? (2) Am I missing
some fatal flaw in this approach? (3) Are the modifications worth
sharing?

Cheers,

Rich

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: MoreLikeThisQuery term frequency caching

Posted by Richard Marr <ri...@gmail.com>.
The cache is currently stored as a static HashMap on the MLT
object and is expired at the discretion of the application code, via a
static MLT.flushCache() method. Using the cache is opt-in, via
a non-static MLT.setCache(true) and a new constructor signature
on MLTQuery that includes a useCache parameter.
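
The shape described above could look roughly like the sketch below. It
is an illustration only, not the patch itself: CachingMlt, indexLookup
and freq are hypothetical names, and only flushCache() and setCache()
come from the description. The synchronized blocks are an added
assumption, since a plain static HashMap is not safe under concurrent
use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical stand-in for the patched MLT class; only the cache plumbing is shown. */
class CachingMlt {
    // Shared by all instances, as in the description above.
    private static final Map<String, Integer> CACHE = new HashMap<>();

    private final ToIntFunction<String> indexLookup; // stands in for the real index lookup
    private boolean useCache = false;                // caching is opt-in per instance

    CachingMlt(ToIntFunction<String> indexLookup) {
        this.indexLookup = indexLookup;
    }

    /** Non-static switch: each instance decides whether to use the shared cache. */
    void setCache(boolean useCache) {
        this.useCache = useCache;
    }

    /** Static flush, called by application code when it decides entries are stale. */
    static void flushCache() {
        synchronized (CACHE) {
            CACHE.clear();
        }
    }

    int freq(String term) {
        if (!useCache) {
            return indexLookup.applyAsInt(term); // bypass the cache entirely
        }
        synchronized (CACHE) {
            Integer hit = CACHE.get(term);
            if (hit == null) {
                hit = indexLookup.applyAsInt(term);
                CACHE.put(term, hit);
            }
            return hit;
        }
    }
}
```

Because the map is static, every instance that opts in shares the same
entries, so the application would call CachingMlt.flushCache() whenever
it considers the cached values too old.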

It's not pretty but it's enough for our use case.

Feel free to suggest nicer solutions if you've got them.



2009/4/10 Grant Ingersoll <gs...@apache.org>:
> What was your approach to handling stale cache entries?  Did you flush it
> when you opened a new reader?



-- 
Richard Marr
richard.marr@gmail.com
07976 910 515



Re: MoreLikeThisQuery term frequency caching

Posted by Grant Ingersoll <gs...@apache.org>.
What was your approach to handling stale cache entries? Did you flush
it when you opened a new reader?
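
One way to tie cache lifetime to the reader, sketched under
assumptions: entries are grouped per reader instance, so a newly opened
reader starts with an empty cache and the old reader's entries become
collectable once nothing else references it. PerReaderFreqCache is a
hypothetical name, R stands in for the reader type, and this is not
something the thread settled on.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

/** Hypothetical helper: one frequency cache per reader instance. */
final class PerReaderFreqCache<R> {
    // Weak keys: a reader's entries can be garbage collected with the reader.
    // Assumes R uses identity-based equals/hashCode (the Object default).
    private final Map<R, Map<String, Integer>> byReader =
            Collections.synchronizedMap(new WeakHashMap<R, Map<String, Integer>>());

    /** Looks up a term for the given reader, calling the slow lookup only on a miss. */
    int freq(R reader, String term, ToIntFunction<String> lookup) {
        Map<String, Integer> perReader =
                byReader.computeIfAbsent(reader, r -> new ConcurrentHashMap<>());
        return perReader.computeIfAbsent(term, lookup::applyAsInt);
    }
}
```

The trade-off is that every reopen pays the warm-up cost again, whereas
an application-controlled flush can keep slightly stale values for
longer.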



Re: MoreLikeThisQuery term frequency caching

Posted by Richard Marr <ri...@gmail.com>.
Thanks Mike,

I'll leave it a few days to give people time to respond then start
looking into creating a Jira ticket and a patch.


2009/4/7 Michael McCandless <lu...@mikemccandless.com>:
> I don't have direct experience with MLT, but this sounds like a great
> improvement, so in answer to (3) I would say "definitely!".
>
> Mike





Re: MoreLikeThisQuery term frequency caching

Posted by Michael McCandless <lu...@mikemccandless.com>.
I don't have direct experience with MLT, but this sounds like a great
improvement, so in answer to (3) I would say "definitely!".

Mike
