Posted to dev@solr.apache.org by Mike Drob <md...@apache.org> on 2021/07/13 20:30:52 UTC

[DISCUSS] Solr Cache with Futures as values

Hi folks,

This is an idea based on a recent prod issue, and while we found
another workaround I think there is some merit to discuss here.

Currently our filter cache is a mapping from queries to docs, and the
result cache is similar although slightly more abstract. When we have
a lot of similar queries come in at the same time, if a particular
filter hasn't been cached yet then it will be computed a bunch of
times in parallel as each query tries to be the one to insert into the
cache.

One option that I've thought about is if instead of inserting results
into the cache directly, we pre-register a future in the cache, and
then use that as a reference to the results. Multiple queries coming
in parallel would all wait for the same result calculation instead of
allocating large arrays each.
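
A minimal sketch of the idea above, assuming a plain ConcurrentHashMap stands
in for the filter cache. FutureCache and its methods are hypothetical names
for illustration and are not part of Solr's SolrCache API. The first caller
for a key registers a future and computes the value; later callers find that
future and wait on it instead of recomputing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration only; not Solr's SolrCache API.
class FutureCache<K, V> {
  private final ConcurrentMap<K, CompletableFuture<V>> map = new ConcurrentHashMap<>();

  V get(K key, Function<K, V> loader) {
    CompletableFuture<V> existing = map.get(key);
    if (existing == null) {
      CompletableFuture<V> mine = new CompletableFuture<>();
      existing = map.putIfAbsent(key, mine); // pre-register the future
      if (existing == null) {
        // We won the race: compute outside of any map lock, then publish.
        try {
          mine.complete(loader.apply(key));
        } catch (RuntimeException e) {
          map.remove(key, mine);         // do not leave a failed entry cached
          mine.completeExceptionally(e); // wake up any waiters with the error
          throw e;
        }
        return mine.join();
      }
    }
    return existing.join(); // wait for the winner's result
  }

  // Runs the same lookup from several threads and reports how many loads ran.
  static int loadsUnderContention(int threads) {
    FutureCache<String, String> cache = new FutureCache<>();
    AtomicInteger loads = new AtomicInteger();
    CompletableFuture<?>[] tasks = new CompletableFuture<?>[threads];
    for (int i = 0; i < threads; i++) {
      tasks[i] = CompletableFuture.runAsync(() -> cache.get("fq", k -> {
        loads.incrementAndGet();
        return "docs";
      }));
    }
    CompletableFuture.allOf(tasks).join();
    return loads.get();
  }

  public static void main(String[] args) {
    System.out.println(loadsUnderContention(8));
  }
}
```

Because the loader runs outside the map's own locking, it may look up other
keys in the same cache. A loader that asks for its own key again would wait
on itself, so that case would still need separate handling.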

The benefits are pretty straightforward - we reduce the amount of
computation done when there are lots of queries coming in, and reduce
the memory allocation pressure.

The complexity might be around handling errors or query timeouts or
cancellations. Or evictions, but I think that would all be manageable.
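
One way the timeout case might be handled, sketched under the assumption
that cached values are CompletableFutures shared by several waiting queries.
TimedWait and awaitOrNull are hypothetical names. The point of the sketch is
that a query which runs out of time stops waiting without cancelling the
shared future, so other queries waiting on the same entry are unaffected.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Solr.
class TimedWait {
  // Waits up to timeoutMillis for a shared result; returns null on timeout.
  static <V> V awaitOrNull(CompletableFuture<V> shared, long timeoutMillis) {
    try {
      return shared.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Only this waiter gives up. The shared future is deliberately not
      // cancelled, because other queries may still be waiting on it.
      return null;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return null;
    } catch (ExecutionException e) {
      // The computation itself failed; surface the original cause.
      throw new IllegalStateException(e.getCause());
    }
  }
}
```

A caller that gets null back could fall back to computing the filter itself
or fail the request with a timeout; which is preferable is a design choice
left open here.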

What do other folks think? Should I write up a SIP for this, since I
think it will be fairly complex, or are there existing solutions that
I should look into first?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
For additional commands, e-mail: dev-help@solr.apache.org


Re: [DISCUSS] Solr Cache with Futures as values

Posted by David Smiley <ds...@apache.org>.
I meant an attempt to add a filter cache entry by executing a query that
may in turn want to do this very thing.  See FilterQuery, which is produced
by our standard query parser, e.g. +foo +filter(bar), where "bar" is a
subquery, parsed and cached into FilterQuery.
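
A toy sketch of the recursive case, assuming string keys in place of Query
objects and a made-up one-term index. The class and method names are
hypothetical and do not mirror SolrIndexSearcher. Computing the entry for
filter(bar) has to look up (and possibly insert) the entry for bar in the
same cache, which is why the lookup uses get followed by putIfAbsent rather
than computeIfAbsent: the ConcurrentHashMap documentation says the mapping
function must not attempt to update other mappings of the map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model only: names are hypothetical, not Solr's API.
class ReentrantFilterCacheSketch {
  final ConcurrentMap<String, Set<Integer>> cache = new ConcurrentHashMap<>();
  private final Map<String, Set<Integer>> index = new HashMap<>();

  ReentrantFilterCacheSketch() {
    index.put("bar", new TreeSet<Integer>(Arrays.asList(1, 3, 5)));
  }

  Set<Integer> getDocs(String query) {
    Set<Integer> cached = cache.get(query);
    if (cached != null) {
      return cached;
    }
    // Computed outside of any map operation, so it may safely re-enter getDocs.
    Set<Integer> computed = compute(query);
    Set<Integer> raced = cache.putIfAbsent(query, computed);
    return raced != null ? raced : computed;
  }

  private Set<Integer> compute(String query) {
    if (query.startsWith("filter(") && query.endsWith(")")) {
      String inner = query.substring("filter(".length(), query.length() - 1);
      return getDocs(inner); // recursive: produces another cache entry
    }
    Set<Integer> docs = index.get(query);
    return docs != null ? docs : new TreeSet<Integer>();
  }
}
```

The trade-off of this pattern is the one discussed in this thread: two
threads that miss at the same time will both compute the value, and only
one of the results is kept.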

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Jul 14, 2021 at 7:18 PM Mike Drob <md...@apache.org> wrote:

> Thanks for the pointer, David!
>
> While browsing through that issue, I found this comment left by you
> from SOLR-14166
>
>
> https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1185
>
> "note: can't use computeIfAbsent because can be recursive"
>
> I don't quite understand what the recursive case means here, can you
> elaborate on that? I didn't see any discussion of it in the JIRA
> either.
>
> Thanks,
> Mike

Re: [DISCUSS] Solr Cache with Futures as values

Posted by Mike Drob <md...@apache.org>.
> It also has a bulk load API

We currently use this when we need to resize a cache because those
will have the same results but maybe a smaller number. I don't think
we can do it when starting a new cache that needs warming, since the
bulk putAll takes both keys and values rather than keys and a
computational unit. If we move to futures as values then something
like this becomes more possible, I think.

Unrelated, I really struggle figuring out how to test this in a
reproducible fashion. We'd need a filter query that takes a long time
to execute, or even an injectable latch to stall all of the queries
that we can release from the test code. Will fiddle with this some
more.
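
A sketch of what the injectable-latch test might look like, assuming the
cache holds futures as proposed earlier in the thread. Everything here is
hypothetical test scaffolding, not existing Solr test code. One latch stalls
the "slow" computation until the test releases it, and a second latch lets
the test wait until every thread has reached the cache, so the outcome does
not depend on timing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical test scaffolding for illustration only.
class LatchStallSketch {
  static int countComputations(int threads) {
    ConcurrentMap<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();
    CountDownLatch release = new CountDownLatch(1);       // injected stall
    CountDownLatch arrived = new CountDownLatch(threads); // all lookups started
    AtomicInteger computations = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      Callable<String> query = () -> {
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> existing = cache.putIfAbsent("fq", mine);
        arrived.countDown();
        if (existing != null) {
          return existing.get(); // wait for the thread that registered first
        }
        release.await();         // the "slow filter" is held here by the test
        computations.incrementAndGet();
        mine.complete("docs");
        return "docs";
      };
      List<Future<String>> results = new ArrayList<>();
      for (int i = 0; i < threads; i++) {
        results.add(pool.submit(query));
      }
      arrived.await();     // every query has now hit the cache
      release.countDown(); // let the single computation finish
      for (Future<String> r : results) {
        if (!"docs".equals(r.get())) {
          throw new IllegalStateException("unexpected result");
        }
      }
      return computations.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(countComputations(8));
  }
}
```

In a real test the latch would have to be reachable from the query being
executed, for example through a test-only query implementation; how to
inject it into Solr is not shown here.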

On Wed, Jul 14, 2021 at 5:46 PM Mark Miller <ma...@gmail.com> wrote:
>
> If Caffeine is being used, it might be worthwhile to look into using its feature set to do this.
>
> It has the ability to do either async or sync loading - if using sync, modifications will block while an entry is loading.
>
> It also has a bulk load API, which might be interesting for things like auto warming.
>
> - MRM


Re: [DISCUSS] Solr Cache with Futures as values

Posted by Mark Miller <ma...@gmail.com>.
If Caffeine is being used, it might be worthwhile to look into using its
feature set to do this.

It has the ability to do either async or sync loading - if using sync,
modifications will block while an entry is loading.

It also has a bulk load API, which might be interesting for things like
auto warming.

- MRM

On Wed, Jul 14, 2021 at 6:18 PM Mike Drob <md...@apache.org> wrote:

> Thanks for the pointer, David!
>
> While browsing through that issue, I found this comment left by you
> from SOLR-14166
>
>
> https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1185
>
> "note: can't use computeIfAbsent because can be recursive"
>
> I don't quite understand what the recursive case means here, can you
> elaborate on that? I didn't see any discussion of it in the JIRA
> either.
>
> Thanks,
> Mike
--
- Mark

http://about.me/markrmiller

Re: [DISCUSS] Solr Cache with Futures as values

Posted by Mike Drob <md...@apache.org>.
Thanks for the pointer, David!

While browsing through that issue, I found this comment left by you
from SOLR-14166

https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1185

"note: can't use computeIfAbsent because can be recursive"

I don't quite understand what the recursive case means here, can you
elaborate on that? I didn't see any discussion of it in the JIRA
either.

Thanks,
Mike

On Wed, Jul 14, 2021 at 11:03 AM David Smiley <ds...@apache.org> wrote:
>
> Ideally, we could use SolrCache.computeIfAbsent [1] for the filter cache, as is used for some of the other caches.  The best SolrCache is CaffeineCache which works atomically for the same key (just as does ConcurrentHashMap).  The problem is that this method on CaffeineCache does not support computing a cache entry that is reentrant, i.e. that which can produce another cache entry when it is computed.  Really, that limitation ought to be elevated to the docs on SolrCache.computeIfAbsent.  Andrzej discovered [1] that some queries could do that, and so he did not update Solr's use of the filter cache to call it.  Please read the thread there and maybe comment further to get the attention of pertinent people.
>
> [1]: https://issues.apache.org/jira/browse/SOLR-13898
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley


Re: [DISCUSS] Solr Cache with Futures as values

Posted by David Smiley <ds...@apache.org>.
Ideally, we could use SolrCache.computeIfAbsent [1] for the filter cache,
as is used for some of the other caches.  The best SolrCache is
CaffeineCache, which works atomically for the same key (just as
ConcurrentHashMap does).  The problem is that this method on CaffeineCache
does not support a reentrant computation, i.e. one that can itself produce
another cache entry while it runs.  Really, that limitation
ought to be elevated to the docs on SolrCache.computeIfAbsent.  Andrzej
discovered [1] that some queries could do that, and so he did not update
Solr's use of the filter cache to call it.  Please read the thread there
and maybe comment further to get the attention of pertinent people.
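
For the non-reentrant case, the per-key atomicity mentioned above can be
sketched with ConcurrentHashMap.computeIfAbsent from the JDK. This is an
illustration of the general behavior, not of SolrCache or CaffeineCache
themselves, and the class name is hypothetical. The mapping function here
touches no other cache entry, which is the condition that the recursive
filter queries violate.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration using the JDK map only; not Solr's SolrCache.
class ComputeIfAbsentSketch {
  // Looks up one key from several threads and reports how many loads ran.
  static int loadsUnderContention(int threads) {
    ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    AtomicInteger loads = new AtomicInteger();
    CompletableFuture<?>[] tasks = new CompletableFuture<?>[threads];
    for (int i = 0; i < threads; i++) {
      tasks[i] = CompletableFuture.runAsync(() ->
          // Atomic per key: the function is applied at most once for "fq".
          cache.computeIfAbsent("fq", k -> {
            loads.incrementAndGet();
            return "docs";
          }));
    }
    CompletableFuture.allOf(tasks).join();
    return loads.get();
  }

  public static void main(String[] args) {
    System.out.println(loadsUnderContention(8));
  }
}
```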

[1]: https://issues.apache.org/jira/browse/SOLR-13898

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: [DISCUSS] Solr Cache with Futures as values

Posted by Shawn Heisey <el...@elyograg.org>.
On 7/13/2021 2:30 PM, Mike Drob wrote:
> One option that I've thought about is if instead of inserting results 
> into the cache directly, we pre-register a future in the cache, and 
> then use that as a reference to the results. Multiple queries coming 
> in parallel would all wait for the same result calculation instead of 
> allocating large arrays each. 

That's a really interesting idea.  Sounds like a very good optimization.

As always the devil is in the details ... properly handling all the 
possible corner cases.  I worry that it's going to be a harder problem 
to solve than we think it is ... but I am not going to let that stand in 
the way of making the attempt.

When I first dreamed up LFUCache, I didn't think I was going to be able 
to write it.  But I decided to make the attempt anyway, and I think it 
was Hoss that committed it -- this was before I was asked to join the 
project.  Even though it's been obsoleted by the Caffeine version, I 
really enjoyed working on it.

Thanks,
Shawn

