Posted to solr-user@lucene.apache.org by Aristedes Maniatis <am...@apache.org> on 2016/11/21 16:08:48 UTC

Solr as an html cache

After 7-8 years of using Solr, I'm familiar enough with how it performs as a full text search index, including spatial coordinates and much more. But for the most part, we've been returning database ids from Solr rather than a full record ready to display. We then grab the data and related records from the database in the usual way and display it.

We are now thinking about improving the performance of our app. One option is Redis to store html fragments for reuse, rather than assembling the html from dozens of queries to the database. We've done what we can with caching at the ORM level, and we can't do much with Varnish because of differences in page rendering per user (e.g. shopping baskets).

But we are thinking about storing the rendered html directly in Solr. The downsides appear to be:

* adding 2-10kB of html to each record and the performance hit this might have on searching and retrieving
* additional load of ensuring we rebuild Solr's data every time some part of that html changes (but this is minimal in our use case)
* additional cores that we'll want to add to cache other data that isn't yet in Solr

Is this a reasonable approach to avoid running yet another cluster of services? Are there downsides to this I haven't thought of? How does Solr scale with record size?



Cheers
Ari




-- 
-------------------------->
Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A

Re: Solr as an html cache

Posted by Erick Erickson <er...@gmail.com>.
bq: This seems like it might even be a good approach for creating
additional cores primarily for the purpose of caching

I think you're making it too complex, especially for such a small data set ;)

1> All the data is memory mapped anyway, so what's not in the JVM heap
will end up in the OS's page cache eventually (assuming you have enough
physical memory). See:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If you don't have enough physical memory for that to happen, adding
another core won't help.

2> You can set your documentCache in solrconfig.xml high enough that
it caches all your documents _uncompressed_, memory permitting. That's
a two-minute change to your solrconfig.xml file; see the sketch below.

3> My challenge is always to measure before you code. My intuition is
that if you quantify the potential gains of going to more complex
caching, they'll be insignificant and not worth the development time.
Can't argue with measurements, though.
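
A minimal sketch of what that documentCache entry could look like in
solrconfig.xml. The sizes here are illustrative guesses for a ~50K-doc
corpus, not recommendations; tune them against your heap:

    <documentCache class="solr.LRUCache"
                   size="60000"
                   initialSize="1024"
                   autowarmCount="0"/>

Note the documentCache can't be usefully autowarmed (internal Lucene
doc ids change between searchers), so autowarmCount stays at 0.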

FWIW,
Erick

On Mon, Nov 21, 2016 at 11:56 PM, Aristedes Maniatis <ar...@maniatis.org> wrote:
> Thanks Erick
>
> Very helpful indeed.
>
> Your guesses on data size are about right. There might only be 50,000 items in the whole index. And typically we'd fetch a batch of 10. Disk is cheap and this really isn't taking much room anyway. For such a tiny data set, it seems like this approach will work well.
>
>
> This seems like it might even be a good approach for creating additional cores primarily for the purpose of caching: that is, a core full of records that are only ever queried by some unique key. I wouldn't want to abuse Solr for a purpose it wasn't designed for, but since it is already there it appears to be a useful approach. Rather than getting some data from the db, we fetch it from Solr pre-assembled.
>
> Thanks
> Ari
>
>
>
> On 22/11/16 3:28am, Erick Erickson wrote:
>> Searching isn't really going to be impacted much, if at all. You're
>> essentially talking about setting some field with stored="true" and
>> stuffing the HTML into that, right? It will probably have indexed="false"
>> and docValues="false".
>>
>> So... what that means is that very early in the indexing process, the
>> raw data is dumped to the segment's *.fdt and *.fdx files. These
>> are totally irrelevant for querying; they aren't even read from disk to score
>> the docs. So let's say your numFound = 10,000 and rows=10. Those 10,000
>> docs are scored without having to look at the stored data at all. Now, when
>> the 10 docs are assembled for return, the stored data is read off disk,
>> decompressed, and returned.
>>
>> So the additional cost will be
>> 1> your index is larger on disk
>> 2> merging etc. will be a bit more costly. This doesn't
>>      seem like a problem if your index doesn't change all
>>      that often.
>> 3> there will be some additional load to decompress the data
>>      and return it.
>>
>> This is a perfectly reasonable approach; my guess is that any difference
>> in search speed will be lost in the noise of measurement, and the
>> additional load of decompressing will be more than offset by not having
>> to make a separate service call to actually get the doc. But as always,
>> measuring the performance is the proof you need.
>>
>> You haven't indicated how _many_ docs you have in your corpus, but a
>> rough indication of the additional disk space is about half the raw HTML
>> size; we've usually seen about a 2:1 compression ratio. With a zillion
>> docs that could be sizeable, but disk space is cheap.
>>
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
>> <am...@apache.org> wrote:
>>> After 7-8 years of using Solr, I'm familiar enough with how it performs as a full text search index, including spatial coordinates and much more. But for the most part, we've been returning database ids from Solr rather than a full record ready to display. We then grab the data and related records from the database in the usual way and display it.
>>>
>>> We are now thinking about improving the performance of our app. One option is Redis to store html fragments for reuse, rather than assembling the html from dozens of queries to the database. We've done what we can with caching at the ORM level, and we can't do much with Varnish because of differences in page rendering per user (e.g. shopping baskets).
>>>
>>> But we are thinking about storing the rendered html directly in Solr. The downsides appear to be:
>>>
>>> * adding 2-10kB of html to each record and the performance hit this might have on searching and retrieving
>>> * additional load of ensuring we rebuild Solr's data every time some part of that html changes (but this is minimal in our use case)
>>> * additional cores that we'll want to add to cache other data that isn't yet in Solr
>>>
>>> Is this a reasonable approach to avoid running yet another cluster of services? Are there downsides to this I haven't thought of? How does Solr scale with record size?
>>>
>>>
>>>
>>> Cheers
>>> Ari
>>>
>>>
>>>
>>>
>>> --
>>> -------------------------->
>>> Aristedes Maniatis
>>> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
>
>
> --
> -------------------------->
> Aristedes Maniatis
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A

Re: Solr as an html cache

Posted by Aristedes Maniatis <ar...@maniatis.org>.
Thanks Erick

Very helpful indeed. 

Your guesses on data size are about right. There might only be 50,000 items in the whole index. And typically we'd fetch a batch of 10. Disk is cheap and this really isn't taking much room anyway. For such a tiny data set, it seems like this approach will work well.


This seems like it might even be a good approach for creating additional cores primarily for the purpose of caching: that is, a core full of records that are only ever queried by some unique key. I wouldn't want to abuse Solr for a purpose it wasn't designed for, but since it is already there it appears to be a useful approach. Rather than getting some data from the db, we fetch it from Solr pre-assembled.
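
For that lookup-by-key pattern, here's a rough sketch of what the fetch
might look like (the core name, field name and id are all placeholders,
and I gather the /get real-time get handler needs the updateLog enabled
in solrconfig.xml):

    curl 'http://localhost:8983/solr/html_cache/get?id=page-12345&fl=rendered_html'

Real-time get returns a single document by its uniqueKey without running
a full search, which seems like a good fit for a cache-style lookup.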

Thanks
Ari



On 22/11/16 3:28am, Erick Erickson wrote:
> Searching isn't really going to be impacted much, if at all. You're
> essentially talking about setting some field with stored="true" and
> stuffing the HTML into that, right? It will probably have indexed="false"
> and docValues="false".
> 
> So... what that means is that very early in the indexing process, the
> raw data is dumped to the segment's *.fdt and *.fdx files. These
> are totally irrelevant for querying; they aren't even read from disk to score
> the docs. So let's say your numFound = 10,000 and rows=10. Those 10,000
> docs are scored without having to look at the stored data at all. Now, when
> the 10 docs are assembled for return, the stored data is read off disk,
> decompressed, and returned.
> 
> So the additional cost will be
> 1> your index is larger on disk
> 2> merging etc. will be a bit more costly. This doesn't
>      seem like a problem if your index doesn't change all
>      that often.
> 3> there will be some additional load to decompress the data
>      and return it.
> 
> This is a perfectly reasonable approach; my guess is that any difference
> in search speed will be lost in the noise of measurement, and the
> additional load of decompressing will be more than offset by not having
> to make a separate service call to actually get the doc. But as always,
> measuring the performance is the proof you need.
> 
> You haven't indicated how _many_ docs you have in your corpus, but a
> rough indication of the additional disk space is about half the raw HTML
> size; we've usually seen about a 2:1 compression ratio. With a zillion
> docs that could be sizeable, but disk space is cheap.
> 
> 
> Best,
> Erick
> 
> On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
> <am...@apache.org> wrote:
>> After 7-8 years of using Solr, I'm familiar enough with how it performs as a full text search index, including spatial coordinates and much more. But for the most part, we've been returning database ids from Solr rather than a full record ready to display. We then grab the data and related records from the database in the usual way and display it.
>>
>> We are now thinking about improving the performance of our app. One option is Redis to store html fragments for reuse, rather than assembling the html from dozens of queries to the database. We've done what we can with caching at the ORM level, and we can't do much with Varnish because of differences in page rendering per user (e.g. shopping baskets).
>>
>> But we are thinking about storing the rendered html directly in Solr. The downsides appear to be:
>>
>> * adding 2-10kB of html to each record and the performance hit this might have on searching and retrieving
>> * additional load of ensuring we rebuild Solr's data every time some part of that html changes (but this is minimal in our use case)
>> * additional cores that we'll want to add to cache other data that isn't yet in Solr
>>
>> Is this a reasonable approach to avoid running yet another cluster of services? Are there downsides to this I haven't thought of? How does Solr scale with record size?
>>
>>
>>
>> Cheers
>> Ari
>>
>>
>>
>>
>> --
>> -------------------------->
>> Aristedes Maniatis
>> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A


-- 
-------------------------->
Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A

Re: Solr as an html cache

Posted by Erick Erickson <er...@gmail.com>.
Searching isn't really going to be impacted much, if at all. You're
essentially talking about setting some field with stored="true" and
stuffing the HTML into that, right? It will probably have indexed="false"
and docValues="false".
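
For instance, a sketch of such a field in the schema (the field name
rendered_html is just a placeholder):

    <field name="rendered_html" type="string"
           indexed="false" stored="true" docValues="false"/>

Since it's stored-only, Solr does no indexing work on it; the value just
gets compressed into the stored-field (*.fdt) files.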

So... what that means is that very early in the indexing process, the
raw data is dumped to the segment's *.fdt and *.fdx files. These
are totally irrelevant for querying; they aren't even read from disk to score
the docs. So let's say your numFound = 10,000 and rows=10. Those 10,000
docs are scored without having to look at the stored data at all. Now, when
the 10 docs are assembled for return, the stored data is read off disk,
decompressed, and returned.
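
So a request like this (core, field and query are all made-up
placeholders):

    http://localhost:8983/solr/mycore/select?q=category:widgets&rows=10&fl=id,rendered_html

scores the full result set first, and only the 10 docs actually returned
ever touch the stored-field files.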

So the additional cost will be
1> your index is larger on disk
2> merging etc. will be a bit more costly. This doesn't
     seem like a problem if your index doesn't change all
     that often.
3> there will be some additional load to decompress the data
     and return it.

This is a perfectly reasonable approach; my guess is that any difference
in search speed will be lost in the noise of measurement, and the
additional load of decompressing will be more than offset by not having
to make a separate service call to actually get the doc. But as always,
measuring the performance is the proof you need.

You haven't indicated how _many_ docs you have in your corpus, but a
rough indication of the additional disk space is about half the raw HTML
size; we've usually seen about a 2:1 compression ratio. With a zillion
docs that could be sizeable, but disk space is cheap.
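
(As a purely hypothetical worked example: 100,000 docs averaging 5kB of
HTML would be ~500MB of raw markup, so roughly 250MB of additional
stored-field data at that 2:1 ratio.)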


Best,
Erick

On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
<am...@apache.org> wrote:
> After 7-8 years of using Solr, I'm familiar enough with how it performs as a full text search index, including spatial coordinates and much more. But for the most part, we've been returning database ids from Solr rather than a full record ready to display. We then grab the data and related records from the database in the usual way and display it.
>
> We are now thinking about improving the performance of our app. One option is Redis to store html fragments for reuse, rather than assembling the html from dozens of queries to the database. We've done what we can with caching at the ORM level, and we can't do much with Varnish because of differences in page rendering per user (e.g. shopping baskets).
>
> But we are thinking about storing the rendered html directly in Solr. The downsides appear to be:
>
> * adding 2-10kB of html to each record and the performance hit this might have on searching and retrieving
> * additional load of ensuring we rebuild Solr's data every time some part of that html changes (but this is minimal in our use case)
> * additional cores that we'll want to add to cache other data that isn't yet in Solr
>
> Is this a reasonable approach to avoid running yet another cluster of services? Are there downsides to this I haven't thought of? How does Solr scale with record size?
>
>
>
> Cheers
> Ari
>
>
>
>
> --
> -------------------------->
> Aristedes Maniatis
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A