Posted to solr-user@lucene.apache.org by Sebastian Klemke <se...@researchgate.net> on 2017/08/10 13:57:22 UTC

Solr LTR with high rerankDocs

Hi,

we're currently experimenting with LTR reranking on large rerank
windows (rerankDocs=1000+). On a SolrCloud collection with more than
500M documents, we were only able to get sub-second response times with
FieldValueFeature. We therefore created a custom feature extractor that
matches field values against constant strings, to replace simple
SolrFeature usages. Response time now appears to be dominated by
loading stored fields, more specifically by decompressing chunks of
stored field data.
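
The custom extractor itself is not shown in this thread. As a minimal sketch of the matching idea only, the following assumes a document's field values are already available as a map; it does not use Solr's LTR Feature API, and the class name, field name, and values are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a custom LTR feature; not Solr's Feature API. */
final class ConstantMatchFeatureSketch {
    private final String field;     // field whose values are inspected
    private final String constant;  // constant string to match against

    ConstantMatchFeatureSketch(String field, String constant) {
        this.field = field;
        this.constant = constant;
    }

    /** Returns 1.0 if any value of the field equals the constant, else 0.0. */
    float score(Map<String, List<String>> docFields) {
        List<String> values = docFields.getOrDefault(field, Collections.emptyList());
        return values.contains(constant) ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        // Assumed example document with one multi-valued field.
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("type", Arrays.asList("article", "preprint"));

        ConstantMatchFeatureSketch isPreprint =
            new ConstantMatchFeatureSketch("type", "preprint");
        System.out.println(isPreprint.score(doc));
    }
}
```

The comparison itself is cheap; in a setup like this the cost sits in obtaining the field values for each of the rerankDocs hits, which is where stored field loading comes in.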

We're now wondering how many documents LTR can rerank in practice and
what the bottlenecks are. Do you guys have any experience using it?


Regards,

Sebastian


-- 
Sebastian Klemke
Senior Software Engineer
  
ResearchGate GmbH
Invalidenstr. 115, 10115 Berlin, Germany
  
www.researchgate.net
  
Registered Seat: Hannover, HR B 202837
Managing Directors: Dr Ijad Madisch, Dr Sören Hofmayer VAT-ID: DE258434568
A proud affiliate of: ResearchGate Corporation, 350 Townsend St #754, San Francisco, CA 94107


Re: Solr LTR with high rerankDocs

Posted by Sebastian Klemke <se...@researchgate.net>.
Hi

On Thu, 2017-08-10 at 08:30 -0700, Erick Erickson wrote:
> I have to confess that I know very little about the mechanics of LTR, but
> I can talk a little bit about compression.
> 
> When a stored value is retrieved for a document it is read from the
> *.fdt file which is a compressed, verbatim copy of the field. DocValues
> can bypass this stored data and read directly from the DV format.
> There's a discussion of useDocValuesAsStored in solr/CHANGES.txt.
> 
> The restriction of docValues is that they can only be used for
> primitive types, numerics, strings and the like, specifically _not_
> fields with class="solr.TextField".
> 
> WARNING: I have no real clue whether LTR is built to leverage
> docValues fields. If you add docValues="true" to the relevant
> fields you'll have to re-index completely. In fact I'd use a new
> collection.
> 
> And don't be put off by the fact that the index size on disk will grow
> if you add docValues. The data is memory-mapped by the OS rather than
> held on the JVM heap, so it will actually _reduce_ your JVM requirements.

Yes, DocValues are definitely on our list of things to test.


Regards,

Sebastian




Re: Solr LTR with high rerankDocs

Posted by Erick Erickson <er...@gmail.com>.
I have to confess that I know very little about the mechanics of LTR, but
I can talk a little bit about compression.

When a stored value is retrieved for a document, it is read from the
*.fdt file, which holds a compressed, verbatim copy of the field. DocValues
can bypass this stored data and read directly from the DV format.
There's a discussion of useDocValuesAsStored in solr/CHANGES.txt.
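
To illustrate the difference, here is a toy model, not Lucene code: a row store that compresses whole documents in chunks, next to a per-field column indexed by document id. The chunk size, the record layout, and the use of java.util.zip are assumptions made for the sketch and do not reflect Lucene's actual formats.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy model only: not Lucene's file formats, chunk sizes, or codec. */
final class StoredVsColumnSketch {
    static final int DOCS_PER_CHUNK = 100; // hypothetical chunk size

    final byte[][] chunks;  // row store: whole documents, compressed chunk by chunk
    final long[] column;    // column store: one value per document for one field
    long bytesInflated = 0; // bytes the row store had to decompress so far

    StoredVsColumnSketch(int numDocs) {
        column = new long[numDocs];
        chunks = new byte[(numDocs + DOCS_PER_CHUNK - 1) / DOCS_PER_CHUNK][];
        for (int c = 0; c < chunks.length; c++) {
            StringBuilder rows = new StringBuilder();
            int end = Math.min(numDocs, (c + 1) * DOCS_PER_CHUNK);
            for (int doc = c * DOCS_PER_CHUNK; doc < end; doc++) {
                column[doc] = doc % 7;
                rows.append("title=text of document ").append(doc)
                    .append(";category=").append(doc % 7).append('\n');
            }
            chunks[c] = deflate(rows.toString().getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Row-store read: the whole chunk is decompressed to reach one document. */
    long readFromRowStore(int doc) {
        byte[] raw = inflate(chunks[doc / DOCS_PER_CHUNK]);
        bytesInflated += raw.length;
        String row = new String(raw, StandardCharsets.UTF_8).split("\n")[doc % DOCS_PER_CHUNK];
        return Long.parseLong(row.substring(row.lastIndexOf('=') + 1));
    }

    /** Column read: a direct lookup by document id, no decompression in this model. */
    long readFromColumn(int doc) {
        return column[doc];
    }

    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StoredVsColumnSketch index = new StoredVsColumnSketch(100_000);
        // A rerank window of 1000 hits scattered across the index.
        for (int i = 0; i < 1000; i++) {
            int doc = i * 100;
            if (index.readFromRowStore(doc) != index.readFromColumn(doc)) {
                throw new AssertionError("stores disagree for doc " + doc);
            }
        }
        System.out.println("bytes decompressed for 1000 row-store reads: " + index.bytesInflated);
    }
}
```

In this model every hit in the rerank window lands in a different chunk, so the row store decompresses one full chunk per hit, including fields the feature never looks at, while the column lookup touches only the one value it needs.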

The restriction of docValues is that they can only be used for
primitive types, numerics, strings and the like, specifically _not_
fields with class="solr.TextField".

WARNING: I have no real clue whether LTR is built to leverage
docValues fields. If you add docValues="true" to the relevant
fields you'll have to re-index completely. In fact I'd use a new
collection.

And don't be put off by the fact that the index size on disk will grow
if you add docValues. The data is memory-mapped by the OS rather than
held on the JVM heap, so it will actually _reduce_ your JVM requirements.

Best,
Erick


