Posted to dev@lucene.apache.org by Toke Eskildsen <to...@kb.dk> on 2018/09/24 12:40:16 UTC

DocValues, retrieval performance and policy

The Solr 7 switch to an iterative API for Doc Values
https://issues.apache.org/jira/browse/LUCENE-7407
meant a severe performance regression for Solr export and document
retrieval with our web archive index, which is distinguished by having
quite large segments (300M docs / 900GB) and using primarily doc values
to hold field content.

Technically there is a working patch
https://issues.apache.org/jira/browse/LUCENE-8374
but during discussion of performance measurements elsewhere
https://github.com/mikemccand/luceneutil/issues/23
it came up that doc values are not intended for document retrieval and
as such that Lucene should not be optimized towards that.


From my point of view, using doc values to build retrieval documents is
quite natural: The data are there, so making a double representation by
also making them stored seems a waste of space.

If this is somehow a misuse of Doc Values, maybe someone could explain
what the problem is or direct me towards more information?

- Toke Eskildsen, Royal Danish Library


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocValues, retrieval performance and policy

Posted by Toke Eskildsen <to...@kb.dk>.

Erick Erickson <er...@gmail.com> wrote:
> I think part of it is locality. By that I mean two docValues fields in
> the same document have no relation to each other in terms of their
> location on disk. So _assuming_ all your DocValues can't be contained
> in memory, you may be doing a bunch of disk seeks.

Fair enough: Doc Values overhead scales linearly with the number of fields, whereas stored fields are more constant-ish. As you note with export, Doc Values can be faster than stored fields with a few fields, but using them for hundreds would probably be quite a lot slower.

> And maybe part of it is the notion of stuffing large text fields into
> a DocValues field just to return it seems like abusing DV.

That seems like a reasonable explanation to me. If that is what the talk of misuse is about, I can understand it. It is not a case I have any current interest in optimizing and I agree that "real" compression (as opposed to the light prefix-reuse from Doc Values) is the best choice.

> That said, the Streaming code uses DV fields exclusively and I got
> 200K rows/second returned without tuning a single thing which I doubt
> you're going to get with stored fields!

> So I think as usual, "it depends".

I would like to think so, as that implies that it does make sense to consider whether changes to the Doc Values codec representation cause a performance regression when using them to populate documents.

- Toke Eskildsen

Re: DocValues, retrieval performance and policy

Posted by Erick Erickson <er...@gmail.com>.

Toke:

I think part of it is locality. By that I mean two docValues fields in
the same document have no relation to each other in terms of their
location on disk. So _assuming_ all your DocValues can't be contained
in memory, you may be doing a bunch of disk seeks.

This is as opposed to just storing the fields, which implies one disk
seek/decompression for all fields of a given doc (assuming the 16K
block that is read and decompressed holds all the fields).
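
A toy sketch of the locality argument above, in Java. It is only a worst-case
cost model, not Lucene code: the class and method names are made up for
illustration, it assumes nothing is cached in memory, and it assumes the
stored-fields block holds every field of the document.

```java
// Toy worst-case cost model, NOT Lucene code: it only counts disk seeks.
final class SeekModel {

    /** Doc values are laid out column by column, so every (doc, field) pair may need its own seek. */
    static long docValuesSeeks(long docs, long fields) {
        return docs * fields;
    }

    /** Stored fields keep all fields of a doc together: one block read per doc (assumed to hold the whole doc). */
    static long storedSeeks(long docs) {
        return docs;
    }

    public static void main(String[] args) {
        long docs = 10;
        for (long fields : new long[] {1, 5, 100}) {
            System.out.println(fields + " fields: docValues=" + docValuesSeeks(docs, fields)
                    + " stored=" + storedSeeks(docs));
        }
    }
}
```

The model ignores decompression cost and the page cache, which is where the
"it depends" comes from: with few fields the per-doc decompression of stored
fields can dominate instead.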

And maybe part of it is the notion of stuffing large text fields into
a DocValues field just to return it seems like abusing DV.

That said, the Streaming code uses DV fields exclusively and I got
200K rows/second returned without tuning a single thing which I doubt
you're going to get with stored fields!

So I think as usual, "it depends".

On Mon, Sep 24, 2018 at 10:25 AM Toke Eskildsen <to...@kb.dk> wrote:
>
> David Smiley <da...@gmail.com> wrote:
> > I don't think it makes a difference if some people think docValues should
> > never be used for value-retrieval.  When that performance drop occurred
> > due to those changes, I'm sure it would have affected sorting & faceting
> > as well as value-retrieval. Some more than others perhaps.
>
> Yes. The iterative API is fine for relatively small jumps, so it works perfectly for sorting on medium to large result sets. Depending on the type of faceting, it's the same. Grouping and faceting on small result sets are (probably) affected relatively more, but as the amount of needed data is small in those cases, the (assumed) impact is not that high.
>
> Retrieving documents is different, as there are typically more fields involved and the number of documents itself is nearly always small, which means large jumps repeated for all the fields.
>
> > I don't see any disagreement about improving docValues in the ways
> > you suggest.
>
> You are right about that. I apologize if I was being unclear: It is not the concrete patch I am asking about, that's just how this started. I am asking for background on why it is considered misuse to use Doc Values for document retrieval.
>
> - Toke Eskildsen

Re: DocValues, retrieval performance and policy

Posted by Toke Eskildsen <to...@kb.dk>.

David Smiley <da...@gmail.com> wrote:
> I don't think it makes a difference if some people think docValues should
> never be used for value-retrieval.  When that performance drop occurred
> due to those changes, I'm sure it would have affected sorting & faceting
> as well as value-retrieval. Some more than others perhaps.

Yes. The iterative API is fine for relatively small jumps, so it works perfectly for sorting on medium to large result sets. Depending on the type of faceting, it's the same. Grouping and faceting on small result sets are (probably) affected relatively more, but as the amount of needed data is small in those cases, the (assumed) impact is not that high.

Retrieving documents is different, as there are typically more fields involved and the number of documents itself is nearly always small, which means large jumps repeated for all the fields.
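
A toy model of the "large jumps repeated for all the fields" effect, in Java. This is a sketch, not the Lucene implementation: ForwardOnlyColumn, cost and the block size of 65,536 docs are assumptions for illustration, and the model simply assumes that a forward-only iterator has to step past every block between its current position and the target.

```java
// Toy model, NOT the Lucene implementation: one forward-only column per field.
final class ForwardOnlyColumn {
    static final int BLOCK_SIZE = 65_536; // assumed block size, for illustration only

    private int currentBlock = 0;
    private long blocksSteppedPast = 0;

    /** Moves forward to the block holding docId, counting every block passed on the way. */
    void advanceExact(int docId) {
        int targetBlock = docId / BLOCK_SIZE;
        if (targetBlock < currentBlock) {
            throw new IllegalArgumentException("forward-only iterator cannot go back");
        }
        blocksSteppedPast += targetBlock - currentBlock;
        currentBlock = targetBlock;
    }

    long blocksSteppedPast() {
        return blocksSteppedPast;
    }

    /** Total blocks stepped past when fetching the ascending docIds from each of the given number of fields. */
    static long cost(int[] sortedDocIds, int fields) {
        long total = 0;
        for (int f = 0; f < fields; f++) {
            ForwardOnlyColumn column = new ForwardOnlyColumn(); // each field has its own iterator
            for (int docId : sortedDocIds) {
                column.advanceExact(docId);
            }
            total += column.blocksSteppedPast();
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docIds = new int[10];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i * 30_000_000; // 10 hits spread evenly over a 300M-doc segment
        }
        System.out.println("blocks stepped past, 1 field:   " + cost(docIds, 1));
        System.out.println("blocks stepped past, 20 fields: " + cost(docIds, 20));
    }
}
```

In this model the cost per field depends on how far the last hit is into the segment, not on how many hits there are, so a handful of retrieved documents pays nearly the same skipping cost as a large sorted result set, multiplied by the number of fields.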

> I don't see any disagreement about improving docValues in the ways
> you suggest.

You are right about that. I apologize if I was being unclear: It is not the concrete patch I am asking about, that's just how this started. I am asking for background on why it is considered misuse to use Doc Values for document retrieval.

- Toke Eskildsen

Re: DocValues, retrieval performance and policy

Posted by David Smiley <da...@gmail.com>.

I don't think it makes a difference if some people think docValues should
never be used for value-retrieval.  When that performance drop occurred due
to those changes, I'm sure it would have affected sorting & faceting as
well as value-retrieval.  Some more than others perhaps.  I don't see any
disagreement about improving docValues in the ways you suggest.  If
theoretically you proposed a change that helped the value-retrieval
use-case and hurt the non-controversial use-cases then I could see why you
want to raise this issue more publicly like you're doing here.

~ David

On Mon, Sep 24, 2018 at 8:40 AM Toke Eskildsen <to...@kb.dk> wrote:

> The Solr 7 switch to an iterative API for Doc Values
> https://issues.apache.org/jira/browse/LUCENE-7407
> meant a severe performance regression for Solr export and document
> retrieval with our web archive index, which is distinguished by having
> quite large segments (300M docs / 900GB) and using primarily doc values
> to hold field content.
>
> Technically there is a working patch
> https://issues.apache.org/jira/browse/LUCENE-8374
> but during discussion of performance measurements elsewhere
> https://github.com/mikemccand/luceneutil/issues/23
> it came up that doc values are not intended for document retrieval and
> as such that Lucene should not be optimized towards that.
>
>
> From my point of view, using doc values to build retrieval documents is
> quite natural: The data are there, so making a double representation by
> also making them stored seems a waste of space.
>
> If this is somehow a misuse of Doc Values, maybe someone could explain
> what the problem is or direct me towards more information?
>
> - Toke Eskildsen, Royal Danish Library
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com