Posted to java-user@lucene.apache.org by Wei Wang <we...@gmail.com> on 2013/04/09 17:22:57 UTC

DocValues space usage

DocValues makes fast per-doc value lookup possible, which is nice. But it
brings other interesting issues.

Assume there are 100M docs and 200 NumericDocValuesFields; this ends up
with a huge amount of disk and memory usage, even if there are just thousands
of values for each field. I guess this is because Lucene stores a value for
each DocValues field of each document, with a variable-length codec.

So in such a scenario, is it possible to only store values for the DocValues
field of the documents that actually have a value for that field? Or does
Lucene have a column storage mechanism for DocValues, sort of like a hash map:

key: the docId that has a value for the DocValues field
value: the value of the DocValues field
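
Something like this hypothetical in-memory equivalent, just to illustrate
the idea (names made up, not Lucene API):

    import java.util.HashMap;
    import java.util.Map;

    // Purely illustrative: one sparse map per field, with entries
    // only for the docs that actually have a value.
    Map<Integer, Long> fieldColumn = new HashMap<Integer, Long>();
    fieldColumn.put(17, 42L); // docId 17 -> value 42; all other docs cost nothing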

I am using Lucene 4.2.1.

Thanks

Re: DocValues space usage

Posted by Wei Wang <we...@gmail.com>.
Adrien and Robert, thanks a lot for the hints. Will try a few options and
see how it goes.

On Tue, Apr 9, 2013 at 9:25 AM, Robert Muir <rc...@gmail.com> wrote:

> On Tue, Apr 9, 2013 at 9:11 AM, Adrien Grand <jp...@gmail.com> wrote:
>
> > The default codec stores numeric doc values in blocks of 4096 values
> > that have independent numbers of bits per value. If you end up having
> > most of these blocks empty, doc values will require little space, but
> > in a worst-case scenario where each block contains a single value, it
> > is true that memory and disk usage will be very inefficient.
> >
>
> Also the default codec has a performance hack (depending on
> acceptableOverheadRatio) for optimizing the single-byte case (e.g. norms or
> other SmallFloat scoring factors). In this case it doesn't even use
> BlockPackedWriter at all.
>
> That's why I recommended the diskdv codec instead... the concepts are the
> same, but it's not yet "optimized", so it's easier to understand what's
> going on :)
>

Re: DocValues space usage

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Apr 9, 2013 at 9:11 AM, Adrien Grand <jp...@gmail.com> wrote:

> The default codec stores numeric doc values in blocks of 4096 values
> that have independent numbers of bits per value. If you end up having
> most of these blocks empty, doc values will require little space, but
> in a worst-case scenario where each block contains a single value, it
> is true that memory and disk usage will be very inefficient.
>

Also the default codec has a performance hack (depending on
acceptableOverheadRatio) for optimizing the single-byte case (e.g. norms or
other SmallFloat scoring factors). In this case it doesn't even use
BlockPackedWriter at all.

That's why I recommended the diskdv codec instead... the concepts are the
same, but it's not yet "optimized", so it's easier to understand what's
going on :)

Re: DocValues space usage

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

On Tue, Apr 9, 2013 at 5:22 PM, Wei Wang <we...@gmail.com> wrote:
> DocValues makes fast per-doc value lookup possible, which is nice. But it
> brings other interesting issues.
>
> Assume there are 100M docs and 200 NumericDocValuesFields; this ends up
> with a huge amount of disk and memory usage, even if there are just thousands
> of values for each field. I guess this is because Lucene stores a value for
> each DocValues field of each document, with a variable-length codec.

The default codec stores numeric doc values in blocks of 4096 values
that have independent numbers of bits per value. If you end up having
most of these blocks empty, doc values will require little space, but
in a worst-case scenario where each block contains a single value, it
is true that memory and disk usage will be very inefficient.
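
(Concretely: 100M docs / 4096 ≈ 24,400 blocks per field. If one of your
fields has, say, 5,000 values spread randomly over the doc-id space, they
can land in up to 5,000 distinct blocks, and every block that contains at
least one value encodes all 4096 of its slots at that block's bits per
value. That is roughly 20M encoded slots for 5,000 real values.)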

> So in such a scenario, is it possible to only store values for the DocValues
> field of the documents that actually have a value for that field? Or does
> Lucene have a column storage mechanism for DocValues, sort of like a hash map:
>
> key: the docId that has a value for the DocValues field
> value: the value of the DocValues field

Lucene doesn't have HashMap-like storage for doc values, although it
would be doable to build a DocValuesFormat that works this way.

However, for your problem, I would recommend that you encode your
numeric data on top of BinaryDocValues. In contrast to
NumericDocValues, BinaryDocValues require very little space for
missing values. All you need are conversion methods between your
numeric data and byte arrays.
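
For example, a minimal sketch with a single long per document (the field
name and big-endian encoding are arbitrary choices, and I assume you
already have an AtomicReader and a docId in hand at search time):

    import org.apache.lucene.document.BinaryDocValuesField;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    // Index time: only documents that have a value get the field at all.
    Document doc = new Document();
    long value = 42L;
    byte[] buf = new byte[8];
    for (int i = 0; i < 8; i++) {
      buf[i] = (byte) (value >>> (56 - 8 * i)); // big-endian long -> bytes
    }
    doc.add(new BinaryDocValuesField("myNumber", new BytesRef(buf)));

    // Search time: documents that never got the field yield empty bytes.
    BinaryDocValues dv = atomicReader.getBinaryDocValues("myNumber");
    BytesRef scratch = new BytesRef();
    dv.get(docId, scratch);
    if (scratch.length == 8) {
      long decoded = 0;
      for (int i = 0; i < 8; i++) {
        decoded = (decoded << 8) | (scratch.bytes[scratch.offset + i] & 0xFFL);
      }
      // decoded == 42 for our example doc
    } // scratch.length == 0 means the doc has no value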

-- 
Adrien



Re: DocValues space usage

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Apr 9, 2013 at 9:06 AM, Wei Wang <we...@gmail.com> wrote:

> Thanks for the hint. Could you point to some Codec that might do this for
> some types, even just as an side effect as you mentioned? It will be
> helpful to have something to start with.
>

Have a look at the diskdv/ codec in the codecs/ module. It's a lot simpler
than the default codec because it doesn't have the "trade off speed for
space" performance hacks of the default codec. It might already do something
that's good enough for your needs.
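
Wiring it in is roughly this, from memory (double-check the class names
against the codecs jar before relying on it):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.codecs.DocValuesFormat;
    import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
    import org.apache.lucene.codecs.lucene42.Lucene42Codec;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    Analyzer analyzer = ...; // whatever analyzer you already use
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
    iwc.setCodec(new Lucene42Codec() {
      private final DocValuesFormat disk = new DiskDocValuesFormat();
      @Override
      public DocValuesFormat getDocValuesFormatForField(String field) {
        return disk; // use diskdv for every docvalues field
      }
    });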


>
> And could you elaborate a bit more on "facet on tons of sparse
> fields"? I just got a vague idea from the comments.
>

Look at the lucene/facet module. As opposed to applications like Solr and
Elasticsearch, which would build fieldcaches/docvalues/whatever on hundreds
of "fields", I think this one uses just a single binary docvalues field to
implement ordinal storage across all "fields" (I think it calls them
dimensions or something like that).
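
From memory of the 4.2-era facet API, indexing a few sparse "dimensions"
on a doc looks roughly like this (class names worth double-checking
against the module):

    import java.util.Arrays;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.facet.index.FacetFields;
    import org.apache.lucene.facet.taxonomy.CategoryPath;

    // taxoWriter: a DirectoryTaxonomyWriter opened alongside your IndexWriter
    FacetFields facetFields = new FacetFields(taxoWriter);
    Document doc = new Document();
    // every dimension's ordinal ends up in one shared binary docvalues
    // field, instead of one docvalues field per dimension
    facetFields.addFields(doc, Arrays.asList(
        new CategoryPath("author", "Bob"),
        new CategoryPath("year", "2013")));
    writer.addDocument(doc);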

Of course you can simulate this yourself with other approaches too.

Re: DocValues space usage

Posted by Wei Wang <we...@gmail.com>.
Thanks for the hint. Could you point to some Codec that might do this for
some types, even just as a side effect as you mentioned? It would be
helpful to have something to start with.

And could you elaborate a bit more on "facet on tons of sparse
fields"? I just got a vague idea from the comments.

On Tue, Apr 9, 2013 at 8:51 AM, Robert Muir <rc...@gmail.com> wrote:

> On Tue, Apr 9, 2013 at 8:22 AM, Wei Wang <we...@gmail.com> wrote:
>
> > DocValues makes fast per-doc value lookup possible, which is nice. But it
> > brings other interesting issues.
> >
> > Assume there are 100M docs and 200 NumericDocValuesFields; this ends up
> > with a huge amount of disk and memory usage, even if there are just
> > thousands of values for each field. I guess this is because Lucene stores
> > a value for each DocValues field of each document, with a variable-length
> > codec.
> >
> > So in such a scenario, is it possible to only store values for the
> > DocValues field of the documents that actually have a value for that
> > field? Or does Lucene have a column storage mechanism for DocValues, sort
> > of like a hash map:
> >
>
> This really depends on the details of the Codec's encoding. So if it's
> important to you, the easiest way is to write a codec that compresses things
> the way you want (e.g. a two-stage table, block RLE, something like that).
> Maybe this would be a good contribution to add to the codecs/ module of
> Lucene so other people could use it too.
>
> I think some of the codecs might do this for some types, maybe just as an
> accidental side effect of their current compression/encoding (e.g. use of
> BlockPackedWriter). But it's not something we really optimize for, as it
> doesn't make sense for a lot of docvalues use cases like scoring factors or
> faceting. For example, if you want to facet on tons of sparse fields, it's
> probably better to use Lucene's faceting module, which uses one combined
> docvalues field for the document... I think.
>

Re: DocValues space usage

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Apr 9, 2013 at 8:22 AM, Wei Wang <we...@gmail.com> wrote:

> DocValues makes fast per-doc value lookup possible, which is nice. But it
> brings other interesting issues.
>
> Assume there are 100M docs and 200 NumericDocValuesFields; this ends up
> with a huge amount of disk and memory usage, even if there are just thousands
> of values for each field. I guess this is because Lucene stores a value for
> each DocValues field of each document, with a variable-length codec.
>
> So in such a scenario, is it possible to only store values for the DocValues
> field of the documents that actually have a value for that field? Or does
> Lucene have a column storage mechanism for DocValues, sort of like a hash map:
>

This really depends on the details of the Codec's encoding. So if it's
important to you, the easiest way is to write a codec that compresses things
the way you want (e.g. a two-stage table, block RLE, something like that).
Maybe this would be a good contribution to add to the codecs/ module of
Lucene so other people could use it too.

I think some of the codecs might do this for some types, maybe just as an
accidental side effect of their current compression/encoding (e.g. use of
BlockPackedWriter). But it's not something we really optimize for, as it
doesn't make sense for a lot of docvalues use cases like scoring factors or
faceting. For example, if you want to facet on tons of sparse fields, it's
probably better to use Lucene's faceting module, which uses one combined
docvalues field for the document... I think.
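
To make the "block RLE" idea concrete, here is a toy sketch of the
encoding (my own illustration, not Lucene API): store each present value
together with the number of valueless docs before it, so long runs of
absent docs collapse to a single counter.

    import java.util.ArrayList;
    import java.util.List;

    class SparseRle {
      // Encode a sparse column as (gap, value) pairs: gap is how many
      // docs without a value precede this one. null means "no value".
      static List<long[]> encode(Long[] column) {
        List<long[]> runs = new ArrayList<long[]>();
        int gap = 0;
        for (Long v : column) {
          if (v == null) {
            gap++;                           // extend the run of absent docs
          } else {
            runs.add(new long[] { gap, v }); // one entry per real value
            gap = 0;
          }
        }
        return runs;
      }
    }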