Posted to solr-user@lucene.apache.org by Gael Jourdan-Weil <ga...@kelkoogroup.com> on 2020/07/15 17:35:09 UTC

Disk usage with useDocValuesAsStored

Hello,

I was wondering whether we can expect a significant reduction in disk usage (index size) if we move from fields defined as "docValues=true + stored=true" to "docValues=true + stored=false" (with useDocValuesAsStored=true, the default, in both cases)?

Since our only use case is Streaming Expressions with the /export handler, I understand that we could also set useDocValuesAsStored=false, based on what is described at https://lucene.apache.org/solr/guide/8_4/docvalues.html.
If so, would setting useDocValuesAsStored=false help reduce the index size as well?

We will obviously try it and see the results for ourselves, but I was wondering if you already have an idea about it.
Also, if you have a good link on how data is physically stored depending on the field options (indexed/stored/docValues), that would be really interesting.

Thanks,
Gaël

RE: Disk usage with useDocValuesAsStored

Posted by Gael Jourdan-Weil <ga...@kelkoogroup.com>.
Ok, makes sense.
Thanks for your answer Erick.

Gaël



Re: Disk usage with useDocValuesAsStored

Posted by Erick Erickson <er...@gmail.com>.
You’re off track a bit. useDocValuesAsStored has no effect on the size on disk. It’s purely a runtime option that controls whether the data to return is pulled from the stored or the docValues part of the index. If you change the stored or docValues setting in the field definition and reindex, you should see significant differences in the size of your index, particularly in the “*.fdt/*.fdx” and “*.dvd/*.dvm” files, where stored and docValues data are kept respectively.
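
To compare index sizes before and after a reindex, something like the following Python sketch could total the bytes per file type in an index directory. It only relies on the file extensions mentioned above; the function name, the "stored"/"docValues"/"other" labels, and the example path are assumptions made for illustration, not anything Solr provides.

```python
import os
from collections import defaultdict

# Extensions named in the reply above; the group labels are this sketch's own.
GROUPS = {
    ".fdt": "stored",
    ".fdx": "stored",
    ".dvd": "docValues",
    ".dvm": "docValues",
}

def index_size_by_group(index_dir):
    """Sum file sizes (bytes) in index_dir, grouped as stored / docValues / other."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        ext = os.path.splitext(name)[1]
        totals[GROUPS.get(ext, "other")] += os.path.getsize(path)
    return dict(totals)

if __name__ == "__main__":
    # Hypothetical path; point it at the core's actual data/index directory.
    print(index_size_by_group("/path/to/core/data/index"))
```

Running it once against the old index and once against the reindexed one gives a rough per-type comparison. Note that small segments may be packed into compound (.cfs) files, in which case their bytes end up under "other" here.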

However, it’s also apples and oranges. Specifically, using docValues as stored will _not_ necessarily return the fields the same way they were sent in the multiValued case. The docValues data is kept as a SORTED_SET, which means it’s both lexically sorted and deduplicated. So input like “a” “z” “h” “a” will return “a” “h” “z”.
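
The effect can be modelled in a couple of lines of Python. This is only an illustration of the sort-and-deduplicate behaviour described above, not Solr or Lucene code, and the function name is made up for the example.

```python
def as_sorted_set(values):
    """Model SORTED_SET semantics: drop duplicates, then sort."""
    return sorted(set(values))

# The input from the example above, in the order it was sent in.
print(as_sorted_set(["a", "z", "h", "a"]))  # ['a', 'h', 'z']
```

So a client that depends on the original order or on repeated values in a multiValued field would notice the difference when the values come from docValues instead of stored.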

Best,
Erick
