Posted to solr-user@lucene.apache.org by Gael Jourdan-Weil <ga...@kelkoogroup.com> on 2020/07/15 17:35:09 UTC
Disk usage with useDocValuesAsStored
Hello,
I was wondering if we can expect significant disk usage reduction (index size) if we move from fields defined as "docValues=true + stored=true" to "docValues=true + stored=false" (with useDocValuesAsStored=true as default in both cases)?
Considering the use case we are targeting is only Streaming Expressions with the /export handler, I understand that we might also set useDocValuesAsStored=false, from what is described at https://lucene.apache.org/solr/guide/8_4/docvalues.html.
If so, would setting useDocValuesAsStored=false help reduce the index size as well?
We will obviously try it and see the results for ourselves, but I was wondering if you already have an idea about it.
Also, if you have any good link on how data is physically stored depending on the field options (indexed/stored/docValues), that would be really interesting.
Thanks,
Gaël
RE: Disk usage with useDocValuesAsStored
Posted by Gael Jourdan-Weil <ga...@kelkoogroup.com>.
Ok, makes sense.
Thanks for your answer Erick.
Gaël
________________________________
From: Erick Erickson <er...@gmail.com>
Sent: Wednesday, July 15, 2020 22:53
To: solr-user@lucene.apache.org <so...@lucene.apache.org>
Subject: Re: Disk usage with useDocValuesAsStored
You’re off track a bit. useDocValuesAsStored has no effect on the size on disk. It’s purely a runtime option that pulls the data to return from either the stored or docValues parts of the index. If you change the definition and reindex, you should see significant differences in the size of your index, particularly the “*.fdt/*.fdx” and “*.dvd/*.dvm” files, where stored and docValues are kept respectively.
However, it’s also apples and oranges. Specifically, using docValues as stored will _not_ necessarily return the fields the same way they were sent in the multiValued case. The docValues data is kept as a SORTED_SET, which means it’s both lexically sorted and deduplicated. So input like “a” “z” “h” “a” will return “a” “h” “z”.
Best,
Erick
Re: Disk usage with useDocValuesAsStored
Posted by Erick Erickson <er...@gmail.com>.
You’re off track a bit. useDocValuesAsStored has no effect on the size on disk. It’s purely a runtime option that pulls the data to return from either the stored or docValues parts of the index. If you change the definition and reindex, you should see significant differences in the size of your index, particularly the “*.fdt/*.fdx” and “*.dvd/*.dvm” files, where stored and docValues are kept respectively.
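One rough way to check this on a real index is to total the file sizes by extension before and after reindexing. The sketch below is only an illustration: the function name index_size_by_kind and the grouping into "stored", "docValues" and "other" are assumptions made here, and the directory you pass in is whatever path holds your core's index files. Segments written as compound files (.cfs) bundle everything together, so they would land under "other".

```python
from pathlib import Path

# Extensions named in the message above. Grouping them like this is an
# assumption of this sketch, not something Solr reports itself.
STORED_EXTS = {".fdt", ".fdx"}
DOCVALUES_EXTS = {".dvd", ".dvm"}


def index_size_by_kind(index_dir):
    """Sum file sizes (in bytes) in index_dir, grouped by kind of data."""
    totals = {"stored": 0, "docValues": 0, "other": 0}
    for path in Path(index_dir).iterdir():
        if not path.is_file():
            continue  # skip subdirectories
        ext = path.suffix.lower()
        if ext in STORED_EXTS:
            kind = "stored"
        elif ext in DOCVALUES_EXTS:
            kind = "docValues"
        else:
            kind = "other"
        totals[kind] += path.stat().st_size
    return totals
```

Running it once against the index built with stored=true and once after reindexing with stored=false should show the difference mostly in the "stored" total.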
However, it’s also apples and oranges. Specifically, using docValues as stored will _not_ necessarily return the fields the same way they were sent in the multiValued case. The docValues data is kept as a SORTED_SET, which means it’s both lexically sorted and deduplicated. So input like “a” “z” “h” “a” will return “a” “h” “z”.
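The effect on multiValued fields can be mimicked in a few lines. This is only a model of the behaviour described above, not Solr code, and docvalues_view is a made-up name; it assumes plain string values, for which Python's default ordering matches the lexical ordering in the example.

```python
def docvalues_view(values):
    """Model what comes back when a multiValued string field is read from
    docValues: duplicates are dropped and the values are sorted."""
    return sorted(set(values))


sent = ["a", "z", "h", "a"]   # values in the order they were indexed
print(docvalues_view(sent))   # ['a', 'h', 'z']
```

If the original order or the duplicates matter to the client, the field still needs stored=true.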
Best,
Erick