Posted to solr-user@lucene.apache.org by John Davis <jo...@gmail.com> on 2019/06/09 19:50:41 UTC

Enabling/disabling docValues

Hi there,
We recently changed a field from TextField with no docValues to
SortableTextField, which has docValues enabled by default. Since then I
do not see any facet values for the field. I know that once all the docs
are re-indexed facets should work again, but can someone clarify how
Lucene/Solr currently computes facets when the schema is changed from
no docValues to docValues, and vice versa?

1. Until ALL the docs are re-indexed, no facets will be returned?
2. Once a certain fraction of docs are re-indexed, those facets will be
returned?
3. Something else?


Varun

Re: Enabling/disabling docValues

Posted by John Davis <jo...@gmail.com>.
There is no way to match case-insensitively without a TextField with no
tokenization. It's a long-standing limitation that no analyzers can be
applied to str fields.
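
To make the workaround concrete: a TextField whose analyzer uses a keyword tokenizer plus a lowercase filter keeps the whole value as a single token while still matching case-insensitively. Below is a minimal Python sketch that only builds a Schema API `add-field-type` command; the type name `string_ci` is a made-up example, and nothing is sent to a server.

```python
import json


def case_insensitive_string_type(name: str = "string_ci") -> dict:
    """Build a Schema API 'add-field-type' command for a TextField that is
    not split into words but is lowercased by its analyzer."""
    return {
        "add-field-type": {
            "name": name,  # hypothetical type name, chosen for illustration
            "class": "solr.TextField",
            "analyzer": {
                # KeywordTokenizer emits the entire input as one token.
                "tokenizer": {"class": "solr.KeywordTokenizerFactory"},
                # Lowercasing that one token gives case-insensitive matching.
                "filters": [{"class": "solr.LowerCaseFilterFactory"}],
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(case_insensitive_string_type(), indent=2))
```

The resulting JSON would be POSTed to the collection's schema endpoint; that step is left out here on purpose.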

Thanks for pointing out the re-index page; I've seen it. However, sometimes
it is hard to re-index within a reasonable amount of time and resources, and
if we help power users understand the system better, they can make more
informed tradeoffs.

On Tue, Jun 11, 2019 at 6:52 AM Gus Heck <gu...@gmail.com> wrote:

> On Mon, Jun 10, 2019 at 10:53 PM John Davis <jo...@gmail.com>
> wrote:
>
> > You have made many assumptions which might not always be realistic a)
> > TextField is always tokenized
>
>
> Well, you could of course change configuration or code to do something else
> but this would be a very odd and misleading thing to do and we would expect
> you to have mentioned it.
>
>
> > b) Users care about precise counts and
>
>
> This is indeed use-case dependent if you are talking about approximately
> correct counts (150 vs 152, etc.), but it's pretty reasonable to say that
> gross errors (75 vs 153, or 0 vs 5, etc.) more or less make faceting pointless.
>
>
> > c) Users have the luxury or ability to do a full re-index anytime.
>
>
> This is a state of affairs we consistently advise against. The reason we
> give that advice is precisely that one cannot safely change the schema out
> from under an existing index without rewriting the index. Without
> extremely careful design on your side (avoiding certain features and
> accepting high storage requirements), your index will not retain enough
> information to rebuild itself. Therefore, it is a long-standing bad
> practice not to have a separate canonical copy of the data and a means to
> re-index it (or a design where only the very most recent data is
> important, and a copy of that). There is a whole page dedicated to
> reindexing in the ref guide:
> https://lucene.apache.org/solr/guide/8_0/reindexing.html Here's a relevant
> bit from the current version:
>
> `There is no process in Solr for programmatically reindexing data. When we
> say "reindex", we mean, literally, "index it again". However you got the
> data into the index the first time, you will run that process again. It is
> strongly recommended that Solr users index their data in a repeatable,
> consistent way, so that the process can be easily repeated when the need
> for reindexing arises.`
>
>
> The ref guide has lots of nice info; maybe you should read it rather than
> snubbing one of the nicest and most knowledgeable committers on the project
> (who is helping you for free) by haughtily saying you'll go ask someone
> else... And if you've been left in this situation (no ability to reindex)
> by your predecessor, you have our deepest sympathies, but it still doesn't
> change the fact that you need to break it to management that your
> predecessor has lost the data required to maintain the system, and you
> still need to re-index whatever you can salvage somehow, or start fresh.
>
> When Erick is saying you shouldn't be asking that question... >90% of the
> time you really shouldn't be, and if you do pursue it, you'll just waste a
> lot of your own time.
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: Enabling/disabling docValues

Posted by Gus Heck <gu...@gmail.com>.
On Mon, Jun 10, 2019 at 10:53 PM John Davis <jo...@gmail.com>
wrote:

> You have made many assumptions which might not always be realistic a)
> TextField is always tokenized


Well, you could of course change configuration or code to do something else
but this would be a very odd and misleading thing to do and we would expect
you to have mentioned it.


> b) Users care about precise counts and


This is indeed use-case dependent if you are talking about approximately
correct counts (150 vs 152, etc.), but it's pretty reasonable to say that
gross errors (75 vs 153, or 0 vs 5, etc.) more or less make faceting pointless.


> c) Users have the luxury or ability to do a full re-index anytime.


This is a state of affairs we consistently advise against. The reason we
give that advice is precisely that one cannot safely change the schema out
from under an existing index without rewriting the index. Without
extremely careful design on your side (avoiding certain features and
accepting high storage requirements), your index will not retain enough
information to rebuild itself. Therefore, it is a long-standing bad
practice not to have a separate canonical copy of the data and a means to
re-index it (or a design where only the very most recent data is
important, and a copy of that). There is a whole page dedicated to
reindexing in the ref guide:
https://lucene.apache.org/solr/guide/8_0/reindexing.html Here's a relevant
bit from the current version:

`There is no process in Solr for programmatically reindexing data. When we
say "reindex", we mean, literally, "index it again". However you got the
data into the index the first time, you will run that process again. It is
strongly recommended that Solr users index their data in a repeatable,
consistent way, so that the process can be easily repeated when the need
for reindexing arises.`
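
One way to read that passage: keep the source data outside Solr and make the load step a script you can rerun at will. A minimal Python sketch of the idea follows. The collection URL and batch size are placeholders, and the code only builds JSON update requests in batches; it does not send them.

```python
import json
from typing import Iterable, Iterator, List
from urllib.request import Request


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield the documents in lists of at most `size` items."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def update_request(collection_url: str, batch: List[dict]) -> Request:
    """Build (but do not send) a JSON update request for one batch.

    `collection_url` is a placeholder such as
    http://localhost:8983/solr/my_collection.
    """
    return Request(
        url=f"{collection_url}/update",
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Because the documents come from the canonical copy rather than from the index, running the same script against a fresh collection gives a clean re-index.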


The ref guide has lots of nice info; maybe you should read it rather than
snubbing one of the nicest and most knowledgeable committers on the project
(who is helping you for free) by haughtily saying you'll go ask someone
else... And if you've been left in this situation (no ability to reindex)
by your predecessor, you have our deepest sympathies, but it still doesn't
change the fact that you need to break it to management that your
predecessor has lost the data required to maintain the system, and you
still need to re-index whatever you can salvage somehow, or start fresh.

When Erick is saying you shouldn't be asking that question... >90% of the
time you really shouldn't be, and if you do pursue it, you'll just waste a
lot of your own time.



-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Enabling/disabling docValues

Posted by John Davis <jo...@gmail.com>.
You have made many assumptions which might not always be realistic: a)
TextField is always tokenized, b) Users care about precise counts, and c)
Users have the luxury or ability to do a full re-index anytime. These are
real issues and there is no black-and-white solution. I will ask the Lucene
folks about the actual implementation.

On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson <er...@gmail.com>
wrote:

> bq. Does lucene look at %docs in each state, or the first doc or something
> else?
>
> Frankly I don’t care, since no matter what, the results of faceting mixed
> definitions are not useful.
>
> tl;dr;
>
> “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> means just what I choose it to mean — neither more nor less.’
>
> So “undefined" in this case means “I don’t see any value at all in chasing
> that info down” ;).
>
> Changing from regular text to SortableText means that the results will be
> inaccurate no matter what. For example, I have a doc with the value “my dog
> has fleas”. When NOT using SortableText, there are multiple tokens so facet
> counts would be:
>
> my (1)
> dog (1)
> has (1)
> fleas (1)
>
> But for SortableText they will be:
>
> my dog has fleas (1)
>
> Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> doc1 was  indexed before switching to SortableText and doc2 after.
> Presumably  the output you want is:
>
> my dog has fleas (1)
> my cat has fleas (1)
>
> But you can’t get that output.  There are three cases:
>
> 1> Lucene treats all documents as SortableText, faceting on the docValues
> parts. No facets on doc1
>
> my  cat has fleas (1)
>
> 2> Lucene treats all documents as tokenized, faceting on each individual
> token. Faceting is performed on the tokenized content of both,  docValues
> in doc2  ignored
>
> my  (2)
> dog (1)
> has (2)
> fleas (2)
> cat (1)
>
>
> 3> Lucene does the best it can, faceting on the tokens for docs without
> SortableText and docValues if the doc was indexed with Sortable text. doc1
> faceted on tokenized, doc2 on docValues
>
> my  (1)
> dog (1)
> has (1)
> fleas (1)
> my cat has fleas (1)
>
> Since none of those cases is what I want, there’s no point I can see in
> chasing down what actually happens….
>
> Best,
> Erick
>
> P.S. I _think_ Lucene tries to use the definition from the first segment,
> but the lists of segments to be merged are built without looking at the
> field definitions at all, so whether the first segment in the list has
> SortableText or not is not predictable in a general way, even within a
> single run.
>
>

Re: Enabling/disabling docValues

Posted by Erick Erickson <er...@gmail.com>.
bq. Does lucene look at %docs in each state, or the first doc or something else?

Frankly I don’t care, since no matter what, the results of faceting mixed definitions are not useful.

tl;dr;

“When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’

So “undefined” in this case means “I don’t see any value at all in chasing that info down” ;).

Changing from regular text to SortableText means that the results will be inaccurate no matter what. For example, I have a doc with the value “my dog has fleas”. When NOT using SortableText, there are multiple tokens, so the facet counts would be:

my (1)
dog (1)
has (1)
fleas (1)

But for SortableText they will be:

my dog has fleas (1)

Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”. doc1 was indexed before switching to SortableText and doc2 after. Presumably the output you want is:

my dog has fleas (1)
my cat has fleas (1)

But you can’t get that output. There are three cases:

1> Lucene treats all documents as SortableText, faceting on the docValues parts. No facets on doc1:

my cat has fleas (1)

2> Lucene treats all documents as tokenized, faceting on each individual token. Faceting is performed on the tokenized content of both; the docValues in doc2 are ignored:

my (2)
dog (1)
has (2)
fleas (2)
cat (1)


3> Lucene does the best it can, faceting on the tokens for docs without SortableText and on docValues if the doc was indexed with SortableText. doc1 is faceted on tokens, doc2 on docValues:

my (1)
dog (1)
has (1)
fleas (1)
my cat has fleas (1)

Since none of those cases is what I want, there’s no point I can see in chasing down what actually happens….
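
For anyone who wants to play with the three cases, here is a toy Python model of them. It is only an illustration of the hypothetical behaviors listed above, not a description of what Lucene actually does; whitespace tokenization and the mode names are assumptions made for the sketch.

```python
from collections import Counter

# (field value, was the doc indexed with SortableText/docValues?)
DOCS = [
    ("my dog has fleas", False),  # doc1: indexed before the schema change
    ("my cat has fleas", True),   # doc2: indexed after the schema change
]


def facet_counts(docs, mode):
    """Count facet values under one of the three hypothetical behaviors.

    mode is 'docvalues' (case 1), 'tokens' (case 2) or 'mixed' (case 3).
    """
    if mode not in ("docvalues", "tokens", "mixed"):
        raise ValueError(f"unknown mode: {mode}")
    counts = Counter()
    for text, has_doc_values in docs:
        use_doc_values = mode == "docvalues" or (mode == "mixed" and has_doc_values)
        if use_doc_values:
            # Only docs that really have docValues contribute a whole-value facet.
            if has_doc_values:
                counts[text] += 1
        else:
            # Each distinct token in the doc counts once for that doc.
            counts.update(set(text.split()))
    return dict(counts)


if __name__ == "__main__":
    for mode in ("docvalues", "tokens", "mixed"):
        print(mode, facet_counts(DOCS, mode))
```

None of the three modes produces the two whole-value facets one would actually want for both documents.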

Best,
Erick

P.S. I _think_ Lucene tries to use the definition from the first segment, but the lists of segments to be merged are built without looking at the field definitions at all, so whether the first segment in the list has SortableText or not is not predictable in a general way, even within a single run.


> On Jun 9, 2019, at 6:53 PM, John Davis <jo...@gmail.com> wrote:
> 
> Understood, however code is rarely random/undefined. Does lucene look at %
> docs in each state, or the first doc or something else?
> 


Re: Enabling/disabling docValues

Posted by John Davis <jo...@gmail.com>.
Understood, however code is rarely random/undefined. Does lucene look at %
docs in each state, or the first doc or something else?

On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson <er...@gmail.com>
wrote:

> It’s basically undefined. When segments that have dissimilar definitions
> like this are merged, what can Lucene do? Consider:
>
> Faceting on a text (not sortable) field means that each individual token
> in the index is uninverted on the Java heap and the facets are computed
> for each individual term.
>
> Faceting on a SortableText field uses just a single term per document, and
> that term is in the docValues structures as opposed to the inverted index.
>
> Now you change the field definition and start indexing. At some point a
> segment containing no docValues is merged with a segment containing
> docValues for the field. The resulting mixed segment is in this undefined
> state. If you facet on the field, should the docs without docValues have
> each individual term counted? Or just the SortableText values in the
> docValues structure? Neither one is right.
>
> Also remember that Lucene has no notion of schema. That’s entirely imposed
> on Lucene by Solr carefully constructing low-level analysis chains.
>
> So I’d _strongly_ recommend you re-index your corpus to a new collection
> with the current definition, then perhaps use CREATEALIAS to seamlessly
> switch.
>
> Best,
> Erick
>

Re: Enabling/disabling docValues

Posted by Erick Erickson <er...@gmail.com>.
It’s basically undefined. When segments that have dissimilar definitions like this are merged, what can Lucene do? Consider:

Faceting on a text (not sortable) field means that each individual token in the index is uninverted on the Java heap and the facets are computed for each individual term.

Faceting on a SortableText field uses just a single term per document, and that term is in the docValues structures as opposed to the inverted index.
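
The difference between the two structures can be sketched in a few lines of Python. This is a deliberately simplified model (whitespace tokenization, plain dicts), not Lucene's actual data layout, and the function names are invented for the example.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Inverted index: term -> set of ids of the docs containing that term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def uninvert(index):
    """Turn term -> docs back into doc -> terms, held in memory."""
    doc_terms = defaultdict(set)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            doc_terms[doc_id].add(term)
    return doc_terms


def facet_tokenized(docs):
    """Facet a tokenized text field: one count per (doc, distinct token)."""
    counts = Counter()
    for terms in uninvert(build_inverted_index(docs)).values():
        counts.update(terms)
    return dict(counts)


def facet_doc_values(docs):
    """Facet a docValues-style column: doc id -> one whole field value."""
    doc_values = dict(docs)  # already doc-oriented, so nothing to uninvert
    return dict(Counter(doc_values.values()))
```

With two docs whose values share most of their words, `facet_tokenized` returns per-word counts while `facet_doc_values` returns one bucket per whole value, which is why mixing the two inside one field cannot give a sensible answer.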

Now you change the field definition and start indexing. At some point a segment containing no docValues is merged with a segment containing docValues for the field. The resulting mixed segment is in this undefined state. If you facet on the field, should the docs without docValues have each individual term counted? Or just the SortableText values in the docValues structure? Neither one is right.

Also remember that Lucene has no notion of schema. That’s entirely imposed on Lucene by Solr carefully constructing low-level analysis chains.

So I’d _strongly_ recommend you re-index your corpus to a new collection with the current definition, then perhaps use CREATEALIAS to seamlessly switch.
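
As a rough sketch of the alias switch: once the new collection is fully indexed, a single Collections API call points the alias at it, and clients that query the alias never notice. The Python below only builds the request URL; the host, alias name and collection name are placeholders.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base: str, alias: str, collection: str) -> str:
    """Build the Collections API URL that creates `alias` or re-points it
    at `collection`. All three arguments are placeholders for this sketch."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{solr_base}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical names: clients query 'products', data lives in 'products_v2'.
    print(create_alias_url("http://localhost:8983/solr", "products", "products_v2"))
```

Issuing the same call later with a newer collection name moves the alias again, so the old collection can be dropped once nothing refers to it.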

Best,
Erick

> On Jun 9, 2019, at 12:50 PM, John Davis <jo...@gmail.com> wrote:
> 
> Hi there,
> We recently changed a field from TextField with no docValues to
> SortableTextField, which has docValues enabled by default. Since then I
> do not see any facet values for the field. I know that once all the docs
> are re-indexed facets should work again, but can someone clarify how
> Lucene/Solr currently computes facets when the schema is changed from
> no docValues to docValues, and vice versa?
> 
> 1. Until ALL the docs are re-indexed, no facets will be returned?
> 2. Once a certain fraction of docs are re-indexed, those facets will be
> returned?
> 3. Something else?
> 
> 
> Varun