Posted to solr-user@lucene.apache.org by Walter Underwood <wu...@wunderwood.org> on 2019/05/30 22:41:07 UTC

Empty rows from /export?

3/4 of the documents I’m getting back from /export are empty. This collection has four shards, so I’m querying the leader core on each shard with /export. The results start like this:

{"numFound":912370,"docs":[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},

The final 1/4 of the results have UUIDs (the ID type). The id field is stored as docValues. This is the URL.

http://hostname:8983/solr/decks_shard1_replica1/export?q=id:*&distrib=false&shards=shard1&fl=id&sort=id+asc

Running 6.6.2, Solr Cloud. The total number of non-null ids from all four shards is a bit less than 1/4 of the document count.
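
A quick way to quantify the symptom described above is to count how many returned documents lack the field. The following is only a sketch: the helper name count_empty_docs is hypothetical, and it assumes the /export body is JSON with "numFound" and "docs" either at the top level (as in the snippet above) or nested under a "response" key.

```python
import json


def count_empty_docs(export_json: str, field: str = "id") -> dict:
    """Summarize how many exported docs are missing `field`.

    Assumption: the payload carries "numFound" and "docs" either at the
    top level or under a "response" key; both shapes are accepted.
    """
    payload = json.loads(export_json)
    body = payload.get("response", payload)
    docs = body.get("docs", [])
    # An empty row such as {} has no value for the requested field.
    missing = sum(1 for doc in docs if field not in doc)
    return {
        "numFound": body.get("numFound"),
        "returned": len(docs),
        "missing": missing,
        "present": len(docs) - missing,
    }
```

Running this against the saved output of each shard's /export call would show the missing/present split per core.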

Any ideas about what is going on?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

Re: Empty rows from /export?

Posted by Erick Erickson <er...@gmail.com>.

docValues are indeed realized in Lucene. It’s just that Lucene has no notion of “schema”. So when you define the schema, Solr carefully constructs the appropriate low-level Lucene calls when a doc is indexed to take care of all of the options you’ve specified in the schema, things like stored, indexed, docValues, etc.

Now we get to optimize. All Solr does is tell Lucene to mash together all the segments and Lucene does its tricks. Lucene assumes it “knows” everything it needs to know by what’s already in the segments it’s merging without reference to Solr’s schema. Therein lies the rub. If one segment has docValues for a field and another segment doesn’t, the result is “interesting”. In general, Lucene can’t reconstruct the original data.

From Robert Muir:
“I think the key issue here is Lucene is an index not a database. Because it is a lossy index and does not retain all of the user's data, its not possible to safely migrate some things automagically. In the norms case IndexWriter needs to re-analyze the text ("re-index") and compute stats to get back the value, so it can be re-encoded. The function is y = f(x) and if x is not available its not possible, so lucene can't do it.”

DocValues is a special case because all the data necessary to build docValues is already in the index, i.e. the indexed data (assuming you originally put it in with indexed=true). But it requires extra effort, thus the UninvertDocValuesMergePolicyFactory.
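
To illustrate why the indexed data is enough in this special case, here is a toy model of "uninverting". It is not Lucene code and not how UninvertDocValuesMergePolicyFactory is implemented; the function name, the dict-of-postings layout, and the single-valued-field assumption are all made up for the example.

```python
def uninvert(postings: dict, max_doc: int) -> list:
    """Rebuild a per-document column from an inverted index.

    Toy assumptions: `postings` maps each term to the list of doc ids
    containing it, and the field is single-valued (like a unique id).
    """
    column = [None] * max_doc
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            # Inverted index: term -> docs. Column: doc -> term.
            column[doc_id] = term
    return column


# Three docs whose id field was indexed; doc 3 has no indexed value.
postings = {"uuid-a": [2], "uuid-b": [0], "uuid-c": [1]}
column = uninvert(postings, max_doc=4)
```

The point of the sketch is only that a term-to-docs mapping can be flipped into a doc-to-value column when the field was indexed; a document with no indexed term stays empty.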

>> I was curious if it
>> was safe to change the id field to docValues without reindexing

I’d be very reluctant. It’s not something that’s explicitly tested or supported, so there are likely edge cases.

Best,
Erick

> On May 31, 2019, at 2:02 PM, David Hastings <ha...@gmail.com> wrote:
> 
>> Ah. So docValues are managed by Solr outside of Lucene. Interesting.
> 
> i was under the impression docValues are in lucene, and he is just saying
> that an optimize is not a re-index, its just taking the actual files that
> already exist in your index and arranging them and removing deletions, an
> optimize doesnt re-read the schema and re-index content

Re: Empty rows from /export?

Posted by David Hastings <ha...@gmail.com>.

> Ah. So docValues are managed by Solr outside of Lucene. Interesting.

I was under the impression docValues are in Lucene, and he is just saying
that an optimize is not a re-index. It’s just taking the actual files that
already exist in your index, rearranging them, and removing deletions. An
optimize doesn’t re-read the schema and re-index content.

On Fri, May 31, 2019 at 1:59 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> Ah. So docValues are managed by Solr outside of Lucene. Interesting.
>
> That actually answers a question I had not asked yet. I was curious if it
> was safe to change the id field to docValues without reindexing if we never
> sorted on it. It looks like fetching the value won’t work until everything
> is reindexed.
>
> It seems like this would be a useful thing to have supported, migrating a
> field to docValues.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)

Re: Empty rows from /export?

Posted by Walter Underwood <wu...@wunderwood.org>.

Ah. So docValues are managed by Solr outside of Lucene. Interesting. 

That actually answers a question I had not asked yet. I was curious if it was safe to change the id field to docValues without reindexing if we never sorted on it. It looks like fetching the value won’t work until everything is reindexed.

It seems like this would be a useful thing to have supported, migrating a field to docValues. 

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 31, 2019, at 5:00 AM, Erick Erickson <er...@gmail.com> wrote:
> 
> bq. but I optimized all the cores, which should rewrite every segment as docValues.
> 
> Not true. Optimize is a Lucene level force merge. Dealing with segments, i.e. merging and the like, is a low-level Lucene operation and Lucene has no notion of a schema. So a change you made to the schema is irrelevant to merging.
> 
> You have to have something at the Solr level that does some magic for this to work. Take a look at UninvertDocValuesMergePolicyFactory if you have Solr 7.0 or later. WARNING: I haven’t used that personally, and I do not know what the behavior would be on an index that is “mixed”, i.e. one that already has segments with some docs having DV entries and some not.
> 
> Best,
> Erick

Re: Empty rows from /export?

Posted by Erick Erickson <er...@gmail.com>.

bq. but I optimized all the cores, which should rewrite every segment as docValues.

Not true. Optimize is a Lucene-level force merge. Dealing with segments, i.e. merging and the like, is a low-level Lucene operation, and Lucene has no notion of a schema. So a change you made to the schema is irrelevant to merging.
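
A toy model may make this concrete. It is a deliberate simplification with made-up data structures, not Lucene's merge code: each "segment" is a dict with a doc count and an optional docValues column, and the merge only copies what the segments already contain, without consulting any schema.

```python
def force_merge(segments: list) -> dict:
    """Concatenate segments into one, copying only what they contain.

    Toy assumption: a segment written before the schema change has
    doc_values=None; the merge never looks at a schema, so those docs
    get no value in the merged column.
    """
    merged_count = 0
    merged_column = []
    for seg in segments:
        count = seg["doc_count"]
        column = seg.get("doc_values")
        merged_column.extend(column if column is not None else [None] * count)
        merged_count += count
    return {"doc_count": merged_count, "doc_values": merged_column}


old_segment = {"doc_count": 3, "doc_values": None}        # indexed before the change
new_segment = {"doc_count": 1, "doc_values": ["uuid-1"]}  # indexed after the change
merged = force_merge([old_segment, new_segment])
```

In this model the merged segment still has three documents with no docValues entry, which is consistent with the empty rows reported at the start of the thread.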

You have to have something at the Solr level that does some magic for this to work. Take a look at UninvertDocValuesMergePolicyFactory if you have Solr 7.0 or later. WARNING: I haven’t used that personally, and I do not know what the behavior would be on an index that is “mixed”, i.e. one that already has segments with some docs having DV entries and some not.

Best,
Erick

> On May 31, 2019, at 12:35 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> That field was changed to docValues, but I optimized all the cores, which should rewrite every segment as docValues.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)

Re: Empty rows from /export?

Posted by Walter Underwood <wu...@wunderwood.org>.

That field was changed to docValues, but I optimized all the cores, which should rewrite every segment as docValues.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 30, 2019, at 7:37 PM, Erick Erickson <er...@gmail.com> wrote:
> 
> This is odd. The only reason I know of that would happen is if there were no docValues for that field in those documents. By any chance were docValues added to an existing index without totally reindexing into a new collection?
> 
> What happens if you just query the collection rather than the individual core? I’m thinking using a streaming expression as a check…..

Re: Empty rows from /export?

Posted by Erick Erickson <er...@gmail.com>.

This is odd. The only reason I know of that would happen is if there were no docValues for that field in those documents. By any chance were docValues added to an existing index without totally reindexing into a new collection?

What happens if you just query the collection rather than the individual core? I’m thinking of using a streaming expression as a check.
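
One possible form of that check, sketched as a URL builder. The collection name "decks" is only inferred from the core name decks_shard1_replica1 in the original post, the helper name build_stream_url is hypothetical, and the sketch assumes the standard /stream handler is enabled on the collection.

```python
from urllib.parse import urlencode


def build_stream_url(base_url: str, collection: str, field: str = "id") -> str:
    """Build a /stream request that exports `field` from the whole collection.

    Uses the search() streaming expression with qt="/export" so the
    request goes through the collection instead of a single core.
    """
    expr = (
        f'search({collection}, q="{field}:*", fl="{field}", '
        f'sort="{field} asc", qt="/export")'
    )
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"


# "decks" is an assumed collection name, see the note above.
url = build_stream_url("http://hostname:8983/solr", "decks")
```

Fetching that URL (for example with curl) and comparing the number of tuples that carry an id against numFound would show whether the collection-level view has the same gaps.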

> On May 30, 2019, at 6:41 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> 3/4 of the documents I’m getting back from /export are empty. This collection has four shards, so I’m querying the leader core on each shard with /export. The results start like this:
> 
> {"numFound":912370,"docs":[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},
> 
> The final 1/4 of the results have UUIDs (the ID type). The id field is stored as docValues. This is the URL.
> 
> http://hostname:8983/solr/decks_shard1_replica1/export?q=id:*&distrib=false&shards=shard1&fl=id&sort=id+asc
> 
> Running 6.6.2, Solr Cloud. The total number of non-null ids from all four shards is a bit less than 1/4 of the document count.
> 
> Any ideas about what is going on?
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>