Posted to dev@solr.apache.org by Wei <we...@gmail.com> on 2023/01/01 01:53:30 UTC

Re: Enable different schemas per shard based on core.properties

Hi Bram,

Can you explain a bit more about the approach? How does Solr Cloud maintain
different schemas when updating a mixture of old and new documents in the
same segment?

Thanks, and happy new year!

- Wei

On Fri, Dec 23, 2022 at 8:21 AM Bram Van Dam <br...@intix.eu> wrote:

> Greetings,
>
> We ran into a pretty hairy problem on 7.7. TL;DR; we had to enable
> docValues on the unique key field in a large SolrCloud instance, without
> being able to reindex old data.
>
> This kind of worked, by specifying different config sets in
> core.properties for different shards, where new shards would get the
> schema from ZK and newly indexed data would (correctly) use DocValues,
> while old data in older shards remained unaffected.
>
> This broke when old data was modified: Solr would use the new schema for
> the updates, and the index would get corrupted because documents with
> and without docValues would be mixed in the same segment in the same
> core, which resulted in errors when retrieving the documents (curiously,
> not when merging the segments?).
>
> The linked patch, by my colleague Danny, allows Solr to use the correct
> schema when updating data in these old shards (based on the
> configuration in core.properties).
>
> We realize that this is a pretty ugly hack for a rather specific
> problem. But at the same time, Solr allows for different configSets to
> be specified for different cores, and this patch sort of improves
> support for that.
>
> This applies cleanly on (the admittedly ancient) branch_7_7. All tests
> are green, precommit checks are OK.
>
> If there is any interest in this patch, we might be able to look into
> making it available on master or branch_9x.
>
> https://foss.intix.eu/solr/2022-12-solr-schema.patch
>
> Any feedback is of course greatly appreciated.
>
> Thanks, and season's greetings!
>
>   - Bram
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
> For additional commands, e-mail: dev-help@solr.apache.org
>
>

Re: Enable different schemas per shard based on core.properties

Posted by Jan Høydahl <ja...@cominvent.com>.
We should rather work towards better documentation and perhaps tooling on what schema updates are "safe" without reindexing. It is NOT safe to change a field's type (or add docValues in-flight). It would be nice if the user were alerted about this when deploying the new schema. Masking the issue by allowing mixed schemas across the cluster is probably not a good idea. While it COULD work for some use cases, other kinds of schema differences will fail loudly further down the road.
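
To make the tooling idea a bit more concrete, here is a minimal sketch of such a check. It is not existing Solr functionality: the `SchemaChangeCheck` class, the `FieldDef` model and the warning texts are all hypothetical, and it only covers the two cases named above (a changed field type and a changed docValues flag).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SchemaChangeCheck {

    /** Hypothetical, minimal model of a schema field; not a Solr class. */
    static final class FieldDef {
        final String type;
        final boolean docValues;

        FieldDef(String type, boolean docValues) {
            this.type = type;
            this.docValues = docValues;
        }
    }

    /** Returns one warning per field whose type or docValues flag differs between the schemas. */
    static List<String> unsafeChanges(Map<String, FieldDef> oldSchema,
                                      Map<String, FieldDef> newSchema) {
        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, FieldDef> entry : oldSchema.entrySet()) {
            FieldDef before = entry.getValue();
            FieldDef after = newSchema.get(entry.getKey());
            if (after == null) {
                continue; // removed fields are out of scope for this sketch
            }
            if (!before.type.equals(after.type)) {
                warnings.add(entry.getKey() + ": type changed, reindex needed");
            }
            if (before.docValues != after.docValues) {
                warnings.add(entry.getKey() + ": docValues changed, reindex needed");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // The case from this thread: docValues switched on for the unique key field.
        Map<String, FieldDef> oldSchema =
                Collections.singletonMap("id", new FieldDef("string", false));
        Map<String, FieldDef> newSchema =
                Collections.singletonMap("id", new FieldDef("string", true));
        unsafeChanges(oldSchema, newSchema).forEach(System.out::println);
    }
}
```

A real check would have to read the deployed and the new schema and cover many more field properties; the point here is only that the alert can be raised at deploy time rather than at query time.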

Jan

> On 3 Jan 2023, at 03:54, Shawn Heisey <ap...@elyograg.org> wrote:
> 
> On 1/2/23 02:29, Bram Van Dam wrote:
>> Mixtures of old and new schemas are not supported here. In fact, this patch was made specifically to prevent that: it allows Solr Cloud to always use the "old" schema for old shards. Simply put the configSet in ZK and refer to it in core.properties.
>> We noticed that once documents with multiple schemas got merged into a single segment, things would break horribly (in this case because some docs had DocValues on the unique field and others did not).
> 
> Some features like grouping appear to require docValues to work at all. If some shards have docValues and some don't, those features are likely to break horribly, or at least not return expected results.
> 
> For everything to work right, whenever a schema change is made that involves docValues at all, a complete index wipe followed by a full reindex is usually required by Lucene.  Without this patch, the breakage when the reindex is not done is a very loud Exception.  I fear that this patch would result in incorrect results being returned without an error for features that use docValues, like grouping, sorting, facets, etc.
> 
> But I do not know enough about the Lucene internals to know if my fear is justified.
> 
> Thanks,
> Shawn
> 




Re: Enable different schemas per shard based on core.properties

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/3/23 02:53, Bram Van Dam wrote:
> For a bit more context: we noticed that without docValues, the unique 
> key field ends up wasting gigabytes of memory per core (in field cache, 
> iirc). One entry per unique identifier. Regardless of whether or not the 
> core is even being hit. With docValues enabled, this is ameliorated. 
> This was pretty painful in large cloud instances with many large shards 
> in memory constrained environments.

Yep, that is exactly the problem that docValues solves.  Instead of the 
data structure that advanced features need being built in the heap, it 
is just read from disk using MMAP.  It still will consume memory, but 
that's in the form of the OS disk cache, which under "normal" 
circumstances can be instantly reclaimed by the OS if it is needed.  The 
docValues data structure is NOT allocated in the Solr heap, so the heap 
can be smaller, which helps GC performance.
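
As a rough illustration of the difference, here is a small sketch using plain java.nio. It is not Lucene or Solr code; the `MmapSketch` class and its method are made up for this example. The values are written to a file and then read back through a memory-mapped buffer, so the bytes are served from the OS page cache instead of being copied into a structure on the Java heap.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {

    /** Writes the values to a temp file, then sums them back through a read-only memory map. */
    static long sumViaMmap(long[] values) {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();

            // Write the values as 8-byte longs (big-endian, the ByteBuffer default).
            ByteBuffer out = ByteBuffer.allocate(values.length * Long.BYTES);
            for (long v : values) {
                out.putLong(v);
            }
            Files.write(file, out.array());

            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                // The mapped region is backed by the file, not by a heap array.
                MappedByteBuffer mapped =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                long sum = 0;
                while (mapped.remaining() >= Long.BYTES) {
                    sum += mapped.getLong();
                }
                return sum;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumViaMmap(new long[] {1L, 2L, 3L})); // prints 6
    }
}
```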

Thanks,
Shawn



Re: Enable different schemas per shard based on core.properties

Posted by Bram Van Dam <br...@intix.eu>.
On 03/01/2023 03.54, Shawn Heisey wrote:
> Some features like grouping appear to require docValues to work at all. 
> If some shards have docValues and some don't, those features are likely 
> to break horribly, or at least not return expected results.

You're probably right in saying that this will break horribly in the 
general case. For our case, it's pretty safe, as the docValues schema 
change was only made on the unique key field, which we never use for 
grouping/faceting/etc.

> I fear that this 
> patch would result in incorrect results being returned without an error 
> for features that use docValues, like grouping, sorting, facets, etc.

I think you're right, and it's probably not wise to merge this into 
Solr. We submitted the patch with the idea that it might help people who 
faced similar issues, but a couple of flashy warning signs would have 
been a good idea.

Thanks for the feedback!

For a bit more context: we noticed that without docValues, the unique 
key field ends up wasting gigabytes of memory per core (in field cache, 
iirc). One entry per unique identifier. Regardless of whether or not the 
core is even being hit. With docValues enabled, this is ameliorated. 
This was pretty painful in large cloud instances with many large shards 
in memory constrained environments.

  - Bram



Re: Enable different schemas per shard based on core.properties

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/2/23 02:29, Bram Van Dam wrote:
> Mixtures of old and new schemas are not supported here. In fact, this 
> patch was made specifically to prevent that: it allows Solr Cloud to 
> always use the "old" schema for old shards. Simply put the configSet in 
> ZK and refer to it in core.properties.
> 
> We noticed that once documents with multiple schemas got merged into a 
> single segment, things would break horribly (in this case because some 
> docs had DocValues on the unique field and others did not).

Some features like grouping appear to require docValues to work at all. 
If some shards have docValues and some don't, those features are likely 
to break horribly, or at least not return expected results.

For everything to work right, whenever a schema change is made that 
involves docValues at all, a complete index wipe followed by a full 
reindex is usually required by Lucene.  Without this patch, the breakage 
when the reindex is not done is a very loud Exception.  I fear that this 
patch would result in incorrect results being returned without an error 
for features that use docValues, like grouping, sorting, facets, etc.

But I do not know enough about the Lucene internals to know if my fear 
is justified.

Thanks,
Shawn



Re: Enable different schemas per shard based on core.properties

Posted by Bram Van Dam <br...@intix.eu>.
Hi Wei,

Mixtures of old and new schemas are not supported here. In fact, this 
patch was made specifically to prevent that: it allows Solr Cloud to 
always use the "old" schema for old shards. Simply put the configSet in 
ZK and refer to it in core.properties.
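
The lookup idea can be sketched roughly as follows. This is not the patch itself, just an illustration: the `CoreConfigSetSketch` class is hypothetical, the core and config set names are invented, and the use of a `configSet` key in core.properties is an assumption made for this example. A core that pins a config set keeps it; a core that pins nothing falls back to the default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class CoreConfigSetSketch {

    /**
     * Returns the config set named in the given core.properties text,
     * or the supplied default when the core does not pin one.
     * The "configSet" key is an assumption made for this sketch.
     */
    static String configSetFor(String corePropertiesText, String defaultConfigSet) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(corePropertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("configSet", defaultConfigSet);
    }

    public static void main(String[] args) {
        // Hypothetical old shard: pinned to the schema without docValues.
        String oldShard = "name=coll_shard1_replica_n1\nconfigSet=schema_without_docvalues\n";
        // Hypothetical new shard: no pin, so it gets the current default.
        String newShard = "name=coll_shard9_replica_n1\n";

        System.out.println(configSetFor(oldShard, "schema_with_docvalues"));
        System.out.println(configSetFor(newShard, "schema_with_docvalues"));
    }
}
```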

We noticed that once documents with multiple schemas got merged into a 
single segment, things would break horribly (in this case because some 
docs had DocValues on the unique field and others did not).

And a happy new year to you too :-)

  - Bram

On 01/01/2023 02.53, Wei wrote:
> Hi Bram,
> 
> Can you explain a bit more about the approach? How does Solr Cloud maintain
> different schemas when updating a mixture of old and new documents in the
> same segment?
> 
> Thanks, and happy new year!
> 
> - Wei
> 
> On Fri, Dec 23, 2022 at 8:21 AM Bram Van Dam <br...@intix.eu> wrote:
> 
>> Greetings,
>>
>> We ran into a pretty hairy problem on 7.7. TL;DR; we had to enable
>> docValues on the unique key field in a large SolrCloud instance, without
>> being able to reindex old data.
>>
>> This kind of worked, by specifying different config sets in
>> core.properties for different shards, where new shards would get the
>> schema from ZK and newly indexed data would (correctly) use DocValues,
>> while old data in older shards remained unaffected.
>>
>> This broke when old data was modified: Solr would use the new schema for
>> the updates, and the index would get corrupted because documents with
>> and without docValues would be mixed in the same segment in the same
>> core, which resulted in errors when retrieving the documents (curiously,
>> not when merging the segments?).
>>
>> The linked patch, by my colleague Danny, allows Solr to use the correct
>> schema when updating data in these old shards (based on the
>> configuration in core.properties).
>>
>> We realize that this is a pretty ugly hack for a rather specific
>> problem. But at the same time, Solr allows for different configSets to
>> be specified for different cores, and this patch sort of improves
>> support for that.
>>
>> This applies cleanly on (the admittedly ancient) branch_7_7. All tests
>> are green, precommit checks are OK.
>>
>> If there is any interest in this patch, we might be able to look into
>> making it available on master or branch_9x.
>>
>> https://foss.intix.eu/solr/2022-12-solr-schema.patch
>>
>> Any feedback is of course greatly appreciated.
>>
>> Thanks, and season's greetings!
>>
>>    - Bram
>>
>>
>>
> 

