Posted to solr-user@lucene.apache.org by Dotan Cohen <do...@gmail.com> on 2013/05/28 11:21:05 UTC

What exactly happens to extant documents when the schema changes?

When adding or removing a text field to/from the schema and then
restarting Solr, what exactly happens to extant documents? Is the
schema only consulted when Solr writes a document, therefore extant
documents are unaffected?

Considering that Solr supports dynamic fields, my experimentation with
removing and adding fields to the schema has shown almost no change in
the extant index results returned.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: What exactly happens to extant documents when the schema changes?

Posted by Dotan Cohen <do...@gmail.com>.

On Wed, May 29, 2013 at 5:09 PM, Shawn Heisey <so...@elyograg.org> wrote:
> I handle this in a very specific way with my sharded index.  This won't
> work for all designs, and the precise procedure won't work for SolrCloud.
>
> There is a 'live' and a 'build' core for each of my shards.  When I want
> to reindex, the program makes a note of my current position for deletes,
> reinserts, and new documents.  Then I use a DIH full-import from mysql
> into the build cores.  Once the import is done, I run the update cycle
> of deletes, reinserts, and new documents on those build cores, using the
> position information noted earlier.  Then I swap the cores so the new
> index is online.
>

I do need to examine sharding and multiple cores. I'll look into that,
thank you. By the way, don't google for DIH! It took me some time to
figure out that it is DataImportHandler, as some people use the
acronym for something completely different.


> To adapt this for SolrCloud, I would need to use two collections, and
> update a collection alias for what is considered live.
>
> To control the I/O and CPU usage, you might need some kind of throttling
> in your update/rebuild application.
>
> I don't need any throttling in my design.  Because I'm using DIH, the
> import only uses a single thread for each shard on the server.  I've got
> RAID10 for storage and half of the CPU cores are still available for
> queries, so it doesn't overwhelm the server.
>
> The rebuild does lower performance, so I have the other copy of the
> index handle queries while the rebuild is underway.  When the rebuild is
> done on one copy, I run it again on the other copy.  Right now I'm
> half-upgraded -- one copy of my index is version 3.5.0, the other is
> 4.2.1.  Switching to SolrCloud with sharding and replication would
> eliminate this flexibility, unless I maintained two separate clouds.
>

Thank you. I am not using Solr Cloud but if I ever consider it, then I
will keep this in mind.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: What exactly happens to extant documents when the schema changes?

Posted by Shawn Heisey <so...@elyograg.org>.

On 5/29/2013 1:07 AM, Dotan Cohen wrote:
> In the case of this particular application, reindexing really is
> overly burdensome as the application is performing hundreds of writes
> to the index per minute. How might I gauge how much spare I/O Solr
> could commit to a reindex? All the data that I need is in fact in
> stored fields.
> 
> Note that because the social media application that feeds our Solr
> index is global, there are no 'off hours'.

I handle this in a very specific way with my sharded index.  This won't
work for all designs, and the precise procedure won't work for SolrCloud.

There is a 'live' and a 'build' core for each of my shards.  When I want
to reindex, the program makes a note of my current position for deletes,
reinserts, and new documents.  Then I use a DIH full-import from mysql
into the build cores.  Once the import is done, I run the update cycle
of deletes, reinserts, and new documents on those build cores, using the
position information noted earlier.  Then I swap the cores so the new
index is online.

To adapt this for SolrCloud, I would need to use two collections, and
update a collection alias for what is considered live.
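
As a rough illustration of the swap step, the sketch below builds the
request URLs for the CoreAdmin SWAP action and, for the SolrCloud
variant, the Collections API CREATEALIAS action. The base URL, the core
names "live" and "build", and the collection name "posts_v2" are
assumptions invented for the example, and nothing is sent to Solr here.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed host and port


def swap_cores_url(core, other, base=SOLR_BASE):
    """URL for CoreAdmin SWAP: exchanges the names of two cores."""
    query = urlencode({"action": "SWAP", "core": core, "other": other})
    return f"{base}/admin/cores?{query}"


def create_alias_url(alias, collection, base=SOLR_BASE):
    """URL for Collections API CREATEALIAS: points an alias at a collection."""
    query = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )
    return f"{base}/admin/collections?{query}"


if __name__ == "__main__":
    # Hypothetical names; substitute the real core/collection names.
    print(swap_cores_url("live", "build"))
    print(create_alias_url("live", "posts_v2"))
```

In either setup, queries keep addressing the same name ("live") before
and after the switch; only what that name refers to changes.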

To control the I/O and CPU usage, you might need some kind of throttling
in your update/rebuild application.
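
One simple form such throttling could take is to cap the average
number of documents sent per second. This is only a sketch under that
assumption: throttled_send, its parameters, and the send callback are
hypothetical names, and a real application would replace send with
whatever posts a batch to Solr.

```python
import time


def throttled_send(docs, batch_size, max_docs_per_sec, send,
                   sleep=time.sleep, clock=time.monotonic):
    """Send docs in batches, pausing so the average rate stays at or
    below max_docs_per_sec. Returns the number of documents sent."""
    start = clock()
    sent = 0
    batch = []

    def flush():
        nonlocal sent, batch
        send(batch)
        sent += len(batch)
        batch = []
        # Earliest moment at which `sent` documents may have gone out.
        target = start + sent / max_docs_per_sec
        delay = target - clock()
        if delay > 0:
            sleep(delay)

    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            flush()
    if batch:
        flush()  # send the final partial batch
    return sent
```

Lowering max_docs_per_sec trades a longer rebuild for less I/O and CPU
pressure on the server that is also answering queries.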

I don't need any throttling in my design.  Because I'm using DIH, the
import only uses a single thread for each shard on the server.  I've got
RAID10 for storage and half of the CPU cores are still available for
queries, so it doesn't overwhelm the server.

The rebuild does lower performance, so I have the other copy of the
index handle queries while the rebuild is underway.  When the rebuild is
done on one copy, I run it again on the other copy.  Right now I'm
half-upgraded -- one copy of my index is version 3.5.0, the other is
4.2.1.  Switching to SolrCloud with sharding and replication would
eliminate this flexibility, unless I maintained two separate clouds.

Thanks,
Shawn

Re: What exactly happens to extant documents when the schema changes?

Posted by Dotan Cohen <do...@gmail.com>.

On Tue, May 28, 2013 at 3:58 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> The technical answer: Undefined and not guaranteed.
>

I was afraid of that!

> Sure, you can experiment and see what the effects "happen" to be in any
> given release, and maybe they don't tend to change (too much) between most
> releases, but there is no guarantee that any given "change schema but keep
> existing data without a delete of directory contents and full reindex" will
> actually be benign or what you expect.
>
> As a general proposition, when it comes to changing the schema and not
> deleting the directory and doing a full reindex, don't do it! Of course, we
> all know not to try to walk on thin ice, but a lot of people will try to do
> it anyway - and maybe it happens that most of the time the results are
> benign.
>

In the case of this particular application, reindexing really is
overly burdensome as the application is performing hundreds of writes
to the index per minute. How might I gauge how much spare I/O Solr
could commit to a reindex? All the data that I need is in fact in
stored fields.
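
If everything really is stored, one possible route is to read the
documents back out of Solr and feed them to a rebuilt index. A minimal
sketch of the per-document cleanup that would likely be needed first,
assuming that the internal _version_ field must be dropped (as I
understand it, a positive _version_ asks Solr to require an existing
document with that exact version) and that stored copyField
destinations must be dropped so their values are not added twice. The
function name and the field names are hypothetical.

```python
def to_update_doc(stored_doc, copy_targets=()):
    """Return a copy of a document fetched from Solr that is safe to
    send back as an update: no _version_, no copyField destinations."""
    skip = {"_version_", *copy_targets}
    return {name: value for name, value in stored_doc.items()
            if name not in skip}


if __name__ == "__main__":
    fetched = {"id": "1", "text": "hello", "text_all": "hello",
               "_version_": 1436000000000000000}
    print(to_update_doc(fetched, copy_targets=["text_all"]))
```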

Note that because the social media application that feeds our Solr
index is global, there are no 'off hours'.


> OTOH, you could file a Jira to propose that the effects of changing the
> schema but keeping the existing data should be precisely defined and
> documented, but, that could still change from release to release.
>

Seems like a lot of effort to document, for little benefit. I'm not
going to file it. I would like to know, though, is the schema
consulted at index time, query time, or both?


> From a practical perspective for your original question: If you suddenly add
> a field, there is no guarantee what will happen when you try to access that
> field for existing documents, or what will happen if you "update" existing
> documents. Sure, people can talk about what "happens to be true today", but
> there is no guarantee for the future. Similarly for deleting a field from
> the schema, there is no guarantee about the status of existing data, even
> though people can chatter about "what it seems to do today."
>
> Generally, you should design your application around contracts and what is
> guaranteed to be true, not what happens to be true from experiments or even
> experience. Granted, that is the theory and sometimes you do need to rely on
> experimentation and folklore and spotty or ambiguous documentation, but to
> the extent possible, it is best to avoid explicitly trying to rely on
> undocumented, uncontracted behavior.
>

Thanks. The application does change (added features) and we do not
want to lose old data.


> One question I asked long ago and never received an answer: what is the best
> practice for doing a full reindex - is it sufficient to first do a delete of
> "*:*", or does the Solr index directory contents or even the directory
> itself need to be explicitly deleted first? I believe it is the latter, but
> the former "seems" to work, most of the time. Deleting the directory itself
> "seems" to be the best answer, to date - but no guarantees!
>

I don't have an answer for that, sorry!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: What exactly happens to extant documents when the schema changes?

Posted by Jack Krupansky <ja...@basetechnology.com>.

The technical answer: Undefined and not guaranteed.

Sure, you can experiment and see what the effects "happen" to be in any 
given release, and maybe they don't tend to change (too much) between most 
releases, but there is no guarantee that any given "change schema but keep 
existing data without a delete of directory contents and full reindex" will 
actually be benign or what you expect.

As a general proposition, when it comes to changing the schema and not 
deleting the directory and doing a full reindex, don't do it! Of course, we 
all know not to try to walk on thin ice, but a lot of people will try to do 
it anyway - and maybe it happens that most of the time the results are 
benign.

OTOH, you could file a Jira to propose that the effects of changing the 
schema but keeping the existing data should be precisely defined and 
documented, but, that could still change from release to release.

From a practical perspective for your original question: If you suddenly add
a field, there is no guarantee what will happen when you try to access that 
field for existing documents, or what will happen if you "update" existing 
documents. Sure, people can talk about what "happens to be true today", but 
there is no guarantee for the future. Similarly for deleting a field from 
the schema, there is no guarantee about the status of existing data, even 
though people can chatter about "what it seems to do today."

Generally, you should design your application around contracts and what is 
guaranteed to be true, not what happens to be true from experiments or even 
experience. Granted, that is the theory and sometimes you do need to rely on 
experimentation and folklore and spotty or ambiguous documentation, but to 
the extent possible, it is best to avoid explicitly trying to rely on 
undocumented, uncontracted behavior.

One question I asked long ago and never received an answer: what is the best 
practice for doing a full reindex - is it sufficient to first do a delete of 
"*:*", or does the Solr index directory contents or even the directory 
itself need to be explicitly deleted first? I believe it is the latter, but 
the former "seems" to work, most of the time. Deleting the directory itself 
"seems" to be the best answer, to date - but no guarantees!

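For reference, the first option (delete of "*:*") is an ordinary
delete-by-query posted to the update handler. The sketch below only
builds the XML body and the URL; the base URL and the core name
"collection1" are assumptions, and it does not settle whether this is
sufficient compared with removing the index directory.

```python
from xml.sax.saxutils import escape

SOLR_BASE = "http://localhost:8983/solr"  # assumed host and port


def delete_by_query_body(query="*:*"):
    """XML update message that deletes every document matching query."""
    return f"<delete><query>{escape(query)}</query></delete>"


def update_url(core, base=SOLR_BASE, commit=True):
    """URL of the core's update handler, optionally committing at once."""
    url = f"{base}/{core}/update"
    return url + "?commit=true" if commit else url


if __name__ == "__main__":
    # POST the body to the URL with Content-Type: text/xml.
    print(update_url("collection1"))
    print(delete_by_query_body())
```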

-- Jack Krupansky

Re: What exactly happens to extant documents when the schema changes?

Posted by Dotan Cohen <do...@gmail.com>.

On Tue, May 28, 2013 at 2:20 PM, Upayavira <uv...@odoko.co.uk> wrote:
> The schema provides Solr with a description of what it will find in the
> Lucene indexes. If you, for example, changed a string field to an
> integer in your schema, that'd mess things up bigtime. I recently had to
> upgrade a date field from the 1.4.1 date field format to the newer
> TrieDateField. Given I had to do it on a live index, I had to add a new
> field (just using copyField) and re-index over the top, as the old field
> was still in use. I guess, given my app now uses the new date field
> only, I could presumably reindex the old date field with the new
> TrieDateField format, but I'd want to try that before I do it for real.
>

Thank you for the insight. Unfortunately, with 20 million records and
growing by hundreds each minute (social media posts) I don't see that
I could ever reindex the data in a timely way.


> However, if you changed a single valued field to a multi-valued one,
> that's not an issue, as a field with a single value is still valid for a
> multi-valued field.
>
> Also, if you add a new field, existing documents will be considered to
> have no value in that field. If that is acceptable, then you're fine.
>
> I guess if you remove a field, then those fields will be ignored by
> Solr, and thus not impact anything. But I have to say, I've never tried
> that.
>
> Thus - changing the schema will only impact on future indexing. Whether
> your existing index will still be valid depends upon the changes you are
> making.
>
> Upayavira

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com

Re: What exactly happens to extant documents when the schema changes?

Posted by Upayavira <uv...@odoko.co.uk>.

On Tue, May 28, 2013, at 10:21 AM, Dotan Cohen wrote:
> When adding or removing a text field to/from the schema and then
> restarting Solr, what exactly happens to extant documents? Is the
> schema only consulted when Solr writes a document, therefore extant
> documents are unaffected?
> 
> Considering that Solr supports dynamic fields, my experimentation with
> removing and adding fields to the schema has shown almost no change in
> the extant index results returned.

The schema provides Solr with a description of what it will find in the
Lucene indexes. If you, for example, changed a string field to an
integer in your schema, that'd mess things up bigtime. I recently had to
upgrade a date field from the 1.4.1 date field format to the newer
TrieDateField. Given I had to do it on a live index, I had to add a new
field (just using copyField) and re-index over the top, as the old field
was still in use. I guess, given my app now uses the new date field
only, I could presumably reindex the old date field with the new
TrieDateField format, but I'd want to try that before I do it for real.

However, if you changed a single valued field to a multi-valued one,
that's not an issue, as a field with a single value is still valid for a
multi-valued field.

Also, if you add a new field, existing documents will be considered to
have no value in that field. If that is acceptable, then you're fine.
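
One way to see which existing documents lack the new field is a
negated open-ended range query, which matches documents with no value
in that field. A small sketch, assuming the new field is indexed; the
field name "new_field", the core name, and the base URL are made up
for the example.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed host and port


def missing_field_params(field, rows=0):
    """Query parameters matching documents with no value in field."""
    return {
        "q": "*:*",
        "fq": f"-{field}:[* TO *]",  # exclude docs that have any value
        "rows": str(rows),           # rows=0: only the count is wanted
        "wt": "json",
    }


def select_url(core, params, base=SOLR_BASE):
    """URL of the core's select handler with the given parameters."""
    return f"{base}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(select_url("collection1", missing_field_params("new_field")))
```

The numFound value in the response would then be the number of
documents indexed before the field was added (or since then without a
value for it).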

I guess if you remove a field, then those fields will be ignored by
Solr, and thus not impact anything. But I have to say, I've never tried
that.

Thus - changing the schema will only impact on future indexing. Whether
your existing index will still be valid depends upon the changes you are
making.

Upayavira