Posted to solr-user@lucene.apache.org by yaswanth kumar <ya...@gmail.com> on 2020/10/16 18:01:40 UTC

converting string to solr.TextField

I am using Solr 8.2.

Can I change the schema field type from string to solr.TextField
without reindexing?

    <field name="messagetext" type="string" indexed="true" stored="true"/>

The reason is that string has only a 32K char limit, whereas I now need to
store more than 32K.

The contents of this field don't require any analysis or tokenization, but I
need this field in queries as well as in the output fields.

-- 
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanthcse@gmail.com
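
A side note on the 32K figure: as far as I know, the limit comes from Lucene's maximum length for a single indexed term, IndexWriter.MAX_TERM_LENGTH, which is 32766 and is counted in UTF-8 bytes rather than characters. A minimal Python sketch of that check (the helper name fits_in_one_term is made up for illustration; this is not Solr code):

```python
# Lucene's limit on one indexed term, in UTF-8 bytes (IndexWriter.MAX_TERM_LENGTH).
MAX_TERM_BYTES = 32766


def fits_in_one_term(value: str) -> bool:
    """Return True if value could be indexed as a single untokenized term."""
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES


ascii_message = "a" * 32766          # 32766 characters, 32766 bytes
accented_message = "\u00e9" * 20000  # 20000 characters, 40000 bytes in UTF-8

print(fits_in_one_term(ascii_message))     # True
print(fits_in_one_term(accented_message))  # False
```

So a message well under 32K characters can still be rejected if it contains multi-byte characters.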

Re: converting string to solr.TextField

Posted by Walter Underwood <wu...@wunderwood.org>.
In addition, what happens at query time when documents have
been indexed under varying field types? Well, it doesn’t work well.

The full set of steps for uninterrupted searching is:

1. Add the new text field.
2. Reindex to populate that.
3. Switch querying to use the new text field.
4. Change the old string field to indexed="false" stored="false" and/or stop
including that field in search updates and/or populating it with copyField.
5. Reindex again to clean up all occurrences of the old field.
6. Remove the old field from the schema.
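
The schema changes in steps 1, 4 and 6 can be sketched as Schema API commands. This is only a sketch under assumptions: it assumes a managed schema, the new field name "messagetext_txt" and the type "text_general" are placeholders, and each body would be POSTed to /solr/<collection>/schema. The two reindex passes (steps 2 and 5) and the query switch (step 3) happen outside this code.

```python
import json


def schema_migration_payloads(old_field: str, new_field: str, new_type: str) -> list:
    """Return Schema API command bodies for steps 1, 4 and 6, in that order."""
    # Step 1: add the new text field next to the old one.
    add_new = {"add-field": {"name": new_field, "type": new_type,
                             "indexed": True, "stored": True}}
    # Step 4: stop indexing and storing the old string field.
    retire_old = {"replace-field": {"name": old_field, "type": "string",
                                    "indexed": False, "stored": False}}
    # Step 6: remove the old field once nothing references it.
    drop_old = {"delete-field": {"name": old_field}}
    return [add_new, retire_old, drop_old]


# Hypothetical names, for illustration only.
for body in schema_migration_payloads("messagetext", "messagetext_txt", "text_general"):
    print(json.dumps(body))
```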

I just finished this process on two big clusters in prod. We had
created a bunch of extra fields for a series of A/B tests on 
relevance improvements. Those tests were finished, so we 
needed to remove those from the index. It was slightly simpler
because we had already stopped querying those fields.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 16, 2020, at 12:57 PM, David Hastings <ha...@gmail.com> wrote:
> 
> Gotcha, thanks for the explanation. Another small question, if you
> don't mind: when deleting docs, they aren't actually removed, just tagged as
> deleted, and the old field/field type is still in the index until
> merged/optimized as well. Wouldn't that cause almost the same conflicts
> until then?
> 
> On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson <er...@gmail.com>
> wrote:
> 
>> Doesn’t re-indexing a document just delete/replace….
>> 
>> It’s complicated. For the individual document, yes. The problem
>> comes because the field is inconsistent _between_ documents, and
>> segment merging blows things up.
>> 
>> Consider. I have segment1 with documents indexed with the old
>> schema (String in this case). I  change my schema and index the same
>> field as a text type.
>> 
>> Eventually, a segment merge happens and these two segments get merged
>> into a single new segment. How should the field be handled? Should it
>> be defined as String or Text in the new segment? If you convert the docs
>> with a Text definition for the field to String,
>> you’d lose the ability to search for individual tokens. If you convert the
>> String to Text, you don’t have any guarantee that the information is even
>> available.
>> 
>> This is just the tip of the iceberg in terms of trying to change the
>> definition of a field. Take the case of changing the analysis chain:
>> say you use a phonetic filter on a field, then decide to remove it, and
>> do not store the original. Erick might be encoded as “ENXY”, so the
>> original data is simply not there to convert. Ditto removing a
>> stemmer, lowercasing, applying a regex, …
>> 
>> 
>> From Mike McCandless:
>> 
>> "This really is the difference between an index and a database:
>> we do not store, precisely, the original documents.  We store
>> an efficient derived/computed index from them.  Yes, Solr/ES
>> can add database-like behavior where they hold the true original
>> source of the document and use that to rebuild Lucene indices
>> over time.  But Lucene really is just a "search index" and we
>> need to be free to make important improvements with time."
>> 
>> And all that aside, you have to re-index all the docs anyway or
>> your search results will be inconsistent. So leaving aside the
>> impossible task of covering all the possibilities on the fly, it’s
>> better to plan on re-indexing….
>> 
>> Best,
>> Erick
>> 
>> 
>>> On Oct 16, 2020, at 3:16 PM, David Hastings <hastings.recursive@gmail.com> wrote:
>>> 
>>> "If you want to
>>> keep the same field name, you need to delete all of the
>>> documents in the index, change the schema, and reindex."
>>> 
>>> Actually, doesn't re-indexing a document just delete/replace anyway,
>>> assuming the same id?
>>> 
>>> On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>>> 
>>>> Just as a side note,
>>>> 
>>>>> indexed="true"
>>>> If you are storing a 32K message, you probably are not searching it as a
>>>> whole string. So, don't index it. You may also want to mark the field
>>>> as 'large' (and lazy):
>>>> 
>>>> 
>>>> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
>>>> 
>>>> When you are going to make it a text field, you will probably be
>>>> having the same issues as well.
>>>> 
>>>> And honestly, if you are not storing those fields to search, maybe you
>>>> need to reconsider the architecture. Maybe those fields do not need to
>>>> be in Solr at all, but in external systems. Solr (or any search
>>>> system) should not be your system of record since - as the other
>>>> reply showed - some of the answers are "reindex everything".
>>>> 
>>>> Regards,
>>>>  Alex.
>>>> 


Re: converting string to solr.TextField

Posted by Walter Underwood <wu...@wunderwood.org>.
Because Solr is not updating documents. Solr is adding to indexes
of fields. You cannot add a TextField document to a StringField index.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: converting string to solr.TextField

Posted by Erick Erickson <er...@gmail.com>.
Did you read the long explanation in this thread already about
segment merging? If so, can you ask specific questions about
the information in it?

Best,
Erick



Re: converting string to solr.TextField

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/17/2020 6:23 AM, Vinay Rajput wrote:
> That said, one more time I want to come back to the same question: why
> can't Solr/Lucene handle this when we are updating all the documents?
> Let's take a couple of examples :-
> 
> *Ex 1:*
> Let's say I have only 10 documents in my index and all of them are in a
> single segment (Segment 1). Now, I change the schema (update field type in
> this case) and reindex all of them.
> This is what (according to me) should happen internally :-
> 
> 1st update req : Solr will mark 1st doc as deleted and index it again
> (might run the analyser chain based on config)
> 2nd update req : Solr will mark 2nd doc as deleted and index it again
> (might run the analyser chain based on config)
> And so on......
> based on autoSoftCommit/autoCommit configuration, all new documents will be
> indexed and probably flushed to disk as part of new segment (Segment 2)

<snip>

> *Ex 2:*
> I see that it can be an issue if we think about reindexing millions of
> docs. Because in that case, merging can be triggered when indexing is half
> way through, and since there are some live docs in the old segment (with
> old config), things will blow up. Please correct me if I am wrong.

If you could guarantee a few things, you could be sure this will work. 
But it's a serious long shot.

The change in schema might be such that when Lucene tries to merge them, 
it fails because the data in the old segments is incompatible with the 
new segments.  If that happens, then you're sunk ... it won't work at all.

If the merges of old and new segments are successful, then you would 
have to optimize the index after you're done indexing to be SURE there 
were no old documents remaining.  Lucene calls that operation 
"ForceMerge".  This operation is disruptive and can take a very long time.

You would also have to be sure there was no query activity until the 
update/merge is completely done.  Which probably means that you'd want 
to work on a copy of the index in another collection.  And if you're 
going to do that, you might as well start indexing from scratch into a 
new/empty collection.  That would also allow you to continue querying 
the old collection until the new one was ready.
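
One way to keep querying the old collection until the new one is ready is to have clients query an alias and repoint it afterwards. A minimal sketch that only builds the Collections API CREATEALIAS URL (this assumes SolrCloud; the base URL, the alias "messages" and the collection "messages_v2" are made-up names):

```python
from urllib.parse import urlencode


def create_alias_url(solr_base: str, alias: str, collection: str) -> str:
    """Build a CREATEALIAS request that points alias at collection."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# Hypothetical names: repoint the alias once messages_v2 is fully indexed.
print(create_alias_url("http://localhost:8983/solr", "messages", "messages_v2"))
```

Calling CREATEALIAS again with an existing alias name replaces its definition, so the switch looks instantaneous to clients that query the alias.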

Thanks,
Shawn

Re: converting string to solr.TextField

Posted by Vinay Rajput <vi...@gmail.com>.
Sorry to jump into this discussion. I also get confused whenever I see this
strange Solr/Lucene behaviour. Probably, as @Erick said in his talk last
year, this is how it has been designed to avoid many problems that are
hard/impossible to solve.

That said, one more time I want to come back to the same question: why
can't Solr/Lucene handle this when we are updating all the documents?
Let's take a couple of examples:

*Ex 1:*
Let's say I have only 10 documents in my index and all of them are in a
single segment (Segment 1). Now, I change the schema (update field type in
this case) and reindex all of them.
This is what (according to me) should happen internally:

1st update req: Solr will mark the 1st doc as deleted and index it again
(might run the analyser chain based on config)
2nd update req: Solr will mark the 2nd doc as deleted and index it again
(might run the analyser chain based on config)
And so on...
Based on the autoSoftCommit/autoCommit configuration, all new documents will be
indexed and probably flushed to disk as part of a new segment (Segment 2).


Now, whenever segment merging happens (during commit or later in time),
Lucene will create a new segment (Segment 3) and can discard all the docs
present in Segment 1, as there are no live docs in it. And there would *NOT*
be any need to decide between the old config and the new config,
as there is not even a single live document with the old config. Isn't that right?

*Ex 2:*
I see that it can be an issue if we think about reindexing millions of
docs. Because in that case, merging can be triggered when indexing is
halfway through, and since there are some live docs in the old segment (with
the old config), things will blow up. Please correct me if I am wrong.

I am *NOT* a Solr/Lucene expert and have just started learning how things
work internally. In the above examples, I may be wrong in many
places. Can someone confirm whether scenarios like Ex 2 are the reason
that even re-indexing all documents doesn't help if some
incompatible schema changes are made?  Any other insight would also be
helpful.

Thanks,
Vinay


Re: converting string to solr.TextField

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/16/2020 2:36 PM, David Hastings wrote:
> sorry, i was thinking just using the
> <delete><query>*:*</query></delete>
> method for clearing the index would leave them still

In theory, if you delete all documents at the Solr level, Lucene will
delete all the segment files on the next commit, because they are empty.
I have not confirmed with testing whether this actually happens.

It is far safer to use a new index as Erick has said, or to delete the 
index directories completely and restart Solr ... so you KNOW the index 
has nothing in it.
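
For reference, the delete-everything request quoted above also has a JSON form. A minimal sketch that only builds the request (the base URL and the collection name "messages" are placeholders, and nothing here confirms that the segment files really go away afterwards):

```python
import json


def delete_all_request(solr_base: str, collection: str) -> tuple:
    """Return (url, json_body) for a delete-by-query that matches every document."""
    url = f"{solr_base}/{collection}/update?commit=true"
    body = json.dumps({"delete": {"query": "*:*"}})
    return url, body


# Hypothetical collection name; POST body to url with Content-Type: application/json.
url, body = delete_all_request("http://localhost:8983/solr", "messages")
print(url)
print(body)
```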

Thanks,
Shawn

Re: converting string to solr.TextField

Posted by David Hastings <ha...@gmail.com>.
Sorry, I was thinking that just using the
<delete><query>*:*</query></delete>
method for clearing the index would still leave them behind.


Re: converting string to solr.TextField

Posted by Erick Erickson <er...@gmail.com>.
Not sure what you’re asking here. Re-indexing, as I was
using the term, means completely removing the index and
starting over. Or indexing to a new collection. At any
rate, starting from a state where there are _no_ segments.

I’m guessing you’re still thinking that re-indexing without
doing the above will work; it won’t. The way merging works,
it chooses segments based on a number of things, including
the percentage of deleted documents. But there are still _other_
live docs in the segment.

Segment S1 has docs 1, 2, 3, 4 (old definition)
Segment S2 has docs 5, 6, 7, 8 (new definition)

Doc 2 is deleted, and S1 and S2 are merged into S3. The whole
discussion about not being able to do the right thing kicks in.
Should S3 use the new or old definition? Whichever one
it uses is wrong for the other segment. And remember,
Lucene simply _cannot_ “do the right thing” if the data
isn’t there.

What you may be missing is that a segment is a “mini-index”.
The underlying assumption is that all documents in that
segment are produced with the same schema and can be
accessed the same way. My comments about merging
“doing the right thing” are really about transforming docs
so all the docs can be treated the same. Which they can’t
if they were produced with different schemas.
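
The S1/S2 example can be mimicked with a toy model. To be clear, this is not Lucene code and Lucene's real behaviour differs in detail; the Segment class and merge function are invented here purely to illustrate the "one definition per segment" assumption.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    field_type: str  # the single definition every doc in this segment was built with
    live_docs: set   # ids of documents that are not marked deleted


def merge(a: Segment, b: Segment) -> Segment:
    """Merge two toy segments; there is no right answer if their definitions differ."""
    if a.field_type != b.field_type:
        raise ValueError(f"S3 cannot be both {a.field_type} and {b.field_type}")
    return Segment(a.field_type, a.live_docs | b.live_docs)


s1 = Segment("string", {1, 3, 4})   # old definition, doc 2 already deleted
s2 = Segment("text", {5, 6, 7, 8})  # new definition
try:
    merge(s1, s2)
except ValueError as err:
    print(err)  # S3 cannot be both string and text
```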

Robert Muir’s statement is interesting here, built
on Mike McCandless’ comment:

"I think the key issue here is Lucene is an index not a database.
Because it is a lossy index and does not retain all of the user’s
data, its not possible to safely migrate some things automagically.
…. The function is y = f(x) and if x is not available its not 
possible, so lucene can't do it."
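
A tiny illustration of the y = f(x) point, using a made-up analysis chain (whitespace tokenizing plus lowercasing; not Solr's actual analyzers): two different inputs produce the same indexed terms, so the original cannot be recovered from the terms alone.

```python
def analyze(text: str) -> list:
    """Toy analysis chain: whitespace tokenizer followed by lowercasing."""
    return [token.lower() for token in text.split()]


# Two different originals (x) map to the same indexed terms (y).
print(analyze("Solr Is An Index"))
print(analyze("solr is an index"))
print(analyze("Solr Is An Index") == analyze("solr is an index"))  # True
```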

Don’t try to get around this. Prepare to
re-index the entire corpus into a new collection whenever
you change the schema and then maybe use an alias to
seamlessly convert from the user’s perspective. If you
simply cannot re-index from the system-of-record, you have
two choices:

1> Use new collections whenever you need to change the
   schema and “somehow” have the app do different things
   with the new and old collections.

2> Set stored=true for all your source fields (i.e. not
   copyField destinations). You can either roll your own
   program that pulls data from the old collection and sends
   it to the new one, or use the Collections API REINDEXCOLLECTION
   call. But note that it’s specifically called out
   in the docs that all fields must be stored to use the
   API; what happens under the covers is that the
   stored fields are read and sent to the target
   collection.
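
For the second choice, a minimal sketch of building the REINDEXCOLLECTION request (SolrCloud only; the base URL and both collection names are placeholders, and only the name and target parameters are shown):

```python
from urllib.parse import urlencode


def reindex_collection_url(solr_base: str, source: str, target: str) -> str:
    """Build a Collections API REINDEXCOLLECTION request from source into target."""
    params = {"action": "REINDEXCOLLECTION", "name": source, "target": target}
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# Hypothetical names, for illustration only.
print(reindex_collection_url("http://localhost:8983/solr", "messages_v1", "messages_v2"))
```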

In both these cases, Robert’s comment doesn’t apply. Well,
it does apply but “if x is not available” is not the case,
the original _is_ available; it’s the stored data...

I’m over-stating the case somewhat; there are a few changes
that you can get away with by re-indexing all the docs into an
existing index: things like changing from stored=true to
stored=false, adding new fields, deleting fields (although the
meta-data for the field is still kept around) etc.

> On Oct 16, 2020, at 3:57 PM, David Hastings <ha...@gmail.com> wrote:
> 
> Gotcha, thanks for the explanation.  another small question if you
> dont mind, when deleting docs they arent actually removed, just tagged as
> deleted, and the old field/field type is still in the index until
> merged/optimized as well, wouldnt that cause almost the same conflicts
> until then?


Re: converting string to solr.TextField

Posted by David Hastings <ha...@gmail.com>.
Gotcha, thanks for the explanation. Another small question, if you
don't mind: when docs are deleted they aren't actually removed, just tagged as
deleted, and the old field/field type stays in the index until the segments
are merged/optimized. Wouldn't that cause almost the same conflicts
until then?

Re: converting string to solr.TextField

Posted by Erick Erickson <er...@gmail.com>.
Doesn’t re-indexing a document just delete/replace….

It’s complicated. For the individual document, yes. The problem
comes because the field is inconsistent _between_ documents, and
segment merging blows things up.

Consider: I have segment1 with documents indexed with the old
schema (String in this case). I change my schema and index the same
field as a text type, and those documents land in a new segment2.

Eventually, a segment merge happens and these two segments get merged
into a single new segment. How should the field be handled? Should it
be defined as String or Text in the new segment? If you convert the docs
with a Text definition for the field to String, you’d lose the ability to
search for individual tokens. If you convert the String to Text, you don’t
have any guarantee that the information is even available.

This is just the tip of the iceberg in terms of trying to change the
definition of a field. Take the case of changing the analysis chain:
say you use a phonetic filter on a field, then decide to remove it, and
you did not store the original. “Erick” might be encoded as “ENXY”, so the
original data is simply not there to convert. Ditto removing a
stemmer, lowercasing, applying a regex, …
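
A rough sketch of that phonetic case, assuming a made-up field type name
(text_phonetic_sketch) and the stock PhoneticFilterFactory. With
inject="false" only the phonetic codes are indexed, so unless the field is
also stored there is nothing left to rebuild the original terms from:

    <fieldType name="text_phonetic_sketch" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- inject="false": index the encoded form only, not the original token -->
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
      </analyzer>
    </fieldType>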


From Mike McCandless:

"This really is the difference between an index and a database:
 we do not store, precisely, the original documents.  We store 
an efficient derived/computed index from them.  Yes, Solr/ES 
can add database-like behavior where they hold the true original 
source of the document and use that to rebuild Lucene indices 
over time.  But Lucene really is just a "search index" and we 
need to be free to make important improvements with time."

And all that aside, you have to re-index all the docs anyway or
your search results will be inconsistent. So leaving aside the 
impossible task of covering all the possibilities on the fly, it’s
better to plan on re-indexing….

Best,
Erick




Re: converting string to solr.TextField

Posted by David Hastings <ha...@gmail.com>.
"If you want to
keep the same field name, you need to delete all of the
documents in the index, change the schema, and reindex."

Actually, doesn't re-indexing a document just delete/replace it anyway,
assuming the same id?


Re: converting string to solr.TextField

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Just as a side note,

> indexed="true"
If you are storing a 32K message, you probably are not searching it as a
whole string. So, don't index it. You may also want to mark the field
as 'large' (and lazy):
https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
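
As a sketch only (not tested against 8.2), a stored-only variant of the
field from the original post might look like the following. The
docValues="false" part is an assumption about how the "string" type is
configured in your schema; per the page linked above, large requires
stored="true" and multiValued="false". Changing these attributes still
means reindexing.

    <!-- hypothetical: returned in results, but no longer searchable -->
    <field name="messagetext" type="string" indexed="false" docValues="false"
           stored="true" large="true" multiValued="false"/>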

If you make it a text field, you will probably run into the same
issues.

And honestly, if you are not storing those fields to search, maybe you
need to consider the architecture. Maybe those fields do not need to
be in Solr at all, but in external systems. Solr (or any search
system) should not be your system of record since - as the other
reply showed - some of the answers are "reindex everything".

Regards,
   Alex.


Re: converting string to solr.TextField

Posted by Walter Underwood <wu...@wunderwood.org>.
No. The data is already indexed as a StringField.

You need to make a new field and reindex. If you want to 
keep the same field name, you need to delete all of the 
documents in the index, change the schema, and reindex.
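
A minimal sketch of the "new field" route. The names text_sketch and
messagetext_text are placeholders, and the analyzer is only an assumption;
pick whatever suits the data:

    <fieldType name="text_sketch" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- existing field, left alone until queries have moved over -->
    <field name="messagetext" type="string" indexed="true" stored="true"/>
    <!-- new field, populated by the reindex -->
    <field name="messagetext_text" type="text_sketch" indexed="true" stored="true"/>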

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
