
Posted to java-user@lucene.apache.org by Riccardo Tasso <ri...@gmail.com> on 2019/06/05 15:27:47 UTC

IntField to IntPoint

Hello everybody,
I have a (very big) Lucene 4 index whose documents use IntField. That
field needs to be stored and sortable, and I need to search it and run
range queries on it.

I've tried to upgrade it from 4 to 7 with IndexUpgrader, but I observed that
IntFields aren't searchable anymore.

What is the most efficient way to convert the IntFields to IntPoints while
keeping the field stored and sortable?

Thanks,
 Riccardo

Re: IntField to IntPoint

Posted by Erick Erickson <er...@gmail.com>.

> On Jun 5, 2019, at 2:07 PM, Riccardo Tasso <ri...@gmail.com> wrote:
> 
> 
> Considering that the IndexUpgrader will efficiently do the most of the work
> I should investigate how to fill this gap, without reindexing from scratch.
> 
> 

This is actually a problem. IndexUpgrader creates a single massive segment, essentially an optimize. Here are the reasons that’s bad:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

IndexUpgrader does _not_ respect the max segment size even now, so the follow-up article linked from the one above, about how optimize may not be so bad in Solr 7.5+, is irrelevant here.

Apart from deprecated types, textual data is what is most sensitive to changes in how Lucene works. I strongly recommend you bite the bullet and re-index from your system of record.

Best,
Erick
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: IntField to IntPoint

Posted by Trejkaz <tr...@trypticon.org>.

How we would do it:

- update the index format to v7 (this in itself is fiddly
  but there are ways)
- open the migrated index in place:
    - get all the leaf readers and wrap each in a new
      subclass of FilterCodecReader
    - override getPointsReader() on that subclass
      to return a correctly implemented PointsReader,
      which can read the data from the stored fields
    - be careful about the order you return the points
    - you might want to spool the points to a
      database like Derby or H2 since if you have a lot
      of data there is a risk of running out of memory
- copy that whole index to a new index using
  IndexWriter#addIndexes(CodecReader...)
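
A note on the ordering point above, offered as a sketch rather than as the exact method used here: as far as I know, Lucene stores each point as a fixed-width byte array and compares those arrays as unsigned bytes, so an int is written big-endian with its sign bit flipped (NumericUtils.intToSortableBytes does this). Whether that is the ordering meant in the list above is an assumption. The class and method names below are made up for illustration and use only the JDK, not Lucene:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, JDK-only illustration; this is not a Lucene class.
public class SortableIntBytes {

    // Flip the sign bit, then write big-endian, so that unsigned
    // byte-wise comparison agrees with signed int comparison.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Inverse of encode: rebuild the int and flip the sign bit back.
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xFF) << 24)
                    | ((bytes[1] & 0xFF) << 16)
                    | ((bytes[2] & 0xFF) << 8)
                    |  (bytes[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    // Compare two equal-length arrays byte by byte, treating bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        int[] values = {42, -7, 0, Integer.MIN_VALUE, Integer.MAX_VALUE};
        List<byte[]> encoded = new ArrayList<>();
        for (int v : values) {
            encoded.add(encode(v));
        }
        // Sorting the encoded form yields ascending numeric order.
        encoded.sort(SortableIntBytes::compareUnsigned);
        for (byte[] b : encoded) {
            System.out.println(decode(b));
        }
    }
}
```

Running main prints the values in ascending numeric order, negatives first, which is the property a hand-written points reader would have to preserve if it hands out values in sorted order.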

Copying the docs works too if you still have the original text stored, but
we didn’t, so we use this sort of technique for all Lucene migrations.

TX



Re: IntField to IntPoint

Posted by Riccardo Tasso <ri...@gmail.com>.

Ok,
I know this policy, and you explained perfectly why it makes sense.

Anyway, my index is really big and contains mostly textual data, which is
expensive to reindex (because of custom analysis).

Considering that IndexUpgrader will efficiently do most of the work,
I'd like to investigate how to fill this gap without reindexing from scratch.


The most efficient approach I can think of is:
* convert from 4 to 7
* open an index reader and an index writer on the 7 index
* iterate over every document
* read the numeric field (since it's already stored)
* add the IntPoint field to each document
* update the document in the index

I guess the expensive task here is the update, since it will delete and
re-add the document, but in this case I think I will save the analysis costs.

Do you think there's a better way of doing this reindex?

Thanks



Re: IntField to IntPoint

Posted by Erick Erickson <er...@gmail.com>.

Omitting norms and the like only matters for text fields; primitives (numerics, boolean, string) don’t have any of that information.

You really have no choice but to re-index to jump from 4->7. Or, I should say, you’re completely unsupported and you will have to deal with any anomalies. I suppose if the only thing you care about is non-textual data you might be OK, but it's iffy at best.

And you’ll have to play low-level games with Lucene to rewrite the segments with points rather than ints.

Good luck!
Erick

I’ll wager that you’ll be faster to re-index, painful though it may be, than to write custom code to do this.


Re: IntField to IntPoint

Posted by Riccardo Tasso <ri...@gmail.com>.

Thanks Erick for your answer.

Unfortunately, for time reasons I have to migrate the index. Maybe later
we will have the opportunity to reindex.

Our use case is to classify documents in the index with Lucene queries,
hence we're not really interested in ranking or sorting (which could be
relevant for the "norms case"). Do you think that migrating and reindexing
only the numeric fields could compromise the results returned by any query
(term, boolean, range, phrase, prefix)?


Re: IntField to IntPoint

Posted by Erick Erickson <er...@gmail.com>.

You cannot upgrade more than one major version; you must re-index from scratch. There’s a long discussion of why, but basically it’s summed up by this quote from Robert Muir:

“I think the key issue here is Lucene is an index not a database. Because it is a lossy index and does not retain all of the user's data, its not possible to safely migrate some things automagically. In the norms case IndexWriter needs to re-analyze the text ("re-index") and compute stats to get back the value, so it can be re-encoded. The function is y = f(x) and if x is not available its not possible, so lucene can't do it.”
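
To make the y = f(x) point concrete, here is a toy sketch. The encoding below (keep only the position of the highest set bit of the field length) is invented for illustration and is not Lucene's actual norm encoding; the class name LossyNorm is likewise hypothetical. It only shows that once several inputs map to the same stored value, the original cannot be recovered from the index alone:

```java
// Hypothetical, JDK-only illustration of a lossy encoding; not Lucene code.
public class LossyNorm {

    // Made-up encoding: store floor(log2(fieldLength)) instead of the length.
    static int encode(int fieldLength) {
        if (fieldLength < 1) {
            throw new IllegalArgumentException("fieldLength must be >= 1");
        }
        return 31 - Integer.numberOfLeadingZeros(fieldLength);
    }

    public static void main(String[] args) {
        int a = encode(100);
        int b = encode(120);
        // Both lengths land in the same bucket, so the stored value alone
        // cannot tell us which length the document originally had.
        System.out.println(a + " " + b + " collide=" + (a == b));
    }
}
```

If a newer format needed a different function of the original length, it would have to be computed from the source text again, which is what re-indexing does.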

This has always been true; before 8.x it would just fail silently, as you have found: Solr/Lucene starts up but doesn’t work quite as expected. As of Lucene 8.x, Lucene (and therefore Solr) will not even open an index that has _ever_ been touched by Lucene 6.x, no matter what intervening steps have been taken. In general, starting with 8.x, Lucene/Solr X will refuse to open indexes touched by X-2 rather than behave unexpectedly.
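
The rule in the paragraph above can be written down as a small decision function. This is only a model of the rule as stated in this message, not the check Lucene itself performs; VersionRule, Outcome and open are hypothetical names, and the sketch assumes the index version is never newer than the reader:

```java
// Hypothetical, JDK-only model of the compatibility rule described above.
public class VersionRule {

    enum Outcome { OPENS, OPENS_UNSUPPORTED, REFUSED }

    // readerMajor: major version of the Lucene doing the opening.
    // oldestMajor: oldest major version that ever touched the index.
    static Outcome open(int readerMajor, int oldestMajor) {
        if (oldestMajor >= readerMajor - 1) {
            return Outcome.OPENS;              // at most one major version back
        }
        // Two or more major versions back: refused from 8 on,
        // opened but unsupported (may misbehave silently) before that.
        return readerMajor >= 8 ? Outcome.REFUSED : Outcome.OPENS_UNSUPPORTED;
    }

    public static void main(String[] args) {
        System.out.println("7 opening an index first written by 4: " + open(7, 4));
        System.out.println("8 opening an index first written by 6: " + open(8, 6));
        System.out.println("8 opening an index first written by 7: " + open(8, 7));
    }
}
```

In this model the 4 -> 7 case from this thread falls into the OPENS_UNSUPPORTED branch, matching the silent failure described above.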

Best,
Erick
