Posted to java-user@lucene.apache.org by Stuart Goldberg <sg...@fixflyer.com> on 2016/07/08 16:12:41 UTC

Problems Refactoring a Lucene Index

As our software goes through its lifecycle, we sometimes have to alter
existing Lucene indexes. The way I have done that in the past is to open the
existing index for reading, read each Document, modify it and write that
Document to a new index. At the end of the process, I delete the old index
and rename the new index to the old name.

I do not do any tokenizing and use no analyzers.

I recently upgraded from Lucene 3.x to 4.10.4. Now I have the following
problem: Suppose the existing document has 10 fields in it and there's one I
have to modify. I remove that field and re-add it with the new settings.
Then I add the Document in its entirety to the new index. I run into the
following problems:

* I get Exceptions thrown for the fields I don't even touch. That's
because their FieldType has 'tokenized' set to true and it fails because I
am using no analyzers. 'tokenized' is set to true even though when I
originally added the field to the original index I had 'tokenized' set to
false!

* I have LongFields that come back with 'indexed' set to false even
though in the original index they were indexed! This makes the new index not
searchable on these fields and hence unusable.

* I can't even alter 'indexed' for these LongFields because for some
reason the FieldType instance comes back frozen from the IndexReader. Once
frozen, you can't alter it. Even if I create a new FieldType, there is no
way to change the FieldType of a Field.

It seems the returned FieldType contents are kind of random!

I did see in the Javadoc of IndexReader.document() that field metadata is
not returned and that, in fact, a new kind of object such as 'StoredField'
should be returned instead, so there is no pretense of there being any
metadata.

I thought perhaps I could use FieldInfos. But that class returns the same
bogus metadata. What then is the purpose of FieldInfos if the info is
bogus?

Am I not understanding something here? This is not very usable. What can I
do to work around this? Is this a Lucene bug? Oversight?


Re: Problems Refactoring a Lucene Index

Posted by Michael McCandless <lu...@mikemccandless.com>.
It has never worked, though I do think the metadata has changed over time,
so the degree to which it didn't work has changed?

Mike McCandless

http://blog.mikemccandless.com

On Mon, Aug 22, 2016 at 4:41 PM, Stuart Goldberg <sg...@fixflyer.com>
wrote:

> Understood, but did it used to work?

RE: Problems Refactoring a Lucene Index

Posted by Stuart Goldberg <sg...@fixflyer.com>.
Understood, but did it used to work?

 

Stuart M Goldberg

Senior Vice President of Software Development
FIX Flyer LLC
http://www.FIXFlyer.com/

NOTICE TO RECIPIENT: THIS E- MAIL IS MEANT ONLY FOR THE INTENDED RECIPIENT(S) OF THE TRANSMISSION, AND CONTAINS CONFIDENTIAL INFORMATION WHICH IS PROPRIETARY TO FIX FLYER LLC ANY UNAUTHORIZED USE, COPYING, DISTRIBUTION, OR DISSEMINATION IS STRICTLY PROHIBITED. ALL RIGHTS TO THIS INFORMATION IS RESERVED BY FIX FLYER LLC. IF YOU ARE NOT THE INTENDED RECIPIENT, PLEASE CONTACT THE SENDER BY REPLY EMAIL AND PLEASE DELETE THIS E-MAIL FROM YOUR SYSTEM AND DESTROY ANY COPIES.

 

From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Monday, August 22, 2016 4:38 PM
To: Stuart Goldberg <sg...@fixflyer.com>
Cc: Lucene Users <ja...@lucene.apache.org>
Subject: Re: Problems Refactoring a Lucene Index

 


Re: Problems Refactoring a Lucene Index

Posted by Michael McCandless <lu...@mikemccandless.com>.
The design is indeed trappy, and many users have hit the situation you
have, and we have tried to fix this before (to change IndexReader.document
to return a different class than Document), but it didn't "take":
https://issues.apache.org/jira/browse/LUCENE-6971

Have a look at FieldInfo.java to see the metadata it records.

The challenge here is Lucene's schema-less-ness.  For example, on a
document by document basis, you can change how term vectors are indexed,
whether a field is stored, or omits norms, or indexes only docs and not
freqs, etc., for the same field across documents, across segments.

Lucene only stores in FieldInfo what is necessary for it to read the index
files, and does not store metadata beyond that.
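
To illustrate why per-document settings cannot be read back from a per-field record, here is a toy model in plain Java. It is not Lucene's FieldInfo code: DocSettings, FieldSummary and the "OR the flags together" merge rule are all assumptions made for the sake of the example. The point is only that once many documents share one record per field name, two different indexing histories can leave the same record behind.

```java
import java.util.List;

// Toy model, not Lucene code. DocSettings and FieldSummary are hypothetical
// names, and the merge rule below is an assumption chosen for illustration.
final class FieldSummarySketch {

    // What one document asked for when it added a given field.
    static final class DocSettings {
        final boolean termVectors;
        final boolean omitNorms;

        DocSettings(boolean termVectors, boolean omitNorms) {
            this.termVectors = termVectors;
            this.omitNorms = omitNorms;
        }
    }

    // One record per field name, shared by every document using that field.
    static final class FieldSummary {
        boolean anyTermVectors = false;
        boolean anyOmitNorms = false;

        // Fold one document's choices into the shared record.
        void absorb(DocSettings settings) {
            anyTermVectors |= settings.termVectors;
            anyOmitNorms |= settings.omitNorms;
        }
    }

    // Build the shared record for a field from every document that used it.
    static FieldSummary summarize(List<DocSettings> docs) {
        FieldSummary summary = new FieldSummary();
        for (DocSettings settings : docs) {
            summary.absorb(settings);
        }
        return summary;
    }
}
```

In this model, a field where only one of two documents asked for term vectors ends up with the same summary as a field where both did, so nothing read from the summary can tell you what an individual document originally requested.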

Patches welcome :)  We should fix this trap; it's just that doing so is
apparently not so easy.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Aug 22, 2016 at 11:04 AM, Stuart Goldberg <sg...@fixflyer.com>
wrote:

> According to what you are saying there is metadata that is preserved
> correctly. What metadata is that?
>
> And what about the FieldInfo class that purports to give back field
> information. Why have such an API if the data it provides is bogus?

RE: Problems Refactoring a Lucene Index

Posted by Stuart Goldberg <sg...@fixflyer.com>.
Thanks for the quick response.

 

I kind of figured on my own that I had to recreate the document from scratch.

 

But there is something in your response that I don’t understand. You say “Lucene only preserves the metadata it needs for each field”. What does that mean? In my posting I gave examples of metadata returned that is clearly the exact opposite of the metadata that was there when originally indexed.

 

According to what you are saying there is metadata that is preserved correctly. What metadata is that?

 

Not sure if you are just a Lucene guru (I have your Lucene in Action books!) or an actual author/contributor to the code, so my observation might not be appropriately directed at you. But it seems a questionable API design to return a “Document” from the index that has properties described by the Javadoc that give back bogus data.

 

And what about the FieldInfo class that purports to give back field information? Why have such an API if the data it provides is bogus?

 

Stuart M Goldberg

Senior Vice President of Software Development
FIX Flyer LLC
http://www.FIXFlyer.com/


 

From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Monday, August 22, 2016 10:48 AM
To: Lucene Users <ja...@lucene.apache.org>; sgoldberg@fixflyer.com
Subject: Re: Problems Refactoring a Lucene Index

 


Re: Problems Refactoring a Lucene Index

Posted by Michael McCandless <lu...@mikemccandless.com>.
This is unfortunately "by design": Lucene makes no guarantees that the
Document you retrieve from an IndexReader is precisely the same Document
you had indexed.

Lucene only preserves the metadata it needs for each field.

Your only recourse is to create a new Document using your application level
information about which fields are tokenized, indexed, etc.
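
One way to hold that "application level information" is a small schema table kept in your own code, consulted every time a document is copied. The sketch below is plain Java with no Lucene dependency; FieldSpec, PlannedField, exampleSchema and the field names are hypothetical, and it assumes one value per field name for brevity. In a Lucene 4.x migration, each PlannedField would then be turned into a newly constructed field (for example a Field built with a fresh FieldType, or a LongField for numeric values) rather than reusing the field object that came back from the reader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Library-independent sketch. FieldSpec, PlannedField and the field names
// below are hypothetical; none of them are Lucene classes.
final class FieldSchemaSketch {

    // Settings the application has to remember on its own, because the
    // Document read back from an index does not report them reliably.
    static final class FieldSpec {
        final boolean indexed;
        final boolean tokenized;
        final boolean numeric;

        FieldSpec(boolean indexed, boolean tokenized, boolean numeric) {
            this.indexed = indexed;
            this.tokenized = tokenized;
            this.numeric = numeric;
        }
    }

    // One field of the replacement document: stored value plus explicit settings.
    static final class PlannedField {
        final String name;
        final String value;
        final FieldSpec spec;

        PlannedField(String name, String value, FieldSpec spec) {
            this.name = name;
            this.value = value;
            this.spec = spec;
        }
    }

    // Example schema for an application that never tokenizes anything.
    static Map<String, FieldSpec> exampleSchema() {
        Map<String, FieldSpec> schema = new LinkedHashMap<>();
        schema.put("orderId", new FieldSpec(true, false, false));
        schema.put("timestamp", new FieldSpec(true, false, true));
        schema.put("payload", new FieldSpec(false, false, false));
        return schema;
    }

    // Pair each stored value with the settings from the schema, ignoring
    // whatever settings the retrieved document claimed to have.
    static List<PlannedField> rebuild(Map<String, String> storedValues,
                                      Map<String, FieldSpec> schema) {
        List<PlannedField> planned = new ArrayList<>();
        for (Map.Entry<String, String> entry : storedValues.entrySet()) {
            FieldSpec spec = schema.get(entry.getKey());
            if (spec == null) {
                // Fail loudly rather than guess settings for an unknown field.
                throw new IllegalStateException("no schema entry for field " + entry.getKey());
            }
            planned.add(new PlannedField(entry.getKey(), entry.getValue(), spec));
        }
        return planned;
    }
}
```

Throwing on a field with no schema entry is a deliberate choice in this sketch: silently falling back to whatever the reader reported would reintroduce the problem described in this thread.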

Mike McCandless

http://blog.mikemccandless.com
