Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2016/08/11 23:43:08 UTC

MultiFields#getTerms docs clarification

Hi all.

The docs on MultiFields#getTerms state:

> This method may return null if the field does not exist.

Does this mean:

  (a) The method *will* return null if the field does not exist.

  (b) The method will *not necessarily* return null if the field does not exist.

I think we've seen a situation where it somehow returned non-null, but
then Terms#getMin() returned an empty BytesRef, as if we had asked for
an absent value. I would expect getMin() not to count absent values as
the minimum, if only because, were that the case, I would have
reproduced the same error during development.

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Unsubscribing problems

Posted by Chris Hostetter <ho...@fucit.org>.
Peyman: I'll contact you off list to try and address your specific 
problem.

As a general reminder for all users: If you need help with the mailing 
list, step #1 should be to email the automated help system via 
java-user-help@lucene (identified in the Mailing-List and List-Help mail 
MIME headers of every email).

The automated response will then point you to java-user-owner@lucene to 
contact the human list moderators if you still need additional help.

FWIW: Some additional helpful tips can be found here: 
https://wiki.apache.org/solr/Unsubscribing%20from%20mailing%20lists


: Date: Wed, 7 Sep 2016 11:11:52 -0400
: From: Robust Links <pe...@robustlinks.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Unsubscribing problems
: 
: Hi
: 
: I am not sure who to report this to, but I have tried many times to unsubscribe from the Lucene lists (including java-user@lucene.apache.org) without success. I have sent an unsubscribe email to every list server on this page and got no bounces.
: 
: https://lucene.apache.org/core/discussion.html
: 
: I am however still receiving emails. How does one unsubscribe?
: 
: thank you
: 
: Peyman

-Hoss
http://www.lucidworks.com/



Unsubscribing problems

Posted by Robust Links <pe...@robustlinks.com>.
Hi

I am not sure who to report this to, but I have tried many times to unsubscribe from the Lucene lists (including java-user@lucene.apache.org) without success. I have sent an unsubscribe email to every list server on this page and got no bounces.

https://lucene.apache.org/core/discussion.html

I am however still receiving emails. How does one unsubscribe?

thank you

Peyman

RE: MultiFields#getTerms docs clarification

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

If you have an untokenized StringField and index the "empty token", it will appear in the index. If you are reindexing by hand (parsing the stored fields of your 3.x index), I'd suggest adding a length == 0 check before adding the field.
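
A minimal sketch of the length check suggested above, using only the JDK. The map stands in for the stored fields read back from the old index; `StoredFieldFilter` and `dropEmptyValues` are hypothetical names for illustration, not Lucene API. In a real migration the surviving values would then be added to a new Lucene Document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: models stored fields as name -> values.
final class StoredFieldFilter {
    private StoredFieldFilter() {}

    // Returns a copy of the stored fields without null or zero-length values.
    // Fields left with no values at all are dropped entirely.
    static Map<String, List<String>> dropEmptyValues(Map<String, List<String>> stored) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : stored.entrySet()) {
            List<String> values = new ArrayList<>();
            for (String value : entry.getValue()) {
                if (value != null && value.length() > 0) { // the length == 0 check
                    values.add(value);
                }
            }
            if (!values.isEmpty()) {
                kept.put(entry.getKey(), values);
            }
        }
        return kept;
    }
}
```

Filtering before the field is added means the empty value never reaches the indexer, so no cleanup of the postings is needed afterwards.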

With IndexUpgrader you cannot easily get rid of the field, unless you use a FilterAtomicReader (named FilterLeafReader as of Lucene 5.0) that removes empty tokens and IndexWriter.addIndexes() to rebuild your index.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Wednesday, August 31, 2016 6:33 AM
> To: Lucene Users Mailing List <ja...@lucene.apache.org>
> Subject: Re: MultiFields#getTerms docs clarification
> 
> On Mon, Aug 29, 2016 at 8:23 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > Seems like you need to scrutinize exactly what documents were indexed in
> step 3?
> >
> > How exactly did you copy documents out of the old index?  Note that
> > when Lucene's IndexReader returns a Document, it's not the same
> > Document that was indexed in the first place: it will only have fields
> > that were stored, and it does not store certain metadata about how
> > those field values were indexed.  But I don't see how that alone can
> > lead to indexing an empty string token.
> 
> The root cause is that, apparently, in some older version, we *did*
> index an empty field, which at some point later had already been fixed
> by someone else. I verified that this empty field was in fact present
> in the stored fields for the document before the index was migrated to
> Lucene 5.
> 
> So the only obvious difference then is between Lucene 3 indexing no
> tokens for this field, and Lucene 5 indexing a single empty token?
> 
> I have ended up putting in a migration to delete the spurious empty
> term in the postings as well as deleting the empty field from all the
> documents where it's present.
> 
> TX


Re: MultiFields#getTerms docs clarification

Posted by Trejkaz <tr...@trypticon.org>.
On Mon, Aug 29, 2016 at 8:23 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Seems like you need to scrutinize exactly what documents were indexed in step 3?
>
> How exactly did you copy documents out of the old index?  Note that
> when Lucene's IndexReader returns a Document, it's not the same
> Document that was indexed in the first place: it will only have fields
> that were stored, and it does not store certain metadata about how
> those field values were indexed.  But I don't see how that alone can
> lead to indexing an empty string token.

The root cause is that, apparently, in some older version, we *did*
index an empty field, which at some point later had already been fixed
by someone else. I verified that this empty field was in fact present
in the stored fields for the document before the index was migrated to
Lucene 5.

So the only obvious difference then is between Lucene 3 indexing no
tokens for this field, and Lucene 5 indexing a single empty token?
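
The difference asked about here can be pictured with a small JDK-only model. It is a sketch, not Lucene code: `EmptyTokenModel` and its methods are hypothetical, and the "skip the empty value" behaviour is only the hypothesis from this thread, not something the sketch proves about Lucene 3. A TreeSet of strings stands in for the sorted terms of one field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model, not Lucene code: turns stored values into a sorted set of terms.
final class EmptyTokenModel {
    private EmptyTokenModel() {}

    // Assumed behaviour A: an empty value produces no token at all.
    static List<String> tokensSkippingEmpty(String value) {
        List<String> tokens = new ArrayList<>();
        if (!value.isEmpty()) {
            tokens.add(value);
        }
        return tokens;
    }

    // Assumed behaviour B: the whole value is one token, even when it is empty.
    static List<String> tokensKeepingEmpty(String value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(value);
        return tokens;
    }

    // Builds the sorted term set for one field from its stored values.
    static TreeSet<String> indexAll(List<String> storedValues, boolean keepEmpty) {
        TreeSet<String> terms = new TreeSet<>();
        for (String value : storedValues) {
            terms.addAll(keepEmpty ? tokensKeepingEmpty(value) : tokensSkippingEmpty(value));
        }
        return terms;
    }
}
```

With stored values "", "12" and "30", behaviour A gives a smallest term of "12", while behaviour B gives the empty string, which matches the symptom seen after copying documents into the new index.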

I have ended up putting in a migration to delete the spurious empty
term in the postings as well as deleting the empty field from all the
documents where it's present.

TX



Re: MultiFields#getTerms docs clarification

Posted by Michael McCandless <lu...@mikemccandless.com>.
Seems like you need to scrutinize exactly what documents were indexed in step 3?

How exactly did you copy documents out of the old index?  Note that
when Lucene's IndexReader returns a Document, it's not the same
Document that was indexed in the first place: it will only have fields
that were stored, and it does not store certain metadata about how
those field values were indexed.  But I don't see how that alone can
lead to indexing an empty string token.

Mike McCandless

http://blog.mikemccandless.com


On Sun, Aug 28, 2016 at 7:56 PM, Trejkaz <tr...@trypticon.org> wrote:
> Updating this with newly-obtained info.
>
> 1. The original index was created in Lucene 3.x. In 3.x, if I call
> getMin(), it returns non-empty values. So far so good.
>
> 2. The index then gets migrated to 5.x using multiple IndexUpgrader
> steps. Now, when I call getMin(), it still returns a non-empty value.
>
> 3. At some point, the user performs an operation where we copy
> documents out of the current index into a new index. When we get the
> Document, it has the field in question, even though no value was set
> into the field. This then gets indexed, and when the destination index
> is finally opened, getMin() returns an empty string.
>
> Something doesn't quite add up though.
>
> Surely if we had put an empty string into a field back in 3.x, it
> would have indexed it, and then getMin() would have always returned
> the empty string, but that isn't what we're seeing at all. Even after
> upgrading the index to the 5.x format, getMin() still returns the
> lowest real value. Therefore, it seems reasonable to assume that we
> weren't putting the empty field into the document. But if we didn't
> put it into the document, why is the field now coming back in Lucene
> 5.x?
>
> TX


Re: MultiFields#getTerms docs clarification

Posted by Trejkaz <tr...@trypticon.org>.
Updating this with newly-obtained info.

1. The original index was created in Lucene 3.x. In 3.x, if I call
getMin(), it returns non-empty values. So far so good.

2. The index then gets migrated to 5.x using multiple IndexUpgrader
steps. Now, when I call getMin(), it still returns a non-empty value.

3. At some point, the user performs an operation where we copy
documents out of the current index into a new index. When we get the
Document, it has the field in question, even though no value was set
into the field. This then gets indexed, and when the destination index
is finally opened, getMin() returns an empty string.

Something doesn't quite add up though.

Surely if we had put an empty string into a field back in 3.x, it
would have indexed it, and then getMin() would have always returned
the empty string, but that isn't what we're seeing at all. Even after
upgrading the index to the 5.x format, getMin() still returns the
lowest real value. Therefore, it seems reasonable to assume that we
weren't putting the empty field into the document. But if we didn't
put it into the document, why is the field now coming back in Lucene
5.x?

TX



Re: MultiFields#getTerms docs clarification

Posted by Trejkaz <tr...@trypticon.org>.
On Fri, Aug 12, 2016 at 11:51 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Getting an empty BytesRef back from Terms.getMin() means Lucene thinks you
> indexed an empty (zero length) token.  Lucene (unfortunately) allows this.
> Is it possible you did that?
>
> If not, can you make a test case showing this?

I have no idea how they got it to happen either. I'm hoping a
reproduction will be provided so that I can figure out how it could
have happened and then hopefully be able to make a test case.

The field is some kind of numeric field, so it *should* have either
been absent or contained a number. But maybe there was some past bug
where it really was an empty string.

TX



Re: MultiFields#getTerms docs clarification

Posted by Michael McCandless <lu...@mikemccandless.com>.
Getting an empty BytesRef back from Terms.getMin() means Lucene thinks you
indexed an empty (zero length) token.  Lucene (unfortunately) allows this.
Is it possible you did that?

If not, can you make a test case showing this?
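
To make the two cases concrete (a field with no terms versus a field whose smallest term is empty), here is a small JDK-only model. It is not Lucene's implementation: `ToyTermsIndex` and its methods are hypothetical, and String ordering stands in for Lucene's byte-wise term order. In both orderings an empty term sorts first.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Toy model of per-field term dictionaries; not Lucene's implementation.
final class ToyTermsIndex {
    private final Map<String, TreeSet<String>> termsByField = new HashMap<>();

    // Records one indexed token for a field. An empty token is accepted as-is.
    void index(String field, String token) {
        termsByField.computeIfAbsent(field, f -> new TreeSet<>()).add(token);
    }

    // Returns null when nothing was ever indexed for the field.
    TreeSet<String> getTerms(String field) {
        return termsByField.get(field);
    }

    // Smallest term of the field, or null if the field has no terms.
    String getMin(String field) {
        TreeSet<String> terms = getTerms(field);
        return terms == null ? null : terms.first();
    }
}
```

In this model a field that was never indexed yields null from getTerms, while a field that received a single zero-length token yields a non-null term set whose minimum is the empty string.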

Mike McCandless

http://blog.mikemccandless.com


Re: MultiFields#getTerms docs clarification

Posted by Cristian Lorenzetto <cr...@gmail.com>.
For null values, I added a field whose BytesRef wraps an empty byte array. Maybe
that will resolve it?