Posted to dev@lucene.apache.org by "Benson Margulies (Created) (JIRA)" <ji...@apache.org> on 2012/03/06 15:52:57 UTC

[jira] [Created] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Non-tokenized fields become tokenized when a document is deleted and added back
-------------------------------------------------------------------------------

                 Key: LUCENE-3854
                 URL: https://issues.apache.org/jira/browse/LUCENE-3854
             Project: Lucene - Java
          Issue Type: Bug
          Components: core/index
    Affects Versions: 4.0
            Reporter: Benson Margulies


https://github.com/bimargulies/lucene-4-update-case is a JUnit test case that seems to show a problem with the current trunk. It creates a document with a Field typed as StringField.TYPE_STORED and a value containing a "-". A TermQuery can find the value initially, since the field is not tokenized.

Then the test case reads the Document back out through a reader. In the copy of the Document that is read out, the Field now has the tokenized bit turned on.

Next, the test case deletes and re-adds the Document. The 'tokenized' bit is respected, so the field now gets tokenized, and the result is that the query on the term containing the "-" no longer works.

So I think the defect is in the code that reconstructs the Document when it is read from the index and turns the tokenized bit on.
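
For reference, here is a minimal sketch of the recipe the test case appears to follow. The field name "id", the value "FOO-BAR", and the use of StandardAnalyzer are illustrative assumptions rather than details taken from the linked test, and exact 4.0 trunk signatures may differ slightly (Lucene and JUnit imports omitted):

{code:java}
Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));

Document doc = new Document();
doc.add(new Field("id", "FOO-BAR", StringField.TYPE_STORED)); // not tokenized
writer.addDocument(doc);
writer.commit();

IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Query q = new TermQuery(new Term("id", "FOO-BAR"));
TopDocs hits = searcher.search(q, 10);
assertEquals(1, hits.totalHits); // found: the whole value is a single untokenized term

// Read the document back; the retrieved copy has the tokenized bit turned on.
Document copy = searcher.doc(hits.scoreDocs[0].doc);

// Delete and re-add the retrieved copy.
writer.deleteDocuments(new Term("id", "FOO-BAR"));
writer.addDocument(copy); // the copy's FieldType now claims tokenized == true
writer.commit();
reader.close();

reader = DirectoryReader.open(dir);
searcher = new IndexSearcher(reader);
assertEquals(0, searcher.search(q, 10).totalHits); // "FOO-BAR" was analyzed into "foo" and "bar"
{code}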

[jira] [Updated] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Benson Margulies (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LUCENE-3854:
-------------------------------------

    Description: 
https://github.com/bimargulies/lucene-4-update-case is a JUnit test case that seems to show a problem with the current trunk. It creates a document with a Field typed as StringField.TYPE_STORED and a value containing a "-". A TermQuery can find the value initially, since the field is not tokenized.

Then the test case reads the Document back out through a reader. In the copy of the Document that is read out, the Field now has the tokenized bit turned on.

Next, the test case deletes and re-adds the Document. The 'tokenized' bit is respected, so the field now gets tokenized, and the result is that the query on the term containing the "-" no longer works.

So I think the defect is in the code that reconstructs the Document when it is read from the index and turns the tokenized bit on.

I have an ICLA on file so you can take this code from GitHub, but if you prefer I can also attach it here.


[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223364#comment-13223364 ] 

Robert Muir commented on LUCENE-3854:
-------------------------------------

{quote}
though separate classes for input / output documents would be better. Solr uses SolrInputDocument for input and SolrDocument for output, and obviously they are not interchangeable.
{quote}

+1
                
[jira] [Issue Comment Edited] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Andrzej Bialecki (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223309#comment-13223309 ] 

Andrzej Bialecki  edited comment on LUCENE-3854 at 3/6/12 3:09 PM:
-------------------------------------------------------------------

I suspect the problem lies in DocumentStoredFieldVisitor.stringField(...). It uses FieldInfo to populate the FieldType of the retrieved field, and there is no information there about tokenization (so it assumes true by default). AFAIK the information about tokenization is lost once the document is indexed, so it's not possible to retrieve it, hence the use of a default value.

(Mike said the same while I was typing this comment ;) ).
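
Roughly the shape of the code in question, paraphrased from memory rather than copied from trunk (accessor names follow the released 4.x API and may not match current trunk exactly):

{code:java}
// Paraphrase only, not the exact source. 'doc' is the Document the visitor is building.
// The FieldType is seeded from TextField.TYPE_STORED, whose tokenized flag is true, and
// FieldInfo records nothing about tokenization, so the default survives into the
// reconstructed Document.
public void stringField(FieldInfo fieldInfo, String value) {
  FieldType ft = new FieldType(TextField.TYPE_STORED); // ft.tokenized() == true here
  ft.setIndexed(fieldInfo.isIndexed());
  ft.setOmitNorms(fieldInfo.omitsNorms());
  ft.setStoreTermVectors(fieldInfo.hasVectors());
  doc.add(new Field(fieldInfo.name, value, ft));        // tokenized bit still set
}
{code}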
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Hoss Man (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223402#comment-13223402 ] 

Hoss Man commented on LUCENE-3854:
----------------------------------

I tried arguing a long time ago that IndexReader.document(...) should return "Map<String,String[]>", since none of the Document/Field object metadata makes sense at "read" time ... never got any buy-in from anybody else.
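
To make the idea concrete, here is a hypothetical helper (not an existing API; the name is invented) that flattens a retrieved document into that shape, keeping only the stored values:

{code:java}
// Hypothetical sketch: expose stored fields as name -> values, with none of the
// misleading Document/Field metadata attached.
static Map<String, String[]> storedFieldsAsMap(IndexReader reader, int docID) throws IOException {
  Map<String, List<String>> tmp = new HashMap<String, List<String>>();
  for (IndexableField f : reader.document(docID)) { // Document is Iterable<IndexableField>
    String v = f.stringValue();
    if (v == null) {
      continue; // skip fields with no string value
    }
    List<String> vals = tmp.get(f.name());
    if (vals == null) {
      tmp.put(f.name(), vals = new ArrayList<String>());
    }
    vals.add(v);
  }
  Map<String, String[]> out = new HashMap<String, String[]>();
  for (Map.Entry<String, List<String>> e : tmp.entrySet()) {
    out.put(e.getKey(), e.getValue().toArray(new String[e.getValue().size()]));
  }
  return out;
}
{code}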
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223337#comment-13223337 ] 

Uwe Schindler commented on LUCENE-3854:
---------------------------------------

In my opinion the Document for *indexing* should be different from the document *retrieved from stored fields* (I have been arguing about this all along).

One simple solution:
When a field is loaded from the index using StoredFieldsVisitor, let's set an internal flag in the document/field instances (e.g. by a pkg-private ctor of Document), so that when you try to re-add such a loaded document to IndexWriter you get an exception. Very simple, and a good solution for now.

But I agree with Robert: the Document/Field API is messy and trappy in that regard.
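
A hypothetical sketch of what that guard could look like, with invented names and heavily simplified classes (nothing like this exists in the codebase):

{code:java}
// Hypothetical only: mark documents that were materialized from stored fields,
// and refuse to re-index them.
final class Document {
  private final boolean loadedFromStoredFields;

  Document() {
    this(false);
  }

  // pkg-private ctor, used only by the stored-fields visitor
  Document(boolean loadedFromStoredFields) {
    this.loadedFromStoredFields = loadedFromStoredFields;
  }

  boolean wasLoadedFromStoredFields() {
    return loadedFromStoredFields;
  }
}

final class IndexWriter {
  void addDocument(Document doc) {
    if (doc.wasLoadedFromStoredFields()) {
      throw new IllegalArgumentException(
          "This Document was loaded from stored fields; rebuild it with the "
          + "original FieldTypes before re-indexing it");
    }
    // ... normal indexing path ...
  }
}
{code}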
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223348#comment-13223348 ] 

Michael McCandless commented on LUCENE-3854:
--------------------------------------------

bq. FWIW, there are blog posts out there with more or less the recipe I followed to get into this pickle.

Sigh :(  Bad bad trap.

{quote}
When a field is loaded from the index using StoredFieldsVisitor, let's set an internal flag in the document/field instances (e.g. by a pkg-private ctor of Document), so that when you try to re-add such a loaded document to IndexWriter you get an exception. Very simple, and a good solution for now.
{quote}

+1
                
[jira] [Issue Comment Edited] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Uwe Schindler (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223337#comment-13223337 ] 

Uwe Schindler edited comment on LUCENE-3854 at 3/6/12 3:40 PM:
---------------------------------------------------------------

In my opinion the Document for *indexing* should be different from the document *retrieved from stored fields* (I have been arguing about this all along).

One simple solution:
When a field is loaded from the index using StoredFieldsVisitor, let's set an internal flag in the document/field instances (e.g. by a pkg-private ctor of Document), so that when you try to re-add such a loaded document to IndexWriter you get an exception. Very simple, and a good solution for now.

But I agree with Robert: the Document/Field API is messy and trappy in that regard.
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "András Péteri (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293446#comment-13293446 ] 

András Péteri commented on LUCENE-3854:
---------------------------------------

Isn't this considered a regression from 3.x? In 3.6.0 I see an additional byte being read from the stream in FieldsReader, which contains bits that allow the reader to reconstruct the field's Index enum correctly. This should make it possible to properly update a document in which all fields were stored, with the exception of boost values (and those could be stored redundantly in a field as well to overcome this limitation).
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223470#comment-13223470 ] 

Andrzej Bialecki  commented on LUCENE-3854:
-------------------------------------------

bq. There isn't a better way to attack that problem in 4.0, is there?

Not yet - LUCENE-3837 is still in early stages.
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223361#comment-13223361 ] 

Andrzej Bialecki  commented on LUCENE-3854:
-------------------------------------------

+1, though separate classes for input / output documents would be better. Solr uses SolrInputDocument for input and SolrDocument for output, and obviously they are not interchangeable.
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223305#comment-13223305 ] 

Michael McCandless commented on LUCENE-3854:
--------------------------------------------

OK I see the problem... it's not a bug, but a looongstanding trap in Lucene: you cannot retrieve a Document (from the IR.document API) and expect it to accurately reflect what you had indexed. Information is lost, e.g. whether each field was tokenized or not, what the document boost was; fields that were not stored are missing; etc. In this particular case, IR.document will enable "tokenized" for each text field it loads, which then causes the test failure.

This is a bad trap, since it tricks you into thinking you can load a stored document and reindex it; instead, you have to re-create a new Document with the correct details on how it should be indexed.

Really, IR.document should not even return a Document/Field.
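
To spell the safe pattern out as code (a sketch only, assuming a single stored "id" StringField; a real application would rebuild every field from whatever schema knowledge it has):

{code:java}
// Don't re-add the Document returned by IR.document(); rebuild one with the
// FieldTypes the application knows it originally indexed with.
Document stored = reader.document(docId);

Document fresh = new Document();
fresh.add(new Field("id", stored.get("id"), StringField.TYPE_STORED)); // not tokenized
// ... rebuild any other fields the same way, from stored values or from the source data ...

// updateDocument is an atomic delete-by-term + add.
writer.updateDocument(new Term("id", stored.get("id")), fresh);
{code}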
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Benson Margulies (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223461#comment-13223461 ] 

Benson Margulies commented on LUCENE-3854:
------------------------------------------

Notes:

1) The trap opened a bit wider in 4.0 with the removal of IndexReader.deleteDocument. I'm not sure I exactly understand how, but when we deleted through the reader we didn't hit this.

2) I got into this because what I really wanted was to update a field value. There isn't a better way to attack that problem in 4.0, is there?

                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Benson Margulies (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223330#comment-13223330 ] 

Benson Margulies commented on LUCENE-3854:
------------------------------------------

FWIW, there are blog posts out there with more or less the recipe I followed to get into this pickle.

Do you want to keep this open for nulling some things in IR.document()? Obviously, not returning a Document at all would be a bit on the violent side.
                
[jira] [Commented] (LUCENE-3854) Non-tokenized fields become tokenized when a document is deleted and added back

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223309#comment-13223309 ] 

Andrzej Bialecki  commented on LUCENE-3854:
-------------------------------------------

I suspect the problem lies in DocumentStoredFieldVisitor.stringField(...). It uses FieldInfo to populate the FieldType of the retrieved field, and there is no information there about tokenization (so it assumes true by default). AFAIK the information about tokenization is lost once the document is indexed, so it's not possible to retrieve it, hence the use of a default value.
                