You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Simon Willnauer (JIRA)" <ji...@apache.org> on 2011/01/24 08:32:43 UTC

[jira] Created: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Track FieldInfo per segment instead of per-IW-session
-----------------------------------------------------

                 Key: LUCENE-2881
                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: Realtime Branch, CSF branch, 4.0
            Reporter: Simon Willnauer
             Fix For: Realtime Branch, CSF branch, 4.0


Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.

The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 

We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008777#comment-13008777 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

{quote}
bq. Why do we default SegmentInfos.format now...? Seems spooky?

this hasn't been used in SIS before so I think it didn't matter before.
Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine....
{quote}

Ahh, I see: it's for the case where we make a new SIS() in RAM, because we'll now look @ the format in files().  OK this sounds right then.

{quote}
bq. Should we backport this to 3.x (after sufficient aging)?

I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while.
{quote}

Definitely let it bake!

Also, I have lots of pending backports to 3.2... which this patch likely overlaps on, so we should try to do them "in order" to reduce conflicts I think.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2881:
----------------------------------

    Attachment: lucene-2881.patch

I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon.

About codecIds:  I made the fix to FieldInfo.clone() to set the codecId on the clone.  I also made FieldInfo.codecId private and added getter and setter.  The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec).

All tests pass.  Please let me know if that fixes your problem.  If not then you should at least see the new exception that I added, which might make debugging easier.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999809#comment-12999809 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

One question: we seem to have lost DocFieldProcessorPerThread.trimFields?

If an app indexes docs where each doc has different fields... does this mean we are not going to reclaim the DFPPerField for the fields that are no longer seen after flushing?  Ie, would it be a memory leak?  Somewhere we have a test case for this I think.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992340#comment-12992340 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. Maybe we can simply implement Iterable<FieldInfo>?

good idea - done.


bq. Maybe we can rename SI#clearFilesCache()

Actually I renamed it intentionally, because all this method does is really clearing the files cache.  SI has a separate reset() method for resetting its state entirely.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992175#comment-12992175 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

Thanks for reviewing!

bq. I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue.

I just want to add another tests for the global fieldname->number map, after that I think it'll be ready to commit.  Will do that tonight :)

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Resolved] (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer resolved LUCENE-2881.
-------------------------------------

    Resolution: Fixed

merged to docvalues and realtime branch.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001214#comment-13001214 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. maybe, we should store the FieldInfos inside the segments file? Hmmm....

I had the same thought while adding the ref to FieldInfos to SegmentInfo.  Actually this is probably the right thing to do.  At the same time we could switch to a human-readable format :)


bq. I fear we may not necessarily ever "stabilize" on a fixed global name/number bimap, because we re-compute this map on every IW init?

We could also store the global map on disk?  addIndexes() would have to ignore the global map from the external index(es).



> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Reopened: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reopened LUCENE-2881:
-----------------------------------


Reopening to make the described improvement that ensures consistent field numbers.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2881:
------------------------------------

    Attachment: LUCENE-2881.patch

next iteration, I think we are ready to commit here. I added a couple of testcases regarding the fnx file and made sure they get deleted accordingly even if we fail with Exceptions during prepareCommit & finishCommit. Moved the BW-Compat code for building the initial global map to SegmentInfos and cleaned up assertions in FieldInfos (also added the suggested assertion to FIs#putInternal). To prevent that we miss a fnx file if we open an old index, write the fnx file and keep that SIS in memory with format set to some old version I removed the check if we are on 4.0 index but used the latestGlobalFieldNumberVersion which is kept consistent and only include the file if its not set to 0.

I run whileTrue tests on this patch now for a while and things are looking good from my side. Mike if you have time let beast chew it again. If its fine I will commit tomorrow.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985759#action_12985759 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.

Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs. 

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008530#comment-13008530 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

Patch looks much better -- I think it's close!  Comments:

  * I think we should put header (id + version, ie
    CodecUtil.write/readHeader) on fnx file?

  * When does FieldNumberBiMap init from another...?  (We have ctor
    that takes a FieldNumberBiMap in... but I don't see who uses
    that).

  * Who uses BiMap.entries()?  Looks like just for testing?  Can you
    add comment saying "just for testing.."?

  * FieldInfos ctor that loads from index file name but makes a new
    bimap seems spooky...?  Is this only used by tests now...?

  * The passed in "version" to SIS.readGlobalFieldMap is unused

  * I'm a little worried that we name the new file _X.fnx, because it
    will appear that this file 'belongs' to segment X, which is
    dangerous because in some recovery cases we will remove all
    files associated w/ a given segment (ie, _X.*).  Maybe, we can
    name it without the leading _?  Ie, 0.fnx, 1.fnx, etc.?

  * In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be
    moved inside SegmentInfos?  Ie it's a bit weird now how we first
    pull an (empty) biMap from the legacy SegmentInfos then we go and
    populate it ourselves...?  Then we don't need this method in IW
    (we just ask the SIS for the bimap).

  * Should FieldInfo.putInternal(FieldInfo) assert that the global
    bimap "matches"?  (Ie, it makes no effort to update the global
    bimap).

  * The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't
    needed; we pass hasProx to the ctor of that si.

  * Why do we default SegmentInfos.format now...?  Seems spooky?

  * In SegmentInfos.rollbackCommit shouldn't we set the
    pendingMapVersion to -1?

Should we backport this to 3.x (after sufficient aging)?


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992156#comment-12992156 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

{quote}
New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos.
{quote}

Wow! nice work Michael! I like how you preserve bw compat in SegmentInfo and FieldInfos is now bound to SegmentInfo - yay! this solves two problems at once for DocValues branch.

bq.The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files.

I think  copying over is fine. Ideally we will move all those boolean etc to the codec level so that we don't need that at all. Once stored fields and vectors are written by the codec we can push all that into PreFlex codec (maybe!?) and get rid of the bw compat code.

I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992382#comment-12992382 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. All tests pass. I'll commit this in a day or two if nobody objects.

+1

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2881:
----------------------------------

    Attachment: lucene-2881.patch

New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo.  Instead SegmentInfo now has a reference to its corresponding FieldInfos.  

For backwards-compatibility reasons we can't completely remove the hasVectors and hasProx bytes from the serialized SegmentInfo yet.  Eg. if someone uses addIndexes(Directory...) to add external old pre-4.0 segments to a new index, we "upgrade" the SegmentInfo to the latest version.  However, we don't modify the FieldInfos of that segment, instead we just copy it over to the new dir.  So the hasVector and hasProx bits in the FieldInfos might not be accurate and we have to keep those bits in the SegmentInfo instead.  Not an ideal solution, but we can remove it entirely in Lucene 5.0 :).  The alternative would be to
rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files.

All core & contrib tests pass.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000438#comment-13000438 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. Can't we sync globally on the assignment of field name -> number (the global map lookup)?  And FieldInfos per-DWPT would share the same global map.  Wouldn't that keep us consistent in the DWPT case?


Yes.  The global map is already shared across DWPTs and the lookup is synchronized on the global map.  I think if we change the logic to always pick the next available global number we would increase the likelihood that fields get bulk-merged.

It can't be perfect though, because e.g. if you use addIndexes() to add an external segment that has a different field number assignment.  That's why we definitely have to keep the code that can fallback to  a different local number if it's not possible to use the global number in a segment.  But I agree that we should optimize for the "normal" indexing case.  And it seems like we all agree that field number gaps are fine.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009340#comment-13009340 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

OK beast chewed on this for a few hours -- ran all (Lucene + Solr)
tests 144 times.

The UIMAUpdateRequestProcessorTest had a number of failures, but I
suspect they are unrelated. (They could also be due to how I run the
test -- I'm using the python runner from luceneutil).

So I think it's good!  I'll review the patch one more time...


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001534#comment-13001534 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

{quote}
Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads. Took beast a while to repro but eventually it did...
{quote}

this is a problem we have seen before. This is an exception where we write a single doc segment and fail inverting the doc after FI has been updated. Since hasVectors has been moved to FI out of SI this is inconsistent and the TV tries to open the files since hasVectors is true. 

This is also why buschmi added clearVectors (just as a workaround afaik)

bq. Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!).

man I still try to trigger it in isolation but I can spend too much time on that right now. Got sucked into Tiered Flushing ;)

About the other comments (the big bulletpoint list) - in the meanwhile I think we should split this up in several smaller issues. 
 * start with resetting the FI after flush to make the flags consistent for each segment
 * move FIs into SI with hasVectors etc still in SI
 * factor hasVectors out of SI into FI
 * introduce a global field num map (maybe on realtime first?) and / or store global map
 * ...
We can still use this patch as a PoC does that make sense?



> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008552#comment-13008552 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. Should we backport this to 3.x (after sufficient aging)?
I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while.


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000153#comment-13000153 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

I mentioned on dev that assigning the same field number across segments is best effort now and wanted to explain in greater detail here how it works:

There is now a global fieldName <-> fieldNumber bi-map in FieldInfos, which contains all fieldName/number pairs seen in a IndexWriter session.  It is passed into each new FieldInfos that is created in the same IndexWriter session.

Also, when a new IndexWriter is opened, the FieldInfos of all segments are read and the global map created - this is tested in a new unit test this issue adds.

A FieldInfos has in addition to the reference to the global map also a "private" map, which holds all FieldInfo objects that belong to the corresponding segment (remember there's now a 1-1 mapping SegmentInfo->FieldInfos).  

Now the fieldNumber assignment strategy works as follows:  If a new FI is added to FieldInfos, the global map is checked for the number of that field.  If the field name hasn't been seen before, the smallest number available in the *local* map is picked (to keep the numbers dense).  
Otherwise, if we have seen the field before, the global number is used.  The problem now might be, that the global number might already be taken in the local FieldInfos.  In this case the global and local numbers for the same fieldName would differ.  This is not a problem in terms of correctness, but could prevent that field from being efficiently bulk-merged.

With DocumentsWriterPerThreads (DWPTs) in mind I don't see how we could guarantee consistent field numbering across DWPTs, that's why I implemented it in this "best effort" way.

Here's an example on how we can get into a situation where a field would get different numbers in different segments:
segment_1 has fields A and B, therefore these mappings A -> 1, B -> 2.
Now in segment_2 the first field we add is C, which hasn't been seen ever before, so we pick locally number 1 for it.  Then we add the next document which has field A, but since number 1 is already taken, it would get a different number than in segment_1.  This means A would not get bulk merged.

Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think?  The disadvantage would be that FieldInfos could have "gaps" in the numbers.  I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal?

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000180#comment-13000180 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

Thanks for the clarification, michael!

bq. ... the smallest number available in the local map is picked (to keep the numbers dense). ...
Oh man I was not aware of this. I got totally confused... see next comment

{quote}
Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think? The disadvantage would be that FieldInfos could have "gaps" in the numbers. I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal?
{quote}

This is how I thought it works but I obviously got confused by global vs. local this is also why I had trouble to understand how that failure could ever have happened. But after looking at the code again this makes sense ;) I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least.



> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997475#comment-12997475 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. Uses -1 now as initial value for codecID.
alright that looks good!

bq. Let me know if it works without problems now in the doc values branch, Simon!
this looks good now. I think we need to add a CHANGES.TXT entry for this issue since it changes the runtime behavior. +1 to commit Nice work!

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-2881:
-------------------------------------

    Assignee: Michael Busch

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000581#comment-13000581 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

For the record, robert reverted the changes made by this issue since we have been experiencing a fair bit of [problems|https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/5420/] lately. 

eventually reproducible with:
{code}
ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValued -Dtests.seed=-4971136915249645135:5200209917417531291 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValuedFcs -Dtests.seed=-4971136915249645135:-3738166620811568832 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixMultiValued -Dtests.seed=-4971136915249645135:4594369826150277150 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValued -Dtests.seed=-4971136915249645135:-7702531001769827248 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValuedFcs -Dtests.seed=-4971136915249645135:698398490325732548 -Dtests.multiplier=3

{code}

I found the problem causing this where certain field numbers got mixed up when the FieldInfos get build initially in IndexWriter and a segment is loaded first which had gaps in its field numbering. 
FieldInfos is ignoring the FieldInfo's number if the FieldInfo does not exist yet and tries to assigne a new "local" field number. But if the next available field number x while the actual FI's number was > x+1 the new added FI will be set to x instead.

in other words, lets say we have 2 segments:
{code}
 seg1 : { fields : [(a:0, c:2)] } 
 seg2 : { fields : [(a:0, b:1, c:2)] } 
{code}
if we load seg1's FI we end up with 

{code}fields : [(a:0, c:1)] {code}

then we add seg2's FI's and end up with 

{code}fields : [(a:0, c:1, b:2)] {code}

this will also explain the TestNRTThreads.testNRTThreads failure where bulkMerge could not be applied due to different field numbers across segments.

I will upload a patch tomorrow.


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008116#comment-13008116 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

bq. Mike would you be so kind and let beast break it?

Beast ran Lucene's core+contrib tests 398 times!

There was one failure, but I suspect it's unrelated.  I'll open a separate issue...

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-2881:
---------------------------------------

    Assignee: Simon Willnauer  (was: Michael Busch)

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985852#action_12985852 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

It would probably make sense to have a new class (maybe an extension of SegmentInfo) for in-memory (not-yet-flushed) segments that references the corresponding FieldInfos and SegmentDeletes.  That'd be better I think that adding another map SegmentInfo -> FieldInfos and we could then also remove the SegmentInfo -> SegmentDeletes map (in BufferedDeletes).

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008892#comment-13008892 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. if we open a pre-4.0 SIS, make changes, write a new SIS (commit), do we change that new instance's format to 4.0 (in RAM)? Ie, so that files() is correct...?

hmm actually we don't while the DefaultSIWriter changes it upon write which is not reflected to ram. We might should do that either before we write in DSIWriter. The other thing is that we should carry over the format from the oldSegment in the case we open a specific commit point. I will write a test first that hopefully triggers a fail and fix that accordingly.


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985844#action_12985844 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs.
i think we should do that on trunk and then merge to RT - do you have time to work on this soon?


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2881:
------------------------------------

    Attachment: LUCENE-2881.patch

Here is an updated patch against trunk that fixes the problems we had with SimpleFacetsTest. 
I added a testcase that triggered anyNonBulkMerges to be false as well as field number to be mixed up across segments. This patch also tracks the fieldNumbers globally within a IW session.

I have been running tests with while(true) { ant clean test } on a top level directory without any failure for 12 hours now. Aside of that the testcase I added failed reliably with the previous version and it explains all failures we have see so far especially the one with NRTThreads.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985853#action_12985853 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. i think we should do that on trunk and then merge to RT - do you have time to work on this soon?

Yeah I agree.  Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2881:
----------------------------------

    Attachment: lucene-2881.patch

New patch that adds a new junit for testing that field numbering is consistent across segments.  It tests two cases: 1) one IW is used to write two segments; 2) two IWs are used to write two segments.  
And it also tests that addIndexes(Directory...) doesn't mess up the field numbering of the external segment.

All tests pass.  I'll commit this in a day or two if nobody objects.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992174#comment-12992174 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

I gave the patch another glance - here are a couple of very minor comments:

* maybe we should return Iterable<FieldInfo> from FieldInfos#getFieldInfoIterator() this would make the iteration syntactically more javaish 
{code} 

for (FieldInfo info : getFieldInfoIterator()) {
  // do something with it
}

{code}

* If we return Iterable<FieldInfo> should we name it getFieldInfos?
* Maybe we can simply implement Iterable<FieldInfo>?
* Maybe we can rename SI#clearFilesCache() to SI#clearCache() or simply SI#clear() this would make thing less coupled to the SI internal cache but rather something that is called to clear internal state after flush? 

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001443#comment-13001443 ] 

Yonik Seeley commented on LUCENE-2881:
--------------------------------------

w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)?

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2881:
----------------------------------

    Attachment: lucene-2881.patch

  * Creates for every segment a new FieldInfos
  * Changes FieldInfos, so that the FieldInfo numbers within a single FieldInfos don't have to be contiguous - this allows using the same numbering as the previous segment(s), even if not all fields are present in the new segment
  * Adds a global fieldName -> fieldNumber map;  if possible when a new field is added to a FieldInfo it tries to use an already assigned number for that field

All tests pass.  Though I need to verify if the global map works correctly (it'd probably be good to add a test for that).  Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue. 

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992524#comment-12992524 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

Awesome, thanks for letting me know!  I hope I'll be able to say the same about the RT branch after I tried it there... :)

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992476#comment-12992476 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

I just integrated this patch to the docvalues branch! It works like a charm! Nice work Michael, this brings docValues a huge step closer. All tests pass which failed before, FieldInfo is reliable now!

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985959#action_12985959 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. Yeah I agree. Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday.
Michael, if you start something I can work on it tomorrow for a while too. That way we can get it in quickly ;) 

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009718#comment-13009718 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

Committed to trunk - I will keep this open until RT and docvalues have synced up with it.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001590#comment-13001590 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

{quote}
w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)?
{quote}

I think we have three options here:
1) Store FIs inside of compound file (current patch)
2) Store FIs in SI
3) Store FIs in own file outside of compound file

Each has disadvantages:

1) more expensive to open SegmentInfos, as it now has to open cfs files to load FIs
2) SI files becomes bigger
3) one more file descriptor per segment - but can be closed as soon as FIs was read into memory

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999928#comment-12999928 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

bq. I actually renamed it to doAfterFlush(). It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos. I think this should be fine?

Ahh, good, I think you're right!  I missed that we clear this on flush.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009610#comment-13009610 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. We still need a header (id + version) on the fnx file?
its there I guess you got the wrong patch.

{quote}
FieldInfos ctor that loads from index file name but makes a new
bimap seems spooky...? Is this only used by tests now...?
{quote}

Well this on is only for the read case where we open a FIS. Yet, we need to do this since we store a fnx file per SIS and upon IW#addIndexed(Directory) we could have a FIS that has field number different to the global map. This is fine as long as we don't seed the FIS on read. I will open another issue to make this case more efficient and assert that the FIS is read-only once we created the FIS from a directory.

I fixed all the remaining issues and will go ahead and commit now. Thanks mike

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997129#comment-12997129 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon

ah that could be the reason. I will need to patch my branch again to see if your patch helps. Will do tomorrow.

{quote}
About codecIds: I made the fix to FieldInfo.clone() to set the codecId on the clone. I also made FieldInfo.codecId private and added getter and setter. The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec).
{quote}

The clone issue seems to be fixed in your latest patch. While the setter seems kind of wrong. Lemme explain how that numbering works. If you create a SI for a flushed segment SegmentCodecs creates an ordinal for each codec and sets it to the corresponding fields. The ordinal (codecID) is the array index in the SegmentCodec's Codec array which holds the codec instance used for that field. So codecID = 0 is a valid value for segments having PreFlex or a >= 4.0 codec. But if we open a segment that is pre-flex there will only be one codec for the entire segment with codecID=0, thats why this is assigned. (Note: I need to document this where its set!) I think we should initialize the codecID with a different value and replace the this.codecId != 0 check with something like this.codecId != -1. Maybe we should just use an assert here instead of an exception, this is somewhat internal though.

{quote}
All tests pass. Please let me know if that fixes your problem. If not then you should at least see the new exception that I added, which might make debugging easier.{quote}

Will do tomorrow! What exactly was the problem with the previous patch beside the codecID clone issue?


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001011#comment-13001011 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------


This is a big patch!  Some comments...:

  * Do we even use FieldInfos.add(Document)?  Maybe we can remove
    it...

  * FieldNumberBiMap.addOrGet doesn't need to take the FieldInfoBiMap?
    (It's an unused param).

  * Why create the FieldInfoBiMap class?  Ie, why not "merge" that
    back into FieldInfos itself (like it used to be)?  (I understand
    why we need FieldNumberBiMap -- so we can share a single instance
    across FieldInfos).

  * We mix up autoboxing then unboxing, eg
    FieldInfos.addOrUpdateInternal takes int preferredFieldNumber,
    which is boxed when calling localFieldInfos.nextFieldNumber, then
    manually unboxed on calling globalFieldNumbers.addOrGet.

  * On working with a pre-4.0 index that has non-congruent assignments
    across segments... I fear we may not necessarily ever "stabilize" on
    a fixed global name/number bimap, because we re-compute this map
    on every IW init?  Ie, when you first open 4.0 IW on pre-4.0
    index, you'll compute a certain global map, and write new segs
    with those bindings.  But say some fields differed in their
    assignment... but then some of those conflicting segs are merged.
    Later, when you open the IW again, you'll get a different global
    map?  And write new segments conflicting with the previous new
    segments you had written?

  * The fact that SegmentInfo.clearFilesCache is now a public API and
    consumer is responsible for knowing when to call this
    is... spooky.  Previously this cache was a fully private thing (ie
    invalidated whenever a change was made to the SegmentInfo).  But I
    don't see any way around it; since we now embed a FieldInfos and
    that FieldInfos (hasVectors) could change... maybe, whenever we
    call this method, add a comment explaining why?

  * It makes me nervous that the API that's allowed to pick a new
    field number (FieldInfo.addInternal) is the same API that used
    when reading a FieldInfos from _X.fnm (when we better not pick a
    different field number!).  In theory of course those
    field numbers will never conflict and we'll always get our
    preferred field number... but still.  Maybe add an assert in
    FieldInfos.read() verifying we always get that field number?

  * The call to localFieldInfos.setIfNotSet in FieldInfos.addInternal
    makes me nervous... is it actually possible for it to already be
    set, to a conflicting binding?  Shouldn't it always match? (Ie,
    above, in addOrUpdateInternal, we just consulted the global map to
    get the binding).  Can't we assert the global binding is either
    not present (and we add it) or if it is present it "matches"?

  * Why do we have FieldInfos.clearVectors?  Nobody should call
    that...?

  * It's not great that we open then close the CFS reader inside
    SegmentInfo, just to read the FieldInfos.  Ie, this means on
    opening an SR we will open this CFS reader twice... it also means
    that opening a SegmentInfos is quite a bit more costly than it
    used to be.  EG creating an IW must now go and open/close a CFS reader
    per-segment... not sure what we can really do about that
    though... maybe, we should store the FieldInfos inside the
    segments file?  Hmmm....

  * Shouldn't IndexWriter.getFieldInfos(SegmentInfo) use the
    SegmentInfo's fieldInfos rather than loading it again from the
    directory...?

Indentation is also off in various places, for us lonely people who
still use Emacs ;)


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008550#comment-13008550 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

{quote}
I think we should put header (id + version, ie
CodecUtil.write/readHeader) on fnx file?
{quote}
man, I told myself to add it about 10 times during that patch :)

bq. When does FieldNumberBiMap init from another...? 
ah legacy

{quote}
I'm a little worried that we name the new file _X.fnx, because it
will appear that this file 'belongs' to segment X, which is
dangerous because in some recovery cases we will remove all
files associated w/ a given segment (ie, _X.*). Maybe, we can
name it without the leading _? Ie, 0.fnx, 1.fnx, etc.?
{quote}
right, good catch! that we can simple remove and then it should be clear that its not a file belonging to a certain segment

bq. In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be....
yeah thats is where is should be though. I will move it in.

bq. The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't...
true I will remove
bq. Why do we default SegmentInfos.format now...? Seems spooky?
this hasn't been used in SIS before so I think it didn't matter before.
Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine....

bq. In SegmentInfos.rollbackCommit shouldn't we set the pendingMapVersion to -1
ah good catch! thanks

I will fix those issues and upload another patch. Thanks mike for reviewing!!!!



> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999848#comment-12999848 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. One question: we seem to have lost DocFieldProcessorPerThread.trimFields?

I actually renamed it to doAfterFlush().  It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos.  I think this should be fine?

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2881:
------------------------------------

    Attachment: LUCENE-2881.patch

bq. Beast ran Lucene's core+contrib tests 398 times!

awesome. Thanks for letting it chew on that patch for a while. 

I added more tests to ensure field numbers are stable and some jdoc comments to package private apis. This patch also (hopefully) solves all indentation issues.



> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-2881.
-----------------------------------

    Resolution: Fixed

Committed revision 1073110.

Thanks for reviewing the patch, Simon!

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000229#comment-13000229 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

This was an impressive change :)  Not only did we shift FieldInfos under SegmentInfo, we also fixed a given FieldInfos to allow for sparse mapping (ie, it only contains certain field numbers).  Various places previously assumed this was a dense mapping.

I think even in the DWPT case we should try to assign consistent field numbers?  I'm already worried enough about the possible perf hit DWPT will put on normal indexing and the NRT case, since it basically pushes merging out from RAM and onto the "normal" more costly disk based merging.  If we also suddenly risk these merges not being bulk merges, that's even worse.

Can't we sync globally on the assignment of field name -> number (the global map lookup)?  And FieldInfos per-DWPT would share the same global map.  Wouldn't that keep us consistent in the DWPT case?

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000185#comment-13000185 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq.  I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least.

I was thinking that we probably write field numbers as VInts in a lot of places, and it would therefore be less efficient to have gaps... but this is probably negligible.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2881:
------------------------------------

    Attachment: LUCENE-2881.patch

Attached next iteration.

* merged FieldInfoBiMap into FieldInfos
* fixed AutoBoxing issues
* reverted the deprecation of SegmentInfo#hasVector / hasProx since they where causing many failures due to missing files on exceptions. This also allowed me to make SegmentInfo.clearFilesCache private again.
*the FieldInfoBiMap now asserts that if you add a field via setIfNotSet that if either the number or the name exists they are the same as in the mapping.
* added lazy loading of FieldInfos in SegmentInfo to prevent opening the CFS file while opening SegmentInfo. SR#CoreReader now passes in the already opened CFSReader to load the FieldInfos
* when we open a IW we load the BiMap directly from the SegmentInfos

* SegmentInfos now owns the FieldInfoBiMap and writes the map upon commit into an additional .fnx file outside of the CFS file. The SegmentInfos persist the bimap only if it has changed otherwise references the previously written map. 

I still have one nocommit since I want to add more tests for the persisted global map. Beside that all tests pass and I would appreciate a review though.
Mike would you be so kind and let beast break it?



> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001010#comment-13001010 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!).

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992038#comment-12992038 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

Michael, this looks very good!

{quote}
All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue.
{quote}
I agree we should make FieldInfos a member of SegmentInfo and remove the hasVectors / hasProx an push them down. I tried to apply this patch to the DocValues branch but I didn't get very far - I haven't merged for a while that killed me ;( 
I need to push that to after my vacation. I really hoped that I can get further but it didn't work out ;(

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2881:
----------------------------------

    Attachment: lucene-2881.patch

- Uses -1 now as initial value for codecID.
- updated to current trunk

Let me know if it works without problems now in the doc values branch, Simon!

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009599#comment-13009599 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

Patch looks great!  Only a few things:

  * We still need a header (id + version) on the fnx file?

  * FieldInfos ctor that loads from index file name but makes a new
    bimap seems spooky...? Is this only used by tests now...?

  * Can you add a comment where we create the N.fnx name, explaining
    why it has no leading _?  (ie, because it's not a per-segment
    file, but rather a global file, shared by multiple segments)

  * In CHANGES entry, persistend is mis-spelled (need to drop the n);
    also remove the _ from _X.fnx, and add . after "successful commit".


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997252#comment-12997252 ] 

Michael Busch commented on LUCENE-2881:
---------------------------------------

bq. I think we should initialize the codecID with a different value and replace the this.codecId != 0 check with something like this.codecId != -1.

Yeah, I had the same though.  I changed it to use -1 and use an assert now instead of throwing the exception.
(will post the new patch shortly)

bq. What exactly was the problem with the previous patch beside the codecID clone issue?

Not sure if that's what caused your codecID issues, but the previous patch had a problem with assigning field numbers.  It could happen that a global number for a FieldInfo was acquired, but that number wasn't available anymore in the local FieldInfos.  I think this would be quite rare, but now I'm preventing this from happening.

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008778#comment-13008778 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

Hmm, but: if we open a pre-4.0 SIS, make changes, write a new SIS (commit), do we change that new instance's format to 4.0 (in RAM)?  Ie, so that files() is correct...?

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, LUCENE-2881.patch, LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992707#comment-12992707 ] 

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

I actually have some intermitted failures on docvalues which seem to be caused by some mixed up codec IDs. I found one which fixed most of the cases in FieldInfo#clone() where the codec ID is lost. But there must be some other situation where the codec ID gets mixed up. I will be AFK for 10 days at least so I have to leave you alone with that. Seems that we need a test for that though. The docValues test bring that up quickly :(

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001372#comment-13001372 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

bq. I had the same thought while adding the ref to FieldInfos to SegmentInfo. Actually this is probably the right thing to do. At the same time we could switch to a human-readable format 

Human-readable format would be sweet :)

Though I'm still generally nervous that this means just opening the segments file will become quite a bit more costly.  Apps that have many fields will be especially penalized, though, apps really should not be creating so many fields.

bq. We could also store the global map on disk?

That's an interesting idea?  That'd ensure stability on the bindings, even for pre-4.0 indices.  This way a pre-4.0 index would gradually work itself towards being fully consistent...

bq. addIndexes() would have to ignore the global map from the external index(es).

Well, addIndexes(IR[]) would get fully remapped to the correct bindings?  (since it's a real merge).

But, yes, addIndexes(Dir[]) would not -- they are just file-copied.  Hmm, they'd also presumably have a different global map, so if we stored the index global map in the Directory, how would we resolve conflicts on the incoming addIndexes...?  I guess the local mapping for the incoming segments would override the global one, on conflict.


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001009#comment-13001009 ] 

Michael McCandless commented on LUCENE-2881:
--------------------------------------------

Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads.  Took beast a while to repro but eventually it did...

I'm also seeing this in TestIndexWriterExceptions:

{noformat}
NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testDocumentsWriterExceptions -Dtests.seed=-1056365072156857559:-3700694839465596125


There was 1 failure:
1) testDocumentsWriterExceptions(org.apache.lucene.index.TestIndexWriterExceptions)
java.io.FileNotFoundException: _1.tvx
	at org.apache.lucene.store.RAMDirectory.fileLength(RAMDirectory.java:147)
	at org.apache.lucene.store.MockDirectoryWrapper.fileLength(MockDirectoryWrapper.java:524)
	at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:290)
	at org.apache.lucene.index.LogMergePolicy.sizeBytes(LogMergePolicy.java:211)
	at org.apache.lucene.index.LogByteSizeMergePolicy.size(LogByteSizeMergePolicy.java:45)
	at org.apache.lucene.index.LogMergePolicy.useCompoundFile(LogMergePolicy.java:165)
	at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:629)
	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2538)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2503)
	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1077)
	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1041)
	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1005)
	at org.apache.lucene.index.TestIndexWriterExceptions.testDocumentsWriterExceptions(TestIndexWriterExceptions.java:565)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1213)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1145)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.junit.runners.Suite.runChild(Suite.java:128)
	at org.junit.runners.Suite.runChild(Suite.java:24)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
	at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98)
	at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53)
	at org.junit.runner.JUnitCore.main(JUnitCore.java:45)

FAILURES!!!
{noformat}

The failure reproduces for me (with the above seed)...

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support.  Codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true  since those values are reused from previous segments. 
> We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.  

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org