You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Andrzej Bialecki (Created) (JIRA)" <ji...@apache.org> on 2012/03/01 12:57:59 UTC

[jira] [Created] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Most Codec.*Format().*Reader() methods should use SegmentReadState
------------------------------------------------------------------

                 Key: LUCENE-3836
                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/codecs
            Reporter: Andrzej Bialecki 
            Assignee: Andrzej Bialecki 
             Fix For: 4.0


Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220082#comment-13220082 ] 

Michael McCandless commented on LUCENE-3836:
--------------------------------------------

{quote}
The background for this issue is that I started looking at updateable fields, where updates are put in a segment (or reader) of its own and they provide an "overlay" for the main segment, with a special codec magic to pull and remap data from the "overlay" as the main data is accessed. However, in order to do that I need to provide this data when format readers are initialized. I can't do this when creating a Codec instance (Codec is automatically instantiated) or when creating Codec.*Format(), because format instances are usually shared as well.
{quote}

Sweet!

Couldn't the stacking/overlaying live "above" codec?  Eg, the codec thinks it's reading 3 segments, but really the code above knows there's 1 base segment and 2 stacked on top?
                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220058#comment-13220058 ] 

Michael McCandless commented on LUCENE-3836:
--------------------------------------------

I agree catch-all "argument holder" classes are dangerous... they can bloat over time and probably lead to bugs...
                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220003#comment-13220003 ] 

Robert Muir commented on LUCENE-3836:
-------------------------------------

I think this change is OK: I just want to mention that avoiding SegmentReadState 
was definitely intentional... well most of my issues are really based on
SegmentWriteState, but I think the whole concept is broken, see below:

SegmentWriteState is bad news, for many codec APIs
they would be underpopulated, or even have bogus data!

For example, what would be SegmentWriteState.numDocs for StoredFieldsWriter?

I understand that at a glance having foo(A) where A has A.B and A.C and A.D seems simpler than foo(B, C),
but I think its confusing to pass "A" at all if there is an A.D thats somehow bogus, invalid, etc.

In that case its actually much clearer to pass B and C directly... personally I think we 
should revisit these 'argument holder' APIs and likely remove them completely.

Because of that: for most codec APIs I avoided SegmentWriteState and also SegmentReadState (for symmetry).
                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Andrzej Bialecki (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-3836:
--------------------------------------

    Attachment: LUCENE-3836.patch

Patch that implements the change. If there are no objections I'd like to commit this soon.
                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220093#comment-13220093 ] 

Andrzej Bialecki  commented on LUCENE-3836:
-------------------------------------------

I think this could work, too - I would instantiate the "overlay" data in SegmentReader, and then I'd create the "overlay" codec's format readers in SegmentReader, using the original format readers plus the overlay data. I'll try this approach ... I'll create a separate issue to discuss this.

Let's close this as won't fix for now.
                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220071#comment-13220071 ] 

Andrzej Bialecki  commented on LUCENE-3836:
-------------------------------------------

I hear you .. SegmentWriteState is bad, I agree. But the argument about SegmentWriteState is not really applicable to SegmentReadState - write state is mutable and can change under your feet whereas SegmentReadState is immutable, created once in SegmentReader and used only for initialization of format readers. On the other hand, if we insist that we always pass individual arguments around then providing some additional segment-global context to format readers requires changing method signatures (adding arguments).

The background for this issue is that I started looking at updateable fields, where updates are put in a segment (or reader) of its own and they provide an "overlay" for the main segment, with a special codec magic to pull and remap data from the "overlay" as the main data is accessed. However, in order to do that I need to provide this data when format readers are initialized. I can't do this when creating a Codec instance (Codec is automatically instantiated) or when creating Codec.*Format(), because format instances are usually shared as well.

So the idea I had in mind was to use SegmentReaderState uniformly, and put this overlay data in SegmentReadState so that it's passed down to formats during format readers' creation. I'm open to other ideas... :)
                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220105#comment-13220105 ] 

Robert Muir commented on LUCENE-3836:
-------------------------------------

{quote}
(The reason I'm doing this at the codec level is that I wanted to avoid heavy mods to SegmentReader, and it's easier to visualize how this data is re-mapped and stacked at the level of fairly simple codec APIs).
{quote}

But SegmentReader is fairly simple these days, its just basically a pointer to a core (SegmentCoreReaders) + deletes.

Maybe it should stay the same, but instead we could have a StackedReader (perhaps a bad name), that points to multiple cores + deletes + mask files or whatever it needs and returns masked enums over the underlying Enums itself (e.g. combining enums from the underlying impls, passing masks down as Bits, and such). SegmentReader would stay as-is.

                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Resolved] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Andrzej Bialecki (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved LUCENE-3836.
---------------------------------------

    Resolution: Won't Fix

Thanks for the insightful comments - this looks promising. I opened LUCENE-3837 to discuss a broader design for updateable fields.
                
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3836) Most Codec.*Format().*Reader() methods should use SegmentReadState

Posted by "Andrzej Bialecki (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220093#comment-13220093 ] 

Andrzej Bialecki  edited comment on LUCENE-3836 at 3/1/12 3:22 PM:
-------------------------------------------------------------------

I think this could work, too - I would instantiate the "overlay" data in SegmentReader, and then I'd create the "overlay" codec's format readers in SegmentReader, using the original format readers plus the overlay data. I'll try this approach ... I'll create a separate issue to discuss this.

(The reason I'm doing this at the codec level is that I wanted to avoid heavy mods to SegmentReader, and it's easier to visualize how this data is re-mapped and stacked at the level of fairly simple codec APIs).

Let's close this as won't fix for now.
                
      was (Author: ab):
    I think this could work, too - I would instantiate the "overlay" data in SegmentReader, and then I'd create the "overlay" codec's format readers in SegmentReader, using the original format readers plus the overlay data. I'll try this approach ... I'll create a separate issue to discuss this.

Let's close this as won't fix for now.
                  
> Most Codec.*Format().*Reader() methods should use SegmentReadState
> ------------------------------------------------------------------
>
>                 Key: LUCENE-3836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3836
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: LUCENE-3836.patch
>
>
> Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org