You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2011/11/05 02:09:51 UTC

[jira] [Created] (LUCENE-3560) add extra safety to concrete codec implementations

add extra safety to concrete codec implementations
--------------------------------------------------

                 Key: LUCENE-3560
                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 4.0
            Reporter: Robert Muir
         Attachments: LUCENE-3560.patch

In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
nature, only IndexWriter.

Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
Service Provider Mechanism.

So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
to IndexWriter), this enforces that they must write any configuration information they need into
the index, so that we don't have a flimsy API.

I think we should go even further, for additional safety. Any methods on our concrete codecs that
are not intended to be subclassed should be final, and we should add assertions to verify this.

For example, SimpleText's files() implementation should be final. If you want to make an extension
of simpletext that has additional files, then this is a different index format and should have a
different name!

Note: This doesn't stop extensibility, only stupid mistakes. 
For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
formats (which it writes into a .per file into the index, so that it knows how to read each field).

{code}
private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return Lucene40Codec.this.getPostingsFormatForField(field);
  }
};

...

@Override
public final PostingsFormat postingsFormat() {
  return postingsFormat;
}

...

  /** Returns the postings format that should be used for writing 
   *  new segments of <code>field</code>.
   *  
   *  The default implementation always returns "Lucene40"
   */
  public PostingsFormat getPostingsFormatForField(String field) {
    return defaultFormat;
  }
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3560:
--------------------------------

    Attachment: LUCENE-3560.patch
    
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144645#comment-13144645 ] 

Uwe Schindler commented on LUCENE-3560:
---------------------------------------

Heavy reflecting in the good old TokenStream assertFinal (I was so unhappy when Analyzer was restructured that it had to go there... *g*).

Some comments:
- We should maybe also add a check that there is at least a default constructor available. this.getClass().getConstructor() does not throw exception
- In general, subclassing a Codec or a PostingsFormat is wrong (except the Lucene3x hack). If you subclass a codec/PF, you can no longer  change it's name. So anybody who subclasses a codec will produce a clone with the same name but perhaps another index format. This is prevented by Robert's finalness on the format hooks, but what else could a codec do different if its not final without breaking index format?
- I think even 3x Codec should be final and not subclassed by the RW codec. The RW Preflex codec in tests should subclass abstract Codec, and simply delegate all "read" methods to the RO-Codec [I am not sure if this all works as its very complicated... *g* - I only mention: new Exception().getStackTrace() to inspect call stack... highly sophisticated!].
                
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144714#comment-13144714 ] 

Andrzej Bialecki  commented on LUCENE-3560:
-------------------------------------------

Let's not go too far either... we want people to write and contribute new codecs, if we make the api too onerous to use then we risk a lot of copy&paste-ing (e.g. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now). So let's leave extensibility where it's clear that stuff can be extended with no harm (or "no harm if you read the instructions").
                
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Simon Willnauer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144635#comment-13144635 ] 

Simon Willnauer commented on LUCENE-3560:
-----------------------------------------

+1
                
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144715#comment-13144715 ] 

Robert Muir commented on LUCENE-3560:
-------------------------------------

{quote}
e.g. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now
{quote}

But this is a must: you made a new index format, it needs a name to differentiate itself from the existing codec, otherwise it won't work.
                
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144660#comment-13144660 ] 

Robert Muir commented on LUCENE-3560:
-------------------------------------

{quote}
This is prevented by Robert's finalness on the format hooks, but what else could a codec do different if its not final without breaking index format?
{quote}

A codec can take parameters or have a non-final "hook": just like the perfield stuff does... (and it records this stuff into the index, so its fine). This is similar to CharTokenizer :)

{quote}
I think even 3x Codec should be final and not subclassed by the RW codec. The RW Preflex codec in tests should subclass abstract Codec, and simply delegate all "read" methods to the RO-Codec [I am not sure if this all works as its very complicated... *g* - I only mention: new Exception().getStackTrace() to inspect call stack... highly sophisticated!].
{quote}

Yeah, it would be nice to clean up the preflex hack...

                
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144721#comment-13144721 ] 

Robert Muir commented on LUCENE-3560:
-------------------------------------

{quote}
So let's leave extensibility where it's clear that stuff can be extended with no harm (or "no harm if you read the instructions").
{quote}

And thats exactly what this patch does: it only allows extensibility where there is no harm. The problem is, in trunk today, no methods on any codecs are final (and some should be!).

The places where there is no harm: e.g. getPostingsFormatForField, are still left open. This patch doesn't stop you from doing anything you can't already do today.

It only stops code that was already doomed to fail at runtime from even compiling.
                
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144724#comment-13144724 ] 

Uwe Schindler commented on LUCENE-3560:
---------------------------------------

bq. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now

The abstract base class Codec is as stupid simple as Analyzer. There is no logic in it, it just defines the following:
- name of codec (which cannot be changed by subclassing!!!)
- factory methods for the format readers/writers of the different parts of an index (postings, stored fields, segments file,...)

If you want to create a new codec, you have to simply write this wrapper with a new name, otherwise SPI won't work.
                
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3560) add extra safety to concrete codec implementations

Posted by "Uwe Schindler (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144724#comment-13144724 ] 

Uwe Schindler edited comment on LUCENE-3560 at 11/5/11 3:39 PM:
----------------------------------------------------------------

bq. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now

The abstract base class Codec is as stupid simple as Analyzer. There is no logic in it, it just defines the following:
- name of codec (which cannot be changed by subclassing!!!)
- factory methods for the format readers/writers of the different parts of an index (postings, stored fields, segments file,...)

If you want to create a new codec, you have to simply write this wrapper with a new name, otherwise SPI won't work.

The first point in the above list is the real bummer in your "I only want to add one file" approach. If you would subclass the codec, the name cannot change anymore. This name is written to the index format. When you open IndexReader, it reads the name and uses Codec.forName() to lookup the codec. As the name was not changed in your subclass it would then use the superclass to read the index -> #fail
                
      was (Author: thetaphi):
    bq. I'd like to extend an existing codec to add one file to files() - bummer, I have to reimplement the whole codec now

The abstract base class Codec is as stupid simple as Analyzer. There is no logic in it, it just defines the following:
- name of codec (which cannot be changed by subclassing!!!)
- factory methods for the format readers/writers of the different parts of an index (postings, stored fields, segments file,...)

If you want to create a new codec, you have to simply write this wrapper with a new name, otherwise SPI won't work.
                  
> add extra safety to concrete codec implementations
> --------------------------------------------------
>
>                 Key: LUCENE-3560
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3560
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3560.patch
>
>
> In LUCENE-3490, we reorganized the codec model, and a key part of this is that Codecs are "safer"
> and don't rely upon client-side configuration: IndexReader doesn't take Codec or anything of that 
> nature, only IndexWriter.
> Instead for "read" all codecs are initialized from the classpath via a no-arg ctor from Java's 
> Service Provider Mechanism.
> So, although Codecs can still take parameters in the constructors, be subclassable, etc (for passing
> to IndexWriter), this enforces that they must write any configuration information they need into
> the index, so that we don't have a flimsy API.
> I think we should go even further, for additional safety. Any methods on our concrete codecs that
> are not intended to be subclassed should be final, and we should add assertions to verify this.
> For example, SimpleText's files() implementation should be final. If you want to make an extension
> of simpletext that has additional files, then this is a different index format and should have a
> different name!
> Note: This doesn't stop extensibility, only stupid mistakes. 
> For example, this means that Lucene40Codec's postingsFormat() implementation is final, even though 
> it offers a configurable "hook" (getPostingsFormatForField) for you to specify per-field postings 
> formats (which it writes into a .per file into the index, so that it knows how to read each field).
> {code}
> private final PostingsFormat postingsFormat = new PerFieldPostingsFormat() {
>   @Override
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return Lucene40Codec.this.getPostingsFormatForField(field);
>   }
> };
> ...
> @Override
> public final PostingsFormat postingsFormat() {
>   return postingsFormat;
> }
> ...
>   /** Returns the postings format that should be used for writing 
>    *  new segments of <code>field</code>.
>    *  
>    *  The default implementation always returns "Lucene40"
>    */
>   public PostingsFormat getPostingsFormatForField(String field) {
>     return defaultFormat;
>   }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org