You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/03/23 19:08:35 UTC

[GitHub] [lucene] uschindler opened a new pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

uschindler opened a new pull request #38:
URL: https://github.com/apache/lucene/pull/38


   This fixes the remaining change in #37, see comment https://github.com/apache/lucene/pull/37#issuecomment-805119396
   
   > Hi @rmuir,
   > I will dig into the TermToBytesRefAttribute issue, but I know why it fails. The reason is that TermToBytesRefAttribute is a superinterface of the actual CharTermAttribute, BytesRefTermAttribute,...
   > 
   > When the indexer calls getAttribute(TermToBytesRefAttribute.class) it gets back null, because the attributeimpl is not aware of it (not declared).
   > 
   > In general, all AttributeImpls should explicitely list all interfaces they implement (also for documentation purposes).
   > 
   > P.S.: The issue we have seen here is similar to a private method that is referenced by a MethodHandle (which is perfectly valid), but "unused" checker cannot see the method is ever called. Therefore we always declare those methods as package-private, although private would be more correct. I remmeber this from Elastic Painless and Expressions :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805773997


   To finalize the discussion here: Should we maybe merge this PR and leave the other fixes to a specific discussion issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805355226


   Thanks Robert for review. I added a few tests to verify that all TokenStreams have at least a TermToBytesRefAttribute, implemented by correct class. For standard UTF-16 streams using CharTermAttribute it must be same implementation class like the CharTermAttribute. For binary streams I added asserts and tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805180200


   In short: the PR fixes the issue but the design is still broken since Lucene 4.0 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
rmuir commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805351590


   IMO it is arguably strange that AttributeSource "crawls" up the class hierarchy but not the interfaces too? https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/AttributeSource.java#L154-L164
   
   It seems this is the root cause of the unexpected behavior. I'm not suggesting we make it recursive, but maybe some javadocs is a good idea. Because to the programmer or tool, it is unintuitive that superinterfaces would need to be declared directly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-806011084


   I added some Javadocs.
   
   @rmuir and I discussed also if we should require a TermToBytesRefAttribute be available on a tokenstream before sending to IndexWriter. I implemnted this, but EmptyTokenStream fails, so there's room for discussion. I will open new issues later.
   
   For now I will merge this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
dweiss commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805602084


   bq. IMO it is arguably strange that AttributeSource "crawls" up the class hierarchy but not the interfaces too?
   
   It is strange and counterintuitive to me. I think all interfaces can and should be collected consistently (this means inherited interfaces too).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler edited a comment on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler edited a comment on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805693830


   >> IMO it is arguably strange that AttributeSource "crawls" up the class hierarchy but not the interfaces too?
   > 
   > It is strange and counterintuitive to me. I think all interfaces can and should be collected consistently (this means inherited interfaces too).
   
   We could change the ClassValue initializer as pointed out by Robert, yes. As it is cached, there is no problem with performance! Implementation is straight forward.
   
   But nevertheless, the additional checks I did in my tests are going far beyond that: The problem where this issue here comes from was explained in my first comment. When Michael Bush invented the attributes in Lucene 2.9, the idea and concept was quite clear: Every attribute should provide a way to store a property on the tokenstream. They should not interfere with each other. So originally the interface was defined to be just a POJO/bean with getters/setters to change the value. This concept allowed to combine multiple attributes in the same implementation class (what we do with our default implementation). If you change on attribute, another one should not change. So sharing functionality and methods between interfaces were not wanted. If Java would have allowed "final" on interface, we would have placed it on our attribute interfaces.
   
   With Lucene 4.0 we got a problem: The term can be represented as a byte[] or as a char[]. The indexer only wants bytes[]. But classical tokenstreams use UTF-16 characters. Also NumericTermAttribute was the first one who provided byte[] as a term. We now had a problem:
   - Indexer only wants byte[]
   - Tokenstreams want provide char[], sometimes (numerics) byte[]. But users don't want to deal with bytes and we did want to convert as seldom as possible
   
   And now the design did not fit that: How to create an interface that allows for this? And here the misconception started (it's partially my fault): We splitted the term attribute into actually three: TermToBytesRefAttribute with only a getter, CharTermAttribute implementing CharSequence and Appendable for easy usage, BytesTermAttribute holding a BytesRef. Problem was now that we did not want to copy char->byte all the time when somebody modifies term. So TermToBytesRefAttribute was a hack to be implemented by implementation class to have a "view" on the term (chars or bytes) from the viewpoint of indexer.
   
   This now brought problems that I am trying to verify with my tests added:
   - If you have a CharTermAttribute there must always be a TermToBytesRefAttribute available, and because they rely on each other: they must be implemented by same class (otherwise they wont be able to share data). Unfortunately theres a hacky special case in this ConcatGraphTokenFilter (again to minimize UTF-8 => UTF-16 copying).
   - the same for BytesTermAttribute
   
   There is nothing that enforces it! E.g., if you add a CharTermAttribute to a TokenStream, you implicitely add the TermToBytesRefattribute. If you then add a BytesTermAttribute, too, you're fcked up: You can set byte[] but they won't never go into index. This is why I added the tests. A single token stream with all filters on top should never have both attributes.
   
   The problem with the non-explicit interfaces is just one side-efect of the misconception described before!
   
   For documentation purposes I would really, really like to keep the explicit declaration of interfaces on attribute classes, so user knows which attributes are there! Javadocs does not show super interfaces. So I'd like to enforce it!
   
   I would add a impl note to AttributeSource to clarify that interfaces should be explicitely declared and it should generally be avoided to extend attributes from each other. We should maybe add a runtime check that getSuperClass()on every attribute interface is Attribute.class.
   
   **We can also fix this for BytesRefAttribute: Remove the extends and just redeclare the duplicate method. By that we also ensure that you cannot wrongly combine the interfaces with others (conflicting method declarations).**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805169097


   The problem here is specific to BytesTermAttribute. In contrast CharTermAttribute does not have the issue:
   
   - TermToBytesRefAttribute defines getBytes()
   - Because BytesTermAttribute add setBytes() it also has getBytes(). So we decided to let it extend TermToBytesRefAttribute. 
   
   This is IMHO wrong. Attribute interfaces should not extend each other (separation of responsibilities). Both shoud declare getBytes() with same signature. Then compiler won't complain.
   
   P.S.: For the AppendGraphTokenFilter this was just copypasted. Here not even a test fails, as this is never tested with IndexWriter! We should add an indexing test!!!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler merged pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler merged pull request #38:
URL: https://github.com/apache/lucene/pull/38


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler commented on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805693830


   >> IMO it is arguably strange that AttributeSource "crawls" up the class hierarchy but not the interfaces too?
   > 
   > It is strange and counterintuitive to me. I think all interfaces can and should be collected consistently (this means inherited interfaces too).
   
   We could change the ClassValue initializer as pointed out by Robert, yes. As it is cached, there is no problem with performance! Implementation is straight forward.
   
   But nevertheless, the additional checks I did in my tests are going far beyond that: The problem where this issue here comes from was explained in my first comment. When Michael Bush invented the attributes in Lucene 2.9, the idea and concept was quite clear: Every attribute should provide a way to store a property on the tokenstream. They should not interfere with each other. So originally the interface was defined to be just a POJO/bean with getters/setters to change the value. This concept allowed to combine multiple attributes in the same implementation class (what we do with our default implementation). If you change on attribute, another one should not change. So sharing functionality and methods between interfaces were not wanted. If Java would have allowed "final" on interface, we would have placed it on our attribute interfaces.
   
   With Lucene 4.0 we got a problem: The term can be represented as a byte[] or as a char[]. The indexer only wants bytes[]. But classical tokenstreams use UTF-16 characters. Also NumericTermAttribute was the first one who provided byte[] as a term. We now had a problem:
   - Indexer only wants byte[]
   - Tokenstreams want provide char[], sometimes (numerics) byte[]. But users don't want to deal with bytes and we did want to convert as seldom as possible
   
   And now the design did not fit that: How to create an interface that allows for this: And here the misconception started (it's partially my fault): We split the term attribute into actually 3: TermToBytesRefAttribute with only a getter, CharTermAttribute implementing CharSequence and Appendable for easy usage, BytesTermAttribute holding a BytesRef. Problem was now that we did not want to copy char->byte all the time when somebody modifies term. So TermToBytesRefAttribute was a hack to be implemented by implementation class to have a "view" on the term (chars or bytes) from the viewpoint of indexer.
   
   This now brough problems that I am trying to verify with my tests added:
   - If you have a CharTermAttribute there must always be a TermToBytesRefAttribute available, and because they rely on each other: they must be implemented by same class (otherwise they wont be able to share data).
   - the same for BytesTermAttribute
   
   There is nothing that enforces it. E.g., if you add a CharTermAttribute to a TokenStream, you implicitely add the TermToBytesRefattribute. If you then add a BytesTermAttribute, too you're fcked up: You can set byte[] but they won't never go into index. This is why I added the tests.
   
   The problem with the non-explicit interfaces is just one side-efect of the misconception.
   
   For documentation purposes I would really, really like to keep the explicit declaration of interfaces on attribute classes, so user knows which attributes are there! Javadocs does not show super interfaces. So I'd like to enforce.
   
   I would add a impl note to AttributeSource to clarify that interfaces should be explicitely declared and it should generally be avoided to extend attributes from each other.
   
   **We can also fix this for BytesRefAttribute: Remove the extends and just redeclare the duplicate method. By that we also ensure that you cannot wrongly combine the interfaces with others (conflicting method declarations).**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler edited a comment on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

Posted by GitBox <gi...@apache.org>.
uschindler edited a comment on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805693830


   >> IMO it is arguably strange that AttributeSource "crawls" up the class hierarchy but not the interfaces too?
   > 
   > It is strange and counterintuitive to me. I think all interfaces can and should be collected consistently (this means inherited interfaces too).
   
   We could change the ClassValue initializer as pointed out by Robert, yes. As it is cached, there is no problem with performance! Implementation is straight forward.
   
   But nevertheless, the additional checks I did in my tests are going far beyond that: The problem where this issue here comes from was explained in my first comment. When Michael Bush invented the attributes in Lucene 2.9, the idea and concept was quite clear: Every attribute should provide a way to store a property on the tokenstream. They should not interfere with each other. So originally the interface was defined to be just a POJO/bean with getters/setters to change the value. This concept allowed to combine multiple attributes in the same implementation class (what we do with our default implementation). If you change on attribute, another one should not change. So sharing functionality and methods between interfaces were not wanted. If Java would have allowed "final" on interface, we would have placed it on our attribute interfaces.
   
   With Lucene 4.0 we got a problem: The term can be represented as a byte[] or as a char[]. The indexer only wants bytes[]. But classical tokenstreams use UTF-16 characters. Also NumericTermAttribute was the first one who provided byte[] as a term. We now had a problem:
   - Indexer only wants byte[]
   - Tokenstreams want provide char[], sometimes (numerics) byte[]. But users don't want to deal with bytes and we did want to convert as seldom as possible
   
   And now the design did not fit that: How to create an interface that allows for this? And here the misconception started (it's partially my fault): We splitted the term attribute into actually three: TermToBytesRefAttribute with only a getter, CharTermAttribute implementing CharSequence and Appendable for easy usage, BytesTermAttribute holding a BytesRef. Problem was now that we did not want to copy char->byte all the time when somebody modifies term. So TermToBytesRefAttribute was a hack to be implemented by implementation class to have a "view" on the term (chars or bytes) from the viewpoint of indexer.
   
   This now brought problems that I am trying to verify with my tests added:
   - If you have a CharTermAttribute there must always be a TermToBytesRefAttribute available, and because they rely on each other: they must be implemented by same class (otherwise they wont be able to share data). Unfortunately theres a hacky special case in this ConcatGraphTokenFilter (again to minimize UTF-8 => UTF-16 copying).
   - the same for BytesTermAttribute
   
   There is nothing that enforces it! E.g., if you add a CharTermAttribute to a TokenStream, you implicitely add the TermToBytesRefattribute. If you then add a BytesTermAttribute, too, you're fcked up: You can set byte[] but they won't never go into index. This is why I added the tests. A single token stream with all filters on top should never have both attributes.
   
   The problem with the non-explicit interfaces is just one side-efect of the misconception described before!
   
   For documentation purposes I would really, really like to keep the explicit declaration of interfaces on attribute classes, so user knows which attributes are there! Javadocs does not show super interfaces. So I'd like to enforce it!
   
   I would add a impl note to AttributeSource to clarify that interfaces should be explicitely declared and it should generally be avoided to extend attributes from each other.
   
   **We can also fix this for BytesRefAttribute: Remove the extends and just redeclare the duplicate method. By that we also ensure that you cannot wrongly combine the interfaces with others (conflicting method declarations).**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org