You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/03/24 10:35:24 UTC

[GitHub] [lucene] uschindler edited a comment on pull request #38: LUCENE-9856: Static analysis take 3: Remove redundant interfaces

uschindler edited a comment on pull request #38:
URL: https://github.com/apache/lucene/pull/38#issuecomment-805693830


   >> IMO it is arguably strange that AttributeSource "crawls" up the class hierarchy but not the interfaces too?
   > 
   > It is strange and counterintuitive to me. I think all interfaces can and should be collected consistently (this means inherited interfaces too).
   
   We could change the ClassValue initializer as pointed out by Robert, yes. As it is cached, there is no problem with performance! Implementation is straight forward.
   
   But nevertheless, the additional checks I did in my tests are going far beyond that: The problem where this issue here comes from was explained in my first comment. When Michael Bush invented the attributes in Lucene 2.9, the idea and concept was quite clear: Every attribute should provide a way to store a property on the tokenstream. They should not interfere with each other. So originally the interface was defined to be just a POJO/bean with getters/setters to change the value. This concept allowed to combine multiple attributes in the same implementation class (what we do with our default implementation). If you change on attribute, another one should not change. So sharing functionality and methods between interfaces were not wanted. If Java would have allowed "final" on interface, we would have placed it on our attribute interfaces.
   
   With Lucene 4.0 we got a problem: The term can be represented as a byte[] or as a char[]. The indexer only wants bytes[]. But classical tokenstreams use UTF-16 characters. Also NumericTermAttribute was the first one who provided byte[] as a term. We now had a problem:
   - Indexer only wants byte[]
   - Tokenstreams want provide char[], sometimes (numerics) byte[]. But users don't want to deal with bytes and we did want to convert as seldom as possible
   
   And now the design did not fit that: How to create an interface that allows for this? And here the misconception started (it's partially my fault): We splitted the term attribute into actually three: TermToBytesRefAttribute with only a getter, CharTermAttribute implementing CharSequence and Appendable for easy usage, BytesTermAttribute holding a BytesRef. Problem was now that we did not want to copy char->byte all the time when somebody modifies term. So TermToBytesRefAttribute was a hack to be implemented by implementation class to have a "view" on the term (chars or bytes) from the viewpoint of indexer.
   
   This now brought problems that I am trying to verify with my tests added:
   - If you have a CharTermAttribute there must always be a TermToBytesRefAttribute available, and because they rely on each other: they must be implemented by same class (otherwise they wont be able to share data). Unfortunately theres a hacky special case in this ConcatGraphTokenFilter (again to minimize UTF-8 => UTF-16 copying).
   - the same for BytesTermAttribute
   
   There is nothing that enforces it! E.g., if you add a CharTermAttribute to a TokenStream, you implicitely add the TermToBytesRefattribute. If you then add a BytesTermAttribute, too, you're fcked up: You can set byte[] but they won't never go into index. This is why I added the tests. A single token stream with all filters on top should never have both attributes.
   
   The problem with the non-explicit interfaces is just one side-efect of the misconception described before!
   
   For documentation purposes I would really, really like to keep the explicit declaration of interfaces on attribute classes, so user knows which attributes are there! Javadocs does not show super interfaces. So I'd like to enforce it!
   
   I would add a impl note to AttributeSource to clarify that interfaces should be explicitely declared and it should generally be avoided to extend attributes from each other.
   
   **We can also fix this for BytesRefAttribute: Remove the extends and just redeclare the duplicate method. By that we also ensure that you cannot wrongly combine the interfaces with others (conflicting method declarations).**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org