You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by "Uwe Schindler (Jira)" <ji...@apache.org> on 2022/06/10 11:34:00 UTC

[jira] [Commented] (LUCENE-10610) RunAutomaton#hashCode() can easily cause hash collision for different Automatons

    [ https://issues.apache.org/jira/browse/LUCENE-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552702#comment-17552702 ] 

Uwe Schindler commented on LUCENE-10610:
----------------------------------------

The hashCode does not need to be unique. If 2 have same hash code the equals relation must be used.

But in general the hashcode should also contain the points, so yet it is a kind of bug - but only makes lookups ineffective. The problem is that it is expensive to create that from huge FSAs. Normally you should have a separate private field in the class marked "privat transient int hashCode" and you lazily cache the calculated hashcode there (works if object is unmodifiable). The Java String class has this to not recreate the hashCode on every call. The method does not need to be synchronized (although can concurrently updated), because the hash code is atomic size (32 bits) and it does not matter if 2 threads recreate the hash code at same time.

> RunAutomaton#hashCode() can easily cause hash collision for different Automatons
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10610
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10610
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Tomoko Uchida
>            Priority: Minor
>
> Current RunAutomaton#hashCode() is:
> {code:java}
>   @Override
>   public int hashCode() {
>     final int prime = 31;
>     int result = 1;
>     result = prime * result + alphabetSize;
>     result = prime * result + points.length;
>     result = prime * result + size;
>     return result;
>   }
> {code}
> Since it does not take account of the contents of the {{points}} array, this returns the same value for different automatons when their alphabet size and state size are the same.
> For example, this test code passes.
> {code:java}
>   public void testHashCode() throws IOException {
>     PrefixQuery q1 = new PrefixQuery(new Term("field", "aba"));
>     PrefixQuery q2 = new PrefixQuery(new Term("field", "fee"));
>     assert q1.compiled.runAutomaton.hashCode() == q2.compiled.runAutomaton.hashCode();
>   }
> {code}
> I suspect this is a bug?
> Note that I think it's not a serious one; all callers of this {{hashCode()}} take account of additional information when calculating their own hash value, it seems there is no substantial impact on higher-level APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org