You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2005/10/30 15:36:56 UTC
[jira] Created: (LUCENE-460) hashCode improvements
hashCode improvements
---------------------
Key: LUCENE-460
URL: http://issues.apache.org/jira/browse/LUCENE-460
Project: Lucene - Java
Type: Improvement
Components: Search
Versions: CVS Nightly - Specify date in submission
Reporter: Yonik Seeley
Priority: Minor
Fix For: CVS Nightly - Specify date in submission
It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-460) hashCode improvements
Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356414 ]
Yonik Seeley commented on LUCENE-460:
-------------------------------------
Paul, I'm not sure I understand your point about "structure information added by the query parser is lost in the hash".
Let me rephrase my statement in case you misunderstood me:
- hash codes should strive to avoid collisions across Query hierarchy, not just within one specific
subclass. An example of hashCode implementations that don't do this are TermQuery(t) and SpanTermQuery(t) which will generate the exact same hash codes.
> What's the point of hash code uniqueness over subclasses?
The same point as hash code uniqueness (avoiding collisions) within a particular class - the more collisions you have, the more it will slow down hash based lookups. Things should still work if all hashCodes mapped to the same number, but it would be dog slow.
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-460) hashCode improvements
Posted by "paul.elschot (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356384 ]
paul.elschot commented on LUCENE-460:
-------------------------------------
> - hash codes should strive to be unique across the Query hierarchy, not just unique within one specific
> subclass. For example, TermQuery(t) and SpanTermQuery(t) will generate the exact same hash codes.
This has the disadvantage that the structure information added by the query parser is lost in the hash.
In case a hash is needed over the query text before parsing, one can use precisely that.
What's the point of hash code uniqueness over subclasses?
Regards,
Paul Elschot
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-460) hashCode improvements
Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12361200 ]
Yonik Seeley commented on LUCENE-460:
-------------------------------------
Some people have asked where some of the magic constants come from in the hashCodes:
> python -c "import random;print hex(random.getrandbits(32))[:-1]"
> Just a way of making some things unique.... let me know if you have a
> better idea on that.
>
> So those constants are just random numbers. I thought about trying to pick magic numbers to maximize hamming distances, etc,
> but it's more work and more likely to mess things up if you get it wrong. getClass().hashCode() would also work, but it would be
> slower.
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-460) hashCode improvements
Posted by "paul.elschot (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356432 ]
paul.elschot commented on LUCENE-460:
-------------------------------------
> For example, TermQuery(t) and SpanTermQuery(t) will generate the exact same hash codes.
I'm sorry, I misread this as an example of what a hash code should be. You meant this
to be an example of what is wrong with current hashcodes...
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-460) hashCode improvements
Posted by "Daniel Naber (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356327 ]
Daniel Naber commented on LUCENE-460:
-------------------------------------
"Currently, field[x TO y] will produce the same hasCode as field[y TO x]... not particularly important for RangeQuery, but you get the idea. "
Actually it is kind of important, because while [a TO z] will give hits [z TO a] won't.
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Closed: (LUCENE-460) hashCode improvements
Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=all ]
Yonik Seeley closed LUCENE-460:
-------------------------------
Resolution: Fixed
Assign To: Yonik Seeley
closing... I think I got most of these.
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Assignee: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-460) hashCode improvements
Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356325 ]
Yonik Seeley commented on LUCENE-460:
-------------------------------------
A couple of guidelines off the top of my head...
- hash codes should strive to be unique across the Query hierarchy, not just unique within one specific subclass. For example, TermQuery(t) and SpanTermQuery(t) will generate the exact same hash codes.
- mix bits between different components that have any hashCode parts in common...
for example RangeQuery will produce the same hashCode whenever lowerTerm==upperTerm.
Also, field[x TO y] will produce the same hashCode for *any* field since the fieldname parts of the
terms will always cancel eachother out. This will also cause the hashCode of field{x TO x} to equal field:x
The hashCode of FilteredQuery will also cause many collisions because the bits aren't mixed inbetween
the query and the filter.
Remember that every query as a boost component... never just xor two query hashCodes together.
- make things position dependent.
Currently, field[x TO y] will produce the same hasCode as field[y TO x]... not particularly important for RangeQuery, but
you get the idea.
- don't be afraid of using "+" instead of "^". They both take a single CPU cycle, but "+" is not quite so easily (accidentally) reversed.
- flipping more than a single bit when hashing a boolean might be a good idea - it will make collisions harder.
http://www.concentric.net/~Ttwang/tech/inthash.htm is an interesting link on integer hash codes (what we are in effect doing when we combine multiple hash codes). Esp interesting is the section "Parallel Operations"
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Commented: (LUCENE-460) hashCode improvements
Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-460?page=comments#action_12356326 ]
Yonik Seeley commented on LUCENE-460:
-------------------------------------
Oh, and preserve entropy by using reversible integer hash functions (see the previous link).
- key ^= (key << a) | (key >>> b); for a,b in (17,16) (16,17) (14,19) (19,14) (13,20) (20,13) (10,23) (23,10) (8,25) (25,8)
- multiply by an odd
- addition
- xor
- rotates
> hashCode improvements
> ---------------------
>
> Key: LUCENE-460
> URL: http://issues.apache.org/jira/browse/LUCENE-460
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: CVS Nightly - Specify date in submission
> Reporter: Yonik Seeley
> Priority: Minor
> Fix For: CVS Nightly - Specify date in submission
>
> It would be nice for all Query classes to implement hashCode and equals to enable them to be used as keys when caching.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org