You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2013/01/17 00:14:12 UTC

[jira] [Commented] (LUCENE-4690) remove hashing from NumericUtils.*ToPrefixCoded

    [ https://issues.apache.org/jira/browse/LUCENE-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555589#comment-13555589 ] 

Yonik Seeley commented on LUCENE-4690:
--------------------------------------

Current code:
{code}
  public static int longToPrefixCoded(final long val, final int shift, final BytesRef bytes) {
    if (shift>63 || shift<0)
      throw new IllegalArgumentException("Illegal shift value, must be 0..63");
    int hash, nChars = (63-shift)/7 + 1;
    bytes.offset = 0;
    bytes.length = nChars+1;
    if (bytes.bytes.length < bytes.length) {
      bytes.grow(NumericUtils.BUF_SIZE_LONG);
    }
    bytes.bytes[0] = (byte) (hash = (SHIFT_START_LONG + shift));
    long sortableBits = val ^ 0x8000000000000000L;
    sortableBits >>>= shift;
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
      sortableBits >>>= 7;
    }
    // calculate hash
    for (int i = 1; i < bytes.length; i++) {
      hash = 31*hash + bytes.bytes[i];
    }
    return hash;
  }
{code}

Proposed code template (i.e. for all of the *ToPrefixCoded methods):
{code}
  public void longToPrefixCodedBytes(final long val, final int shift, final BytesRef bytes) {
    assert (shift & ~0x3f) == 0;  // ensure shift is 0..63
    int nChars = (63-shift)/7 + 1;
    bytes.offset = 0;
    bytes.length = nChars+1;   // one extra for the byte that contains the shift info
    if (bytes.bytes.length < bytes.length) {
      bytes.bytes = new byte[NumericUtils.BUF_SIZE_LONG];  // use the max
    }
    bytes.bytes[0] = (byte)(SHIFT_START_LONG + shift);
    long sortableBits = val ^ 0x8000000000000000L;
    sortableBits >>>= shift;
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
      sortableBits >>>= 7;
    }
  }
{code}

Some of the changes:
 - Setting bytes.length to be larger than the current contained array temporarily puts BytesRef into an invalid state.  Calling any BytesRef methods (like grow) while it is in that invalid state is suspect.
 - replace grow with simple allocation.
   1. grow over-allocates all the time.  Most of the time (like here) it's wasted space.
   2. grow copies the previous buffer when allocating a bigger buffer.  This is wasted/unneeded here.
 - removes hash code calculation
                
> remove hashing from NumericUtils.*ToPrefixCoded 
> ------------------------------------------------
>
>                 Key: LUCENE-4690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4690
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>
> As far as I can tell nothing actually uses the hash codes generated by these methods (not even any tests).  If someone did want to generate a hash, it would be just as fast to do it on the BytesRef after the fact (or even faster from the input number itself).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org