You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2019/09/06 16:34:55 UTC

[GitHub] [accumulo] keith-turner commented on a change in pull request #1344: Fix #1043 Support stable ~del split points

keith-turner commented on a change in pull request #1344: Fix #1043 Support stable ~del split points
URL: https://github.com/apache/accumulo/pull/1344#discussion_r321757308
 
 

 ##########
 File path: core/src/main/java/org/apache/accumulo/core/metadata/schema/SortSkew.java
 ##########
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.accumulo.core.metadata.schema;
+
+/*
+ * A subprefix used to remove sort skew from some of the metadata generated entries, for example: file deletes
+ * prefixed with ~del.
+ */
+public class SortSkew {
+
+  // A specified length for the skew code used is necessary to parse the key correctly.
+  // The Hex code for an integer will always be <= 8
+  private static final int SORTSKEW_LENGTH = Integer.BYTES * 2;
+
+  // Create a left justified hex string for the path hashcode of a deterministic length.
+  // If necessary it is right padded with blanks
+  public static String getCode(String keypart) {
+    return String.format("%-" + SORTSKEW_LENGTH + "." + SORTSKEW_LENGTH + "h", keypart.hashCode());
 
 Review comment:
   In general the `hashCode()` method is not suitable for persistent data because there is no guarantee of consistency between processes.  The following is from the javadoc of `Object.hashCode()`. 
   
   ```
        * <li>Whenever it is invoked on the same object more than once during
        *     an execution of a Java application, the {@code hashCode} method
        *     must consistently return the same integer, provided no information
        *     used in {@code equals} comparisons on the object is modified.
        *     This integer need not remain consistent from one execution of an
        *     application to another execution of the same application.
   ```
    I think using [murmur32 is](https://en.wikipedia.org/wiki/MurmurHash) a good alternative because its very fast and suitable for persistent data.  Guava offers an easy to use murmur32 impl.
   
   During past performance testing I have found String format to be extremely slow.  Guava `Strings.padStart` offers much faster alternative for the simple case of padding zeros.  Below is untested example code that does the hashing and padding using Guava.
   
   ```java
       int hashCode = Hashing.murmur3_32().hashString(keypart, UTF_8).asInt();
       return Strings.padStart(Integer.toHexString(hashCode), SORTSKEW_LENGTH, '0');
   ```
   
   Also it would be nice to add some docs stating that the hashcodes are intended to be persisted and therefore any changes must consider previously persisted data.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services