You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2022/09/14 03:10:14 UTC

[GitHub] [hbase] binlijin opened a new pull request, #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

binlijin opened a new pull request, #4782:
URL: https://github.com/apache/hbase/pull/4782

   A simple implementation which only consider cell row and not cell qualifier. 
   
   00c7-202206201519-wx0t
   00c7-202206201519-wx0zcldi7lnsiyas-N
   00c7-202206201520-wx0re
   00c7-202206201520-wx0ulgrwi7d542tm-N
   00c7-202206201520-wx0x7
   00c7-202206201521
   00c7-202206201521-wx05xfbtw2mopyhs-C
   00c7-202206201521-wx08
   00c7-202206201521-wx0c
   00c7-202206201521-wx0go
   00c7-202206201522-wx0t
   00c8-202206200751-wx0ah4gnbwptdyna-F
   
   The prefix tree node is like this:
   <img width="818" alt="image" src="https://user-images.githubusercontent.com/1531353/190049685-ffd86881-2f32-4502-b793-414f1a9180db.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1261743710

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m  8s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  No case conflicting files found.  |
   | +1 :green_heart: |  hbaseanti  |   0m  0s |  Patch does not have any anti-patterns.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any @author tags.  |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 12s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 25s |  master passed  |
   | +1 :green_heart: |  compile  |   2m 49s |  master passed  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  master passed  |
   | +1 :green_heart: |  spotless  |   0m 41s |  branch has no errors when running spotless:check.  |
   | +1 :green_heart: |  spotbugs  |   1m 49s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 11s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 11s |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 47s |  the patch passed  |
   | -0 :warning: |  javac  |   0m 32s |  hbase-common generated 6 new + 1 unchanged - 1 fixed = 7 total (was 2)  |
   | -0 :warning: |  checkstyle  |   0m 16s |  hbase-common: The patch generated 38 new + 0 unchanged - 0 fixed = 38 total (was 0)  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace issues.  |
   | +1 :green_heart: |  hadoopcheck  |   8m  7s |  Patch does not cause any errors with Hadoop 3.2.4 3.3.4.  |
   | +1 :green_heart: |  spotless  |   0m 41s |  patch has no errors when running spotless:check.  |
   | -1 :x: |  spotbugs  |   0m 37s |  hbase-common generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0)  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 19s |  The patch does not generate ASF License warnings.  |
   |  |   |  32m 10s |   |
   
   
   | Reason | Tests |
   |-------:|:------|
   | FindBugs | module:hbase-common |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 221] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 230] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 232] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 564] |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-general-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile |
   | uname | Linux 6a88e50216b1 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / 23a56331db |
   | Default Java | Temurin-1.8.0_345-b01 |
   | javac | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-general-check/output/diff-compile-javac-hbase-common.txt |
   | checkstyle | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-general-check/output/diff-checkstyle-hbase-common.txt |
   | spotbugs | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-general-check/output/new-spotbugs-hbase-common.html |
   | Max. process+thread count | 64 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/console |
   | versions | git=2.17.1 maven=3.6.3 spotbugs=4.7.2 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1246345957

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m 13s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  3s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m  8s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 17s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 56s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   3m 51s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 13s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 14s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 56s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 56s |  the patch passed  |
   | -1 :x: |  shadedjars  |   0m 49s |  patch has 10 errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 42s |  the patch passed  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 58s |  hbase-common in the patch passed.  |
   | -1 :x: |  unit  | 212m 53s |  hbase-server in the patch failed.  |
   |  |   | 230m 50s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux 556a890704a0 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / b3a04899a4 |
   | Default Java | AdoptOpenJDK-1.8.0_282-b08 |
   | shadedjars | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk8-hadoop3-check/output/patch-shadedjars.txt |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-server.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/testReport/ |
   | Max. process+thread count | 3236 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/console |
   | versions | git=2.17.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache9 commented on a diff in pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache9 commented on code in PR #4782:
URL: https://github.com/apache/hbase/pull/4782#discussion_r973735088


##########
hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixTreeUtil.java:
##########
@@ -0,0 +1,593 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.encoding;
+
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.ByteArrayOutputStream;
+import org.apache.hadoop.hbase.io.util.UFIntTool;
+import org.apache.hadoop.hbase.nio.ByteBuff;
+import org.apache.hadoop.hbase.util.ByteBufferUtils;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+@InterfaceAudience.Private
+public class PrefixTreeUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(PrefixTreeUtil.class);
+
+  /**
+   * Build tree from begin
+   * @return the tree
+   */
+  public static TokenizerNode buildPrefixTree(List<byte[]> rowKeys) {
+    // root node.
+    TokenizerNode node = new TokenizerNode();
+    int start = 0;
+    // Get max common prefix
+    int common = maxCommonPrefix(rowKeys, 0, rowKeys.size() - 1, 0);
+    if (common > 0) {
+      byte[] commonB = Bytes.copy(rowKeys.get(0), 0, common);
+      node.nodeData = commonB;
+      for (int i = 0; i < rowKeys.size(); i++) {
+        if (rowKeys.get(i).length == common) {
+          node.numOccurrences++;
+          if (node.index == null) {
+            node.index = new ArrayList<>(1);
+          }
+          node.index.add(i);
+          start = i + 1;
+        } else {
+          break;
+        }
+      }
+    } else {
+      // Only root node data can be empty.
+      node.nodeData = new byte[0];
+    }
+    constructAndSplitChild(node, rowKeys, start, rowKeys.size() - 1, common);
+    return node;
+  }
+
+  /**
+   * Calculate max common prefix
+   * @return the max common prefix num bytes
+   */
+  static int maxCommonPrefix(List<byte[]> rowKeys, int start, int end, int startPos) {
+    // only one entry.
+    if (start == end) {
+      return rowKeys.get(start).length - startPos;
+    }
+    int common = 0;
+    for (int round = 0; round <= rowKeys.get(start).length - startPos - 1; round++) {
+      boolean same = true;
+      for (int i = start + 1; i <= end; i++) {
+        if (startPos + common > rowKeys.get(i).length - 1) {
+          same = false;
+          break;
+        }
+        if (rowKeys.get(start)[startPos + common] != rowKeys.get(i)[startPos + common]) {
+          same = false;
+          break;
+        }
+      }
+      if (same) {
+        common++;
+      } else {
+        break;
+      }
+    }
+    return common;
+  }
+
+  /**
+   * No common prefix split it.
+   */
+  static void constructAndSplitChild(TokenizerNode node, List<byte[]> rowKeys, int start, int end,
+    int startPos) {
+    int middle = start;
+    for (int i = start + 1; i <= end; i++) {
+      if (startPos > rowKeys.get(i).length - 1) {
+        middle = i - 1;
+        break;
+      }
+      if (rowKeys.get(start)[startPos] != rowKeys.get(i)[startPos]) {
+        middle = i - 1;
+        break;
+      }
+    }
+    constructCommonNodeAndChild(node, rowKeys, start, middle, startPos);
+    if (middle + 1 <= end) {
+      // right
+      constructCommonNodeAndChild(node, rowKeys, middle + 1, end, startPos);
+    }
+  }
+
+  /**
+   * Get max common prefix as node and build children.
+   */
+  static TokenizerNode constructCommonNodeAndChild(TokenizerNode node, List<byte[]> rowKeys,
+    int start, int end, int startPos) {
+    int common = maxCommonPrefix(rowKeys, start, end, startPos);
+    if (common > 0) {
+      TokenizerNode child = new TokenizerNode();
+      child.parent = node;
+      node.children.add(child);
+      byte[] commonB = Bytes.copy(rowKeys.get(start), startPos, common);
+      child.nodeData = commonB;
+      int newStart = start;
+      for (int i = start; i <= end; i++) {
+        if (rowKeys.get(i).length == (startPos + common)) {
+          child.numOccurrences++;
+          if (child.index == null) {
+            child.index = new ArrayList<>(1);
+          }
+          child.index.add(i);
+          newStart = i + 1;
+        } else {
+          break;
+        }
+      }
+      if (start != end && newStart <= end) {
+        if (newStart == start) {
+          // no common prefix.
+          constructAndSplitChild(child, rowKeys, newStart, end, startPos + common);
+        } else {
+          // can have common prefix.
+          constructCommonNodeAndChild(child, rowKeys, newStart, end, startPos + common);
+        }
+      }
+    } else {
+      // no common prefix, split
+      constructAndSplitChild(node, rowKeys, start, end, startPos);
+    }
+    return node;
+  }
+
+  public static void getNodeMetaInfo(TokenizerNode node, TokenizerNodeMeta meta) {
+    if (node.nodeData.length > meta.maxNodeDataLength) {
+      meta.maxNodeDataLength = node.nodeData.length;
+    }
+    meta.totalNodeDataLength += node.nodeData.length;
+    meta.countNodeDataNum++;
+
+    if (node.children.size() > meta.maxChildNum) {
+      meta.maxChildNum = node.children.size();
+    }
+    meta.totalChildNum += node.children.size();
+    meta.countChildNum++;
+
+    if (node.numOccurrences > meta.maxNumOccurrences) {
+      meta.maxNumOccurrences = node.numOccurrences;
+    }
+    meta.countNumOccurrences++;
+    if (node.index != null) {
+      for (Integer entry : node.index) {
+        if (entry > meta.maxIndex) {
+          meta.maxIndex = entry;
+        }
+      }
+    }
+    if (node.children.isEmpty()) {
+      meta.leafNodes.add(node);
+      meta.countIndexNum++;
+    } else {
+      meta.nonLeafNodes.add(node);
+    }
+    for (TokenizerNode child : node.children) {
+      getNodeMetaInfo(child, meta);
+    }
+  }
+
+  public static void serializePrefixTree(TokenizerNode node, PrefixTreeDataWidth dataWidth,
+    ByteArrayOutputStream outputStream) throws IOException {
+    TokenizerNodeMeta meta = new TokenizerNodeMeta();
+    PrefixTreeUtil.getNodeMetaInfo(node, meta);
+    int totalLength = 0;
+    dataWidth.nodeDataLengthWidth = UFIntTool.numBytes(meta.maxNodeDataLength);
+    totalLength += meta.totalNodeDataLength;
+    totalLength += dataWidth.nodeDataLengthWidth * meta.countNodeDataNum;
+
+    dataWidth.fanOutWidth = UFIntTool.numBytes(meta.maxChildNum);
+    // fan Out
+    totalLength += dataWidth.fanOutWidth * meta.countChildNum;
+    // fan Byte
+    totalLength += meta.totalChildNum;
+
+    // nextnodeoffset
+    totalLength += 4 * meta.countChildNum;
+
+    dataWidth.occurrencesWidth = UFIntTool.numBytes(meta.maxNumOccurrences);
+    totalLength += dataWidth.occurrencesWidth * meta.countNumOccurrences;
+
+    dataWidth.indexWidth = UFIntTool.numBytes(meta.maxIndex);
+    totalLength += dataWidth.indexWidth * meta.countIndexNum;
+
+    dataWidth.childNodeOffsetWidth = UFIntTool.numBytes(totalLength);
+
+    // track the starting position of each node in final output
+    int negativeIndex = 0;
+    for (int i = meta.leafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode leaf = meta.leafNodes.get(i);
+      // no children
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + leaf.nodeData.length + dataWidth.fanOutWidth
+          + dataWidth.occurrencesWidth + leaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      leaf.nodeWidth = leafNodeWidth;
+      leaf.negativeIndex = negativeIndex;
+    }
+    for (int i = meta.nonLeafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode nonLeaf = meta.nonLeafNodes.get(i);
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + nonLeaf.nodeData.length + dataWidth.fanOutWidth
+          + nonLeaf.children.size() + nonLeaf.children.size() * dataWidth.childNodeOffsetWidth
+          + dataWidth.occurrencesWidth + nonLeaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      nonLeaf.nodeWidth = leafNodeWidth;
+      nonLeaf.negativeIndex = negativeIndex;
+    }
+
+    for (int i = 0; i < meta.nonLeafNodes.size(); i++) {
+      serialize(meta.nonLeafNodes.get(i), outputStream, dataWidth);
+    }
+    for (int i = 0; i < meta.leafNodes.size(); i++) {
+      serialize(meta.leafNodes.get(i), outputStream, dataWidth);
+    }
+  }
+
+  static void serialize(TokenizerNode node, ByteArrayOutputStream os, PrefixTreeDataWidth dataWidth)
+    throws IOException {
+    UFIntTool.writeBytes(dataWidth.nodeDataLengthWidth, node.nodeData.length, os);
+    os.write(node.nodeData, 0, node.nodeData.length);
+    UFIntTool.writeBytes(dataWidth.fanOutWidth, node.children.size(), os);
+    for (TokenizerNode child : node.children) {
+      // child's first byte.
+      os.write(child.nodeData[0]);
+    }
+    for (TokenizerNode child : node.children) {
+      UFIntTool.writeBytes(dataWidth.childNodeOffsetWidth, node.negativeIndex - child.negativeIndex,
+        os);
+    }
+    UFIntTool.writeBytes(dataWidth.occurrencesWidth, node.numOccurrences, os);
+    for (int i = 0; i < node.numOccurrences; i++) {
+      UFIntTool.writeBytes(dataWidth.indexWidth, node.index.get(i), os);
+    }
+  }
+
+  public static void serialize(DataOutput out, PrefixTreeDataWidth dataWidth) throws IOException {
+    out.writeByte(dataWidth.nodeDataLengthWidth);
+    out.writeByte(dataWidth.fanOutWidth);
+    out.writeByte(dataWidth.occurrencesWidth);
+    out.writeByte(dataWidth.indexWidth);
+    out.writeByte(dataWidth.childNodeOffsetWidth);
+  }
+
+  public static void deserialize(ByteBuff data, PrefixTreeDataWidth dataWidth) {
+    dataWidth.nodeDataLengthWidth = data.get();
+    dataWidth.fanOutWidth = data.get();
+    dataWidth.occurrencesWidth = data.get();
+    dataWidth.indexWidth = data.get();
+    dataWidth.childNodeOffsetWidth = data.get();
+  }
+
+  /**
+   * Get the node index, that search key >= index and search key < (index + 1)
+   */
+  public static int search(ByteBuffer data, int bbStartPos, byte[] skey, int keyStartPos,
+    PrefixTreeDataWidth meta) {
+    int nodeDataLength = getNodeDataLength(data, bbStartPos, meta);
+    int cs = ByteBufferUtils.compareTo(skey, keyStartPos,
+      Math.min(skey.length - keyStartPos, nodeDataLength), data,
+      bbStartPos + meta.nodeDataLengthWidth, nodeDataLength);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+
+    if (cs == 0) {
+      // continue search
+      if (fanOut == 0) {
+        // no children, should be numOccurrences > 0
+        int index = getNodeIndex(data, pos, 0, meta);
+        if (skey.length == keyStartPos + nodeDataLength) {
+          // == current node
+          return index;
+        } else {
+          // > current node.
+          return index;
+        }
+      }
+      if (skey.length > keyStartPos + nodeDataLength) {
+        int fanOffset = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength + meta.fanOutWidth;
+        byte searchForByte = skey[keyStartPos + nodeDataLength];
+
+        int fanIndexInBlock =
+          unsignedBinarySearch(data, fanOffset, fanOffset + fanOut, searchForByte);
+        int nodeOffsetStartPos = fanOffset + fanOut;
+        if (fanIndexInBlock >= 0) {
+          // found it, but need to adjust for position of fan in overall block
+          int fanIndex = fanIndexInBlock - fanOffset;
+          int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanIndex, meta);
+          return search(data, bbStartPos + nodeOffset, skey, keyStartPos + nodeDataLength, meta);
+        } else {
+          int fanIndex = fanIndexInBlock + fanOffset;// didn't find it, so compensate in reverse
+          int insertionPoint = (-fanIndex - 1) - 1;
+          if (insertionPoint < 0) {
+            // < first children
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+            return getFirstLeafNode(data, bbStartPos + nodeOffset, meta) - 1;
+          } else {
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, insertionPoint, meta);
+            return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+          }
+        }
+      } else {
+        // skey.length == keyStartPos + nodeDataLength
+        if (numOccurrences > 0) {
+          // == current node and current node is a leaf node.
+          return getNodeIndex(data, pos, 0, meta);
+        } else {
+          // need -1, == current node and current node not a leaf node.
+          return getFirstLeafNode(data, bbStartPos, meta) - 1;
+        }
+      }
+    } else if (cs > 0) {
+      // search key bigger than (>) current node, get biggest
+      if (fanOut == 0) {
+        if (numOccurrences > 0) {
+          if (numOccurrences == 1) {
+            return getNodeIndex(data, pos, 0, meta);
+          } else {
+            // TODO
+            throw new IllegalStateException(
+              "numOccurrences = " + numOccurrences + " > 1 not expected.");
+          }
+        } else {
+          throw new IllegalStateException(
+            "numOccurrences = " + numOccurrences + ", fanOut = " + fanOut + " not expected.");
+        }
+      } else {
+        return getLastLeafNode(data, bbStartPos, meta);
+      }
+    } else {
+      // search key small than (<) current node, get smallest.
+      if (numOccurrences > 0) {
+        return getNodeIndex(data, pos, 0, meta) - 1;
+      } else {
+        return getFirstLeafNode(data, bbStartPos, meta) - 1;
+      }
+    }
+  }
+
+  static int getNodeDataLength(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int dataLength = (int) UFIntTool.fromBytes(data, offset, meta.nodeDataLengthWidth);
+    return dataLength;
+  }
+
+  static int getNodeFanOut(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int fanOut = (int) UFIntTool.fromBytes(data, offset, meta.fanOutWidth);
+    return fanOut;
+  }
+
+  static int getNodeNumOccurrences(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int numOccurrences = (int) UFIntTool.fromBytes(data, offset, meta.occurrencesWidth);
+    return numOccurrences;
+  }
+
+  static int getNodeOffset(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeOffset = (int) UFIntTool.fromBytes(data, offset + (index * meta.childNodeOffsetWidth),
+      meta.childNodeOffsetWidth);
+    return nodeOffset;
+  }
+
+  static int getNodeIndex(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeIndex =
+      (int) UFIntTool.fromBytes(data, offset + (index * meta.indexWidth), meta.indexWidth);
+    return nodeIndex;
+  }
+
+  /**
+   * Get the node's first leaf node
+   */
+  static int getFirstLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0 || fanOut == 0) {
+      // return current node.
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+      return getFirstLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  /**
+   * Get the node's last leaf node
+   */
+  static int getLastLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    // int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (fanOut == 0) {
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanOut - 1, meta);
+      return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  public static int unsignedBinarySearch(ByteBuffer a, int fromIndex, int toIndex, byte key) {
+    int unsignedKey = key & 0xff;
+    int low = fromIndex;
+    int high = toIndex - 1;
+
+    while (low <= high) {
+      int mid = low + ((high - low) >> 1);
+      int midVal = a.get(mid) & 0xff;
+
+      if (midVal < unsignedKey) {
+        low = mid + 1;
+      } else if (midVal > unsignedKey) {
+        high = mid - 1;
+      } else {
+        return mid; // key found
+      }
+    }
+    return -(low + 1); // key not found.
+  }
+
+  public static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth dataWidth,
+    int index) {
+    return get(data, bbStartPos, dataWidth, index, new byte[0]);
+  }
+
+  static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    byte[] prefix) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    byte[] bdata = new byte[dataLength];
+    ByteBuffer dup = data.duplicate();
+    dup.position(bbStartPos + meta.nodeDataLengthWidth);
+    dup.get(bdata, 0, dataLength);
+    bdata = Bytes.add(prefix, bdata);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      }
+    }
+    if (fanOut == 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      } else {
+        throw new IllegalStateException(
+          "Unexpected, search index=" + index + ", but find to " + currentNodeIndex);
+      }
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int locateIndex = locateWhichChild(data, bbStartPos, meta, index, fanOut, nodeOffsetStartPos);
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, locateIndex, meta);
+      return get(data, bbStartPos + nodeOffset, meta, index, bdata);
+    }
+  }
+
+  static int locateWhichChild(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    int fanOut, int nodeOffsetStartPos) {
+    for (int i = 0; i < fanOut; i++) {
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, i, meta);
+      int lastLeafNode = getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+      if (lastLeafNode >= index) {
+        return i;
+      }
+    }
+    throw new IllegalStateException("Unexpected unable to find index=" + index);
+  }
+
+  public static class TokenizerNode {

Review Comment:
   I think the memory overhead is a bit large here? Is it possible to use a double array trie here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache9 commented on a diff in pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache9 commented on code in PR #4782:
URL: https://github.com/apache/hbase/pull/4782#discussion_r977135055


##########
hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixTreeUtil.java:
##########
@@ -0,0 +1,593 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.encoding;
+
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.ByteArrayOutputStream;
+import org.apache.hadoop.hbase.io.util.UFIntTool;
+import org.apache.hadoop.hbase.nio.ByteBuff;
+import org.apache.hadoop.hbase.util.ByteBufferUtils;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+@InterfaceAudience.Private
+public class PrefixTreeUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(PrefixTreeUtil.class);
+
+  /**
+   * Build tree from begin
+   * @return the tree
+   */
+  public static TokenizerNode buildPrefixTree(List<byte[]> rowKeys) {
+    // root node.
+    TokenizerNode node = new TokenizerNode();
+    int start = 0;
+    // Get max common prefix
+    int common = maxCommonPrefix(rowKeys, 0, rowKeys.size() - 1, 0);
+    if (common > 0) {
+      byte[] commonB = Bytes.copy(rowKeys.get(0), 0, common);
+      node.nodeData = commonB;
+      for (int i = 0; i < rowKeys.size(); i++) {
+        if (rowKeys.get(i).length == common) {
+          node.numOccurrences++;
+          if (node.index == null) {
+            node.index = new ArrayList<>(1);
+          }
+          node.index.add(i);
+          start = i + 1;
+        } else {
+          break;
+        }
+      }
+    } else {
+      // Only root node data can be empty.
+      node.nodeData = new byte[0];
+    }
+    constructAndSplitChild(node, rowKeys, start, rowKeys.size() - 1, common);
+    return node;
+  }
+
+  /**
+   * Calculate max common prefix
+   * @return the max common prefix num bytes
+   */
+  static int maxCommonPrefix(List<byte[]> rowKeys, int start, int end, int startPos) {
+    // only one entry.
+    if (start == end) {
+      return rowKeys.get(start).length - startPos;
+    }
+    int common = 0;
+    for (int round = 0; round <= rowKeys.get(start).length - startPos - 1; round++) {
+      boolean same = true;
+      for (int i = start + 1; i <= end; i++) {
+        if (startPos + common > rowKeys.get(i).length - 1) {
+          same = false;
+          break;
+        }
+        if (rowKeys.get(start)[startPos + common] != rowKeys.get(i)[startPos + common]) {
+          same = false;
+          break;
+        }
+      }
+      if (same) {
+        common++;
+      } else {
+        break;
+      }
+    }
+    return common;
+  }
+
+  /**
+   * No common prefix split it.
+   */
+  static void constructAndSplitChild(TokenizerNode node, List<byte[]> rowKeys, int start, int end,
+    int startPos) {
+    int middle = start;
+    for (int i = start + 1; i <= end; i++) {
+      if (startPos > rowKeys.get(i).length - 1) {
+        middle = i - 1;
+        break;
+      }
+      if (rowKeys.get(start)[startPos] != rowKeys.get(i)[startPos]) {
+        middle = i - 1;
+        break;
+      }
+    }
+    constructCommonNodeAndChild(node, rowKeys, start, middle, startPos);
+    if (middle + 1 <= end) {
+      // right
+      constructCommonNodeAndChild(node, rowKeys, middle + 1, end, startPos);
+    }
+  }
+
+  /**
+   * Get max common prefix as node and build children.
+   */
+  static TokenizerNode constructCommonNodeAndChild(TokenizerNode node, List<byte[]> rowKeys,
+    int start, int end, int startPos) {
+    int common = maxCommonPrefix(rowKeys, start, end, startPos);
+    if (common > 0) {
+      TokenizerNode child = new TokenizerNode();
+      child.parent = node;
+      node.children.add(child);
+      byte[] commonB = Bytes.copy(rowKeys.get(start), startPos, common);
+      child.nodeData = commonB;
+      int newStart = start;
+      for (int i = start; i <= end; i++) {
+        if (rowKeys.get(i).length == (startPos + common)) {
+          child.numOccurrences++;
+          if (child.index == null) {
+            child.index = new ArrayList<>(1);
+          }
+          child.index.add(i);
+          newStart = i + 1;
+        } else {
+          break;
+        }
+      }
+      if (start != end && newStart <= end) {
+        if (newStart == start) {
+          // no common prefix.
+          constructAndSplitChild(child, rowKeys, newStart, end, startPos + common);
+        } else {
+          // can have common prefix.
+          constructCommonNodeAndChild(child, rowKeys, newStart, end, startPos + common);
+        }
+      }
+    } else {
+      // no common prefix, split
+      constructAndSplitChild(node, rowKeys, start, end, startPos);
+    }
+    return node;
+  }
+
+  public static void getNodeMetaInfo(TokenizerNode node, TokenizerNodeMeta meta) {
+    if (node.nodeData.length > meta.maxNodeDataLength) {
+      meta.maxNodeDataLength = node.nodeData.length;
+    }
+    meta.totalNodeDataLength += node.nodeData.length;
+    meta.countNodeDataNum++;
+
+    if (node.children.size() > meta.maxChildNum) {
+      meta.maxChildNum = node.children.size();
+    }
+    meta.totalChildNum += node.children.size();
+    meta.countChildNum++;
+
+    if (node.numOccurrences > meta.maxNumOccurrences) {
+      meta.maxNumOccurrences = node.numOccurrences;
+    }
+    meta.countNumOccurrences++;
+    if (node.index != null) {
+      for (Integer entry : node.index) {
+        if (entry > meta.maxIndex) {
+          meta.maxIndex = entry;
+        }
+      }
+    }
+    if (node.children.isEmpty()) {
+      meta.leafNodes.add(node);
+      meta.countIndexNum++;
+    } else {
+      meta.nonLeafNodes.add(node);
+    }
+    for (TokenizerNode child : node.children) {
+      getNodeMetaInfo(child, meta);
+    }
+  }
+
+  public static void serializePrefixTree(TokenizerNode node, PrefixTreeDataWidth dataWidth,
+    ByteArrayOutputStream outputStream) throws IOException {
+    TokenizerNodeMeta meta = new TokenizerNodeMeta();
+    PrefixTreeUtil.getNodeMetaInfo(node, meta);
+    int totalLength = 0;
+    dataWidth.nodeDataLengthWidth = UFIntTool.numBytes(meta.maxNodeDataLength);
+    totalLength += meta.totalNodeDataLength;
+    totalLength += dataWidth.nodeDataLengthWidth * meta.countNodeDataNum;
+
+    dataWidth.fanOutWidth = UFIntTool.numBytes(meta.maxChildNum);
+    // fan Out
+    totalLength += dataWidth.fanOutWidth * meta.countChildNum;
+    // fan Byte
+    totalLength += meta.totalChildNum;
+
+    // nextnodeoffset
+    totalLength += 4 * meta.countChildNum;
+
+    dataWidth.occurrencesWidth = UFIntTool.numBytes(meta.maxNumOccurrences);
+    totalLength += dataWidth.occurrencesWidth * meta.countNumOccurrences;
+
+    dataWidth.indexWidth = UFIntTool.numBytes(meta.maxIndex);
+    totalLength += dataWidth.indexWidth * meta.countIndexNum;
+
+    dataWidth.childNodeOffsetWidth = UFIntTool.numBytes(totalLength);
+
+    // track the starting position of each node in final output
+    int negativeIndex = 0;
+    for (int i = meta.leafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode leaf = meta.leafNodes.get(i);
+      // no children
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + leaf.nodeData.length + dataWidth.fanOutWidth
+          + dataWidth.occurrencesWidth + leaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      leaf.nodeWidth = leafNodeWidth;
+      leaf.negativeIndex = negativeIndex;
+    }
+    for (int i = meta.nonLeafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode nonLeaf = meta.nonLeafNodes.get(i);
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + nonLeaf.nodeData.length + dataWidth.fanOutWidth
+          + nonLeaf.children.size() + nonLeaf.children.size() * dataWidth.childNodeOffsetWidth
+          + dataWidth.occurrencesWidth + nonLeaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      nonLeaf.nodeWidth = leafNodeWidth;
+      nonLeaf.negativeIndex = negativeIndex;
+    }
+
+    for (int i = 0; i < meta.nonLeafNodes.size(); i++) {
+      serialize(meta.nonLeafNodes.get(i), outputStream, dataWidth);
+    }
+    for (int i = 0; i < meta.leafNodes.size(); i++) {
+      serialize(meta.leafNodes.get(i), outputStream, dataWidth);
+    }
+  }
+
+  static void serialize(TokenizerNode node, ByteArrayOutputStream os, PrefixTreeDataWidth dataWidth)
+    throws IOException {
+    UFIntTool.writeBytes(dataWidth.nodeDataLengthWidth, node.nodeData.length, os);
+    os.write(node.nodeData, 0, node.nodeData.length);
+    UFIntTool.writeBytes(dataWidth.fanOutWidth, node.children.size(), os);
+    for (TokenizerNode child : node.children) {
+      // child's first byte.
+      os.write(child.nodeData[0]);
+    }
+    for (TokenizerNode child : node.children) {
+      UFIntTool.writeBytes(dataWidth.childNodeOffsetWidth, node.negativeIndex - child.negativeIndex,
+        os);
+    }
+    UFIntTool.writeBytes(dataWidth.occurrencesWidth, node.numOccurrences, os);
+    for (int i = 0; i < node.numOccurrences; i++) {
+      UFIntTool.writeBytes(dataWidth.indexWidth, node.index.get(i), os);
+    }
+  }
+
+  public static void serialize(DataOutput out, PrefixTreeDataWidth dataWidth) throws IOException {
+    out.writeByte(dataWidth.nodeDataLengthWidth);
+    out.writeByte(dataWidth.fanOutWidth);
+    out.writeByte(dataWidth.occurrencesWidth);
+    out.writeByte(dataWidth.indexWidth);
+    out.writeByte(dataWidth.childNodeOffsetWidth);
+  }
+
+  public static void deserialize(ByteBuff data, PrefixTreeDataWidth dataWidth) {
+    dataWidth.nodeDataLengthWidth = data.get();
+    dataWidth.fanOutWidth = data.get();
+    dataWidth.occurrencesWidth = data.get();
+    dataWidth.indexWidth = data.get();
+    dataWidth.childNodeOffsetWidth = data.get();
+  }
+
+  /**
+   * Get the node index, that search key >= index and search key < (index + 1)
+   */
+  public static int search(ByteBuffer data, int bbStartPos, byte[] skey, int keyStartPos,
+    PrefixTreeDataWidth meta) {
+    int nodeDataLength = getNodeDataLength(data, bbStartPos, meta);
+    int cs = ByteBufferUtils.compareTo(skey, keyStartPos,
+      Math.min(skey.length - keyStartPos, nodeDataLength), data,
+      bbStartPos + meta.nodeDataLengthWidth, nodeDataLength);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+
+    if (cs == 0) {
+      // continue search
+      if (fanOut == 0) {
+        // no children, should be numOccurrences > 0
+        int index = getNodeIndex(data, pos, 0, meta);
+        if (skey.length == keyStartPos + nodeDataLength) {
+          // == current node
+          return index;
+        } else {
+          // > current node.
+          return index;
+        }
+      }
+      if (skey.length > keyStartPos + nodeDataLength) {
+        int fanOffset = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength + meta.fanOutWidth;
+        byte searchForByte = skey[keyStartPos + nodeDataLength];
+
+        int fanIndexInBlock =
+          unsignedBinarySearch(data, fanOffset, fanOffset + fanOut, searchForByte);
+        int nodeOffsetStartPos = fanOffset + fanOut;
+        if (fanIndexInBlock >= 0) {
+          // found it, but need to adjust for position of fan in overall block
+          int fanIndex = fanIndexInBlock - fanOffset;
+          int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanIndex, meta);
+          return search(data, bbStartPos + nodeOffset, skey, keyStartPos + nodeDataLength, meta);
+        } else {
+          int fanIndex = fanIndexInBlock + fanOffset;// didn't find it, so compensate in reverse
+          int insertionPoint = (-fanIndex - 1) - 1;
+          if (insertionPoint < 0) {
+            // < first children
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+            return getFirstLeafNode(data, bbStartPos + nodeOffset, meta) - 1;
+          } else {
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, insertionPoint, meta);
+            return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+          }
+        }
+      } else {
+        // skey.length == keyStartPos + nodeDataLength
+        if (numOccurrences > 0) {
+          // == current node and current node is a leaf node.
+          return getNodeIndex(data, pos, 0, meta);
+        } else {
+          // need -1, == current node and current node not a leaf node.
+          return getFirstLeafNode(data, bbStartPos, meta) - 1;
+        }
+      }
+    } else if (cs > 0) {
+      // search key bigger than (>) current node, get biggest
+      if (fanOut == 0) {
+        if (numOccurrences > 0) {
+          if (numOccurrences == 1) {
+            return getNodeIndex(data, pos, 0, meta);
+          } else {
+            // TODO
+            throw new IllegalStateException(
+              "numOccurrences = " + numOccurrences + " > 1 not expected.");
+          }
+        } else {
+          throw new IllegalStateException(
+            "numOccurrences = " + numOccurrences + ", fanOut = " + fanOut + " not expected.");
+        }
+      } else {
+        return getLastLeafNode(data, bbStartPos, meta);
+      }
+    } else {
+      // search key small than (<) current node, get smallest.
+      if (numOccurrences > 0) {
+        return getNodeIndex(data, pos, 0, meta) - 1;
+      } else {
+        return getFirstLeafNode(data, bbStartPos, meta) - 1;
+      }
+    }
+  }
+
+  static int getNodeDataLength(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int dataLength = (int) UFIntTool.fromBytes(data, offset, meta.nodeDataLengthWidth);
+    return dataLength;
+  }
+
+  static int getNodeFanOut(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int fanOut = (int) UFIntTool.fromBytes(data, offset, meta.fanOutWidth);
+    return fanOut;
+  }
+
+  static int getNodeNumOccurrences(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int numOccurrences = (int) UFIntTool.fromBytes(data, offset, meta.occurrencesWidth);
+    return numOccurrences;
+  }
+
+  static int getNodeOffset(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeOffset = (int) UFIntTool.fromBytes(data, offset + (index * meta.childNodeOffsetWidth),
+      meta.childNodeOffsetWidth);
+    return nodeOffset;
+  }
+
+  static int getNodeIndex(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeIndex =
+      (int) UFIntTool.fromBytes(data, offset + (index * meta.indexWidth), meta.indexWidth);
+    return nodeIndex;
+  }
+
+  /**
+   * Get the node's first leaf node
+   */
+  static int getFirstLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0 || fanOut == 0) {
+      // return current node.
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+      return getFirstLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  /**
+   * Get the node's last leaf node
+   */
+  static int getLastLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    // int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (fanOut == 0) {
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanOut - 1, meta);
+      return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  public static int unsignedBinarySearch(ByteBuffer a, int fromIndex, int toIndex, byte key) {
+    int unsignedKey = key & 0xff;
+    int low = fromIndex;
+    int high = toIndex - 1;
+
+    while (low <= high) {
+      int mid = low + ((high - low) >> 1);
+      int midVal = a.get(mid) & 0xff;
+
+      if (midVal < unsignedKey) {
+        low = mid + 1;
+      } else if (midVal > unsignedKey) {
+        high = mid - 1;
+      } else {
+        return mid; // key found
+      }
+    }
+    return -(low + 1); // key not found.
+  }
+
+  public static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth dataWidth,
+    int index) {
+    return get(data, bbStartPos, dataWidth, index, new byte[0]);
+  }
+
+  static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    byte[] prefix) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    byte[] bdata = new byte[dataLength];
+    ByteBuffer dup = data.duplicate();
+    dup.position(bbStartPos + meta.nodeDataLengthWidth);
+    dup.get(bdata, 0, dataLength);
+    bdata = Bytes.add(prefix, bdata);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      }
+    }
+    if (fanOut == 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      } else {
+        throw new IllegalStateException(
+          "Unexpected, search index=" + index + ", but find to " + currentNodeIndex);
+      }
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int locateIndex = locateWhichChild(data, bbStartPos, meta, index, fanOut, nodeOffsetStartPos);
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, locateIndex, meta);
+      return get(data, bbStartPos + nodeOffset, meta, index, bdata);
+    }
+  }
+
+  static int locateWhichChild(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    int fanOut, int nodeOffsetStartPos) {
+    for (int i = 0; i < fanOut; i++) {
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, i, meta);
+      int lastLeafNode = getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+      if (lastLeafNode >= index) {
+        return i;
+      }
+    }
+    throw new IllegalStateException("Unexpected unable to find index=" + index);
+  }
+
+  public static class TokenizerNode {

Review Comment:
   OK, so this TokenizerNode is just for construct the trie and serialize it, when reading, we will use the compact format and read the ByteBuffer directly instead of deserialize it to a TokenizerNode? Please add more comments so others could know this when they are not familiar with the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by "Apache-HBase (via GitHub)" <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1423352794

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m  6s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  No case conflicting files found.  |
   | +1 :green_heart: |  hbaseanti  |   0m  0s |  Patch does not have any anti-patterns.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any @author tags.  |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 13s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   4m 38s |  master passed  |
   | +1 :green_heart: |  compile  |   3m 16s |  master passed  |
   | +1 :green_heart: |  checkstyle  |   0m 53s |  master passed  |
   | +1 :green_heart: |  spotless  |   0m 49s |  branch has no errors when running spotless:check.  |
   | +1 :green_heart: |  spotbugs  |   2m 26s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 11s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   3m 53s |  the patch passed  |
   | +1 :green_heart: |  compile  |   3m 54s |  the patch passed  |
   | -0 :warning: |  javac  |   0m 44s |  hbase-common generated 6 new + 33 unchanged - 1 fixed = 39 total (was 34)  |
   | -0 :warning: |  checkstyle  |   0m 16s |  hbase-common: The patch generated 38 new + 0 unchanged - 0 fixed = 38 total (was 0)  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace issues.  |
   | +1 :green_heart: |  hadoopcheck  |  13m 47s |  Patch does not cause any errors with Hadoop 3.2.4 3.3.4.  |
   | +1 :green_heart: |  spotless  |   0m 45s |  patch has no errors when running spotless:check.  |
   | -1 :x: |  spotbugs  |   0m 47s |  hbase-common generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0)  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 16s |  The patch does not generate ASF License warnings.  |
   |  |   |  47m 43s |   |
   
   
   | Reason | Tests |
   |-------:|:------|
   | FindBugs | module:hbase-common |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 221] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 230] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 232] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 564] |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile |
   | uname | Linux 38fdb8f24936 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / 6a34aa8195 |
   | Default Java | Eclipse Adoptium-11.0.17+8 |
   | javac | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/diff-compile-javac-hbase-common.txt |
   | checkstyle | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/diff-checkstyle-hbase-common.txt |
   | spotbugs | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/new-spotbugs-hbase-common.html |
   | Max. process+thread count | 85 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/console |
   | versions | git=2.34.1 maven=3.8.6 spotbugs=4.7.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] binlijin commented on a diff in pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

binlijin commented on code in PR #4782:
URL: https://github.com/apache/hbase/pull/4782#discussion_r975991653


##########
hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixTreeUtil.java:
##########
@@ -0,0 +1,593 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.encoding;
+
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.ByteArrayOutputStream;
+import org.apache.hadoop.hbase.io.util.UFIntTool;
+import org.apache.hadoop.hbase.nio.ByteBuff;
+import org.apache.hadoop.hbase.util.ByteBufferUtils;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+@InterfaceAudience.Private
+public class PrefixTreeUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(PrefixTreeUtil.class);
+
+  /**
+   * Build tree from begin
+   * @return the tree
+   */
+  public static TokenizerNode buildPrefixTree(List<byte[]> rowKeys) {
+    // root node.
+    TokenizerNode node = new TokenizerNode();
+    int start = 0;
+    // Get max common prefix
+    int common = maxCommonPrefix(rowKeys, 0, rowKeys.size() - 1, 0);
+    if (common > 0) {
+      byte[] commonB = Bytes.copy(rowKeys.get(0), 0, common);
+      node.nodeData = commonB;
+      for (int i = 0; i < rowKeys.size(); i++) {
+        if (rowKeys.get(i).length == common) {
+          node.numOccurrences++;
+          if (node.index == null) {
+            node.index = new ArrayList<>(1);
+          }
+          node.index.add(i);
+          start = i + 1;
+        } else {
+          break;
+        }
+      }
+    } else {
+      // Only root node data can be empty.
+      node.nodeData = new byte[0];
+    }
+    constructAndSplitChild(node, rowKeys, start, rowKeys.size() - 1, common);
+    return node;
+  }
+
+  /**
+   * Calculate max common prefix
+   * @return the max common prefix num bytes
+   */
+  static int maxCommonPrefix(List<byte[]> rowKeys, int start, int end, int startPos) {
+    // only one entry.
+    if (start == end) {
+      return rowKeys.get(start).length - startPos;
+    }
+    int common = 0;
+    for (int round = 0; round <= rowKeys.get(start).length - startPos - 1; round++) {
+      boolean same = true;
+      for (int i = start + 1; i <= end; i++) {
+        if (startPos + common > rowKeys.get(i).length - 1) {
+          same = false;
+          break;
+        }
+        if (rowKeys.get(start)[startPos + common] != rowKeys.get(i)[startPos + common]) {
+          same = false;
+          break;
+        }
+      }
+      if (same) {
+        common++;
+      } else {
+        break;
+      }
+    }
+    return common;
+  }
+
+  /**
+   * No common prefix split it.
+   */
+  static void constructAndSplitChild(TokenizerNode node, List<byte[]> rowKeys, int start, int end,
+    int startPos) {
+    int middle = start;
+    for (int i = start + 1; i <= end; i++) {
+      if (startPos > rowKeys.get(i).length - 1) {
+        middle = i - 1;
+        break;
+      }
+      if (rowKeys.get(start)[startPos] != rowKeys.get(i)[startPos]) {
+        middle = i - 1;
+        break;
+      }
+    }
+    constructCommonNodeAndChild(node, rowKeys, start, middle, startPos);
+    if (middle + 1 <= end) {
+      // right
+      constructCommonNodeAndChild(node, rowKeys, middle + 1, end, startPos);
+    }
+  }
+
+  /**
+   * Get max common prefix as node and build children.
+   */
+  static TokenizerNode constructCommonNodeAndChild(TokenizerNode node, List<byte[]> rowKeys,
+    int start, int end, int startPos) {
+    int common = maxCommonPrefix(rowKeys, start, end, startPos);
+    if (common > 0) {
+      TokenizerNode child = new TokenizerNode();
+      child.parent = node;
+      node.children.add(child);
+      byte[] commonB = Bytes.copy(rowKeys.get(start), startPos, common);
+      child.nodeData = commonB;
+      int newStart = start;
+      for (int i = start; i <= end; i++) {
+        if (rowKeys.get(i).length == (startPos + common)) {
+          child.numOccurrences++;
+          if (child.index == null) {
+            child.index = new ArrayList<>(1);
+          }
+          child.index.add(i);
+          newStart = i + 1;
+        } else {
+          break;
+        }
+      }
+      if (start != end && newStart <= end) {
+        if (newStart == start) {
+          // no common prefix.
+          constructAndSplitChild(child, rowKeys, newStart, end, startPos + common);
+        } else {
+          // can have common prefix.
+          constructCommonNodeAndChild(child, rowKeys, newStart, end, startPos + common);
+        }
+      }
+    } else {
+      // no common prefix, split
+      constructAndSplitChild(node, rowKeys, start, end, startPos);
+    }
+    return node;
+  }
+
+  public static void getNodeMetaInfo(TokenizerNode node, TokenizerNodeMeta meta) {
+    if (node.nodeData.length > meta.maxNodeDataLength) {
+      meta.maxNodeDataLength = node.nodeData.length;
+    }
+    meta.totalNodeDataLength += node.nodeData.length;
+    meta.countNodeDataNum++;
+
+    if (node.children.size() > meta.maxChildNum) {
+      meta.maxChildNum = node.children.size();
+    }
+    meta.totalChildNum += node.children.size();
+    meta.countChildNum++;
+
+    if (node.numOccurrences > meta.maxNumOccurrences) {
+      meta.maxNumOccurrences = node.numOccurrences;
+    }
+    meta.countNumOccurrences++;
+    if (node.index != null) {
+      for (Integer entry : node.index) {
+        if (entry > meta.maxIndex) {
+          meta.maxIndex = entry;
+        }
+      }
+    }
+    if (node.children.isEmpty()) {
+      meta.leafNodes.add(node);
+      meta.countIndexNum++;
+    } else {
+      meta.nonLeafNodes.add(node);
+    }
+    for (TokenizerNode child : node.children) {
+      getNodeMetaInfo(child, meta);
+    }
+  }
+
+  public static void serializePrefixTree(TokenizerNode node, PrefixTreeDataWidth dataWidth,
+    ByteArrayOutputStream outputStream) throws IOException {
+    TokenizerNodeMeta meta = new TokenizerNodeMeta();
+    PrefixTreeUtil.getNodeMetaInfo(node, meta);
+    int totalLength = 0;
+    dataWidth.nodeDataLengthWidth = UFIntTool.numBytes(meta.maxNodeDataLength);
+    totalLength += meta.totalNodeDataLength;
+    totalLength += dataWidth.nodeDataLengthWidth * meta.countNodeDataNum;
+
+    dataWidth.fanOutWidth = UFIntTool.numBytes(meta.maxChildNum);
+    // fan Out
+    totalLength += dataWidth.fanOutWidth * meta.countChildNum;
+    // fan Byte
+    totalLength += meta.totalChildNum;
+
+    // nextnodeoffset
+    totalLength += 4 * meta.countChildNum;
+
+    dataWidth.occurrencesWidth = UFIntTool.numBytes(meta.maxNumOccurrences);
+    totalLength += dataWidth.occurrencesWidth * meta.countNumOccurrences;
+
+    dataWidth.indexWidth = UFIntTool.numBytes(meta.maxIndex);
+    totalLength += dataWidth.indexWidth * meta.countIndexNum;
+
+    dataWidth.childNodeOffsetWidth = UFIntTool.numBytes(totalLength);
+
+    // track the starting position of each node in final output
+    int negativeIndex = 0;
+    for (int i = meta.leafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode leaf = meta.leafNodes.get(i);
+      // no children
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + leaf.nodeData.length + dataWidth.fanOutWidth
+          + dataWidth.occurrencesWidth + leaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      leaf.nodeWidth = leafNodeWidth;
+      leaf.negativeIndex = negativeIndex;
+    }
+    for (int i = meta.nonLeafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode nonLeaf = meta.nonLeafNodes.get(i);
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + nonLeaf.nodeData.length + dataWidth.fanOutWidth
+          + nonLeaf.children.size() + nonLeaf.children.size() * dataWidth.childNodeOffsetWidth
+          + dataWidth.occurrencesWidth + nonLeaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      nonLeaf.nodeWidth = leafNodeWidth;
+      nonLeaf.negativeIndex = negativeIndex;
+    }
+
+    for (int i = 0; i < meta.nonLeafNodes.size(); i++) {
+      serialize(meta.nonLeafNodes.get(i), outputStream, dataWidth);
+    }
+    for (int i = 0; i < meta.leafNodes.size(); i++) {
+      serialize(meta.leafNodes.get(i), outputStream, dataWidth);
+    }
+  }
+
+  static void serialize(TokenizerNode node, ByteArrayOutputStream os, PrefixTreeDataWidth dataWidth)
+    throws IOException {
+    UFIntTool.writeBytes(dataWidth.nodeDataLengthWidth, node.nodeData.length, os);
+    os.write(node.nodeData, 0, node.nodeData.length);
+    UFIntTool.writeBytes(dataWidth.fanOutWidth, node.children.size(), os);
+    for (TokenizerNode child : node.children) {
+      // child's first byte.
+      os.write(child.nodeData[0]);
+    }
+    for (TokenizerNode child : node.children) {
+      UFIntTool.writeBytes(dataWidth.childNodeOffsetWidth, node.negativeIndex - child.negativeIndex,
+        os);
+    }
+    UFIntTool.writeBytes(dataWidth.occurrencesWidth, node.numOccurrences, os);
+    for (int i = 0; i < node.numOccurrences; i++) {
+      UFIntTool.writeBytes(dataWidth.indexWidth, node.index.get(i), os);
+    }
+  }
+
+  public static void serialize(DataOutput out, PrefixTreeDataWidth dataWidth) throws IOException {
+    out.writeByte(dataWidth.nodeDataLengthWidth);
+    out.writeByte(dataWidth.fanOutWidth);
+    out.writeByte(dataWidth.occurrencesWidth);
+    out.writeByte(dataWidth.indexWidth);
+    out.writeByte(dataWidth.childNodeOffsetWidth);
+  }
+
+  public static void deserialize(ByteBuff data, PrefixTreeDataWidth dataWidth) {
+    dataWidth.nodeDataLengthWidth = data.get();
+    dataWidth.fanOutWidth = data.get();
+    dataWidth.occurrencesWidth = data.get();
+    dataWidth.indexWidth = data.get();
+    dataWidth.childNodeOffsetWidth = data.get();
+  }
+
+  /**
+   * Get the node index, that search key >= index and search key < (index + 1)
+   */
+  public static int search(ByteBuffer data, int bbStartPos, byte[] skey, int keyStartPos,
+    PrefixTreeDataWidth meta) {
+    int nodeDataLength = getNodeDataLength(data, bbStartPos, meta);
+    int cs = ByteBufferUtils.compareTo(skey, keyStartPos,
+      Math.min(skey.length - keyStartPos, nodeDataLength), data,
+      bbStartPos + meta.nodeDataLengthWidth, nodeDataLength);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+
+    if (cs == 0) {
+      // continue search
+      if (fanOut == 0) {
+        // no children, should be numOccurrences > 0
+        int index = getNodeIndex(data, pos, 0, meta);
+        if (skey.length == keyStartPos + nodeDataLength) {
+          // == current node
+          return index;
+        } else {
+          // > current node.
+          return index;
+        }
+      }
+      if (skey.length > keyStartPos + nodeDataLength) {
+        int fanOffset = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength + meta.fanOutWidth;
+        byte searchForByte = skey[keyStartPos + nodeDataLength];
+
+        int fanIndexInBlock =
+          unsignedBinarySearch(data, fanOffset, fanOffset + fanOut, searchForByte);
+        int nodeOffsetStartPos = fanOffset + fanOut;
+        if (fanIndexInBlock >= 0) {
+          // found it, but need to adjust for position of fan in overall block
+          int fanIndex = fanIndexInBlock - fanOffset;
+          int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanIndex, meta);
+          return search(data, bbStartPos + nodeOffset, skey, keyStartPos + nodeDataLength, meta);
+        } else {
+          int fanIndex = fanIndexInBlock + fanOffset;// didn't find it, so compensate in reverse
+          int insertionPoint = (-fanIndex - 1) - 1;
+          if (insertionPoint < 0) {
+            // < first children
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+            return getFirstLeafNode(data, bbStartPos + nodeOffset, meta) - 1;
+          } else {
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, insertionPoint, meta);
+            return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+          }
+        }
+      } else {
+        // skey.length == keyStartPos + nodeDataLength
+        if (numOccurrences > 0) {
+          // == current node and current node is a leaf node.
+          return getNodeIndex(data, pos, 0, meta);
+        } else {
+          // need -1, == current node and current node not a leaf node.
+          return getFirstLeafNode(data, bbStartPos, meta) - 1;
+        }
+      }
+    } else if (cs > 0) {
+      // search key bigger than (>) current node, get biggest
+      if (fanOut == 0) {
+        if (numOccurrences > 0) {
+          if (numOccurrences == 1) {
+            return getNodeIndex(data, pos, 0, meta);
+          } else {
+            // TODO
+            throw new IllegalStateException(
+              "numOccurrences = " + numOccurrences + " > 1 not expected.");
+          }
+        } else {
+          throw new IllegalStateException(
+            "numOccurrences = " + numOccurrences + ", fanOut = " + fanOut + " not expected.");
+        }
+      } else {
+        return getLastLeafNode(data, bbStartPos, meta);
+      }
+    } else {
+      // search key small than (<) current node, get smallest.
+      if (numOccurrences > 0) {
+        return getNodeIndex(data, pos, 0, meta) - 1;
+      } else {
+        return getFirstLeafNode(data, bbStartPos, meta) - 1;
+      }
+    }
+  }
+
+  static int getNodeDataLength(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int dataLength = (int) UFIntTool.fromBytes(data, offset, meta.nodeDataLengthWidth);
+    return dataLength;
+  }
+
+  static int getNodeFanOut(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int fanOut = (int) UFIntTool.fromBytes(data, offset, meta.fanOutWidth);
+    return fanOut;
+  }
+
+  static int getNodeNumOccurrences(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int numOccurrences = (int) UFIntTool.fromBytes(data, offset, meta.occurrencesWidth);
+    return numOccurrences;
+  }
+
+  static int getNodeOffset(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeOffset = (int) UFIntTool.fromBytes(data, offset + (index * meta.childNodeOffsetWidth),
+      meta.childNodeOffsetWidth);
+    return nodeOffset;
+  }
+
+  static int getNodeIndex(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeIndex =
+      (int) UFIntTool.fromBytes(data, offset + (index * meta.indexWidth), meta.indexWidth);
+    return nodeIndex;
+  }
+
+  /**
+   * Get the node's first leaf node
+   */
+  static int getFirstLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0 || fanOut == 0) {
+      // return current node.
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+      return getFirstLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  /**
+   * Get the node's last leaf node
+   */
+  static int getLastLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    // int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (fanOut == 0) {
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanOut - 1, meta);
+      return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  public static int unsignedBinarySearch(ByteBuffer a, int fromIndex, int toIndex, byte key) {
+    int unsignedKey = key & 0xff;
+    int low = fromIndex;
+    int high = toIndex - 1;
+
+    while (low <= high) {
+      int mid = low + ((high - low) >> 1);
+      int midVal = a.get(mid) & 0xff;
+
+      if (midVal < unsignedKey) {
+        low = mid + 1;
+      } else if (midVal > unsignedKey) {
+        high = mid - 1;
+      } else {
+        return mid; // key found
+      }
+    }
+    return -(low + 1); // key not found.
+  }
+
+  public static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth dataWidth,
+    int index) {
+    return get(data, bbStartPos, dataWidth, index, new byte[0]);
+  }
+
+  static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    byte[] prefix) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    byte[] bdata = new byte[dataLength];
+    ByteBuffer dup = data.duplicate();
+    dup.position(bbStartPos + meta.nodeDataLengthWidth);
+    dup.get(bdata, 0, dataLength);
+    bdata = Bytes.add(prefix, bdata);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      }
+    }
+    if (fanOut == 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      } else {
+        throw new IllegalStateException(
+          "Unexpected, search index=" + index + ", but find to " + currentNodeIndex);
+      }
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int locateIndex = locateWhichChild(data, bbStartPos, meta, index, fanOut, nodeOffsetStartPos);
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, locateIndex, meta);
+      return get(data, bbStartPos + nodeOffset, meta, index, bdata);
+    }
+  }
+
+  static int locateWhichChild(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    int fanOut, int nodeOffsetStartPos) {
+    for (int i = 0; i < fanOut; i++) {
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, i, meta);
+      int lastLeafNode = getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+      if (lastLeafNode >= index) {
+        return i;
+      }
+    }
+    throw new IllegalStateException("Unexpected unable to find index=" + index);
+  }
+
+  public static class TokenizerNode {

Review Comment:
   IndexEncodedSeeker is used for reading, and read ByteBuffer data direct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1246200056

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m 12s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  No case conflicting files found.  |
   | +1 :green_heart: |  hbaseanti  |   0m  0s |  Patch does not have any anti-patterns.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any @author tags.  |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 11s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 29s |  master passed  |
   | +1 :green_heart: |  compile  |   2m 46s |  master passed  |
   | +1 :green_heart: |  checkstyle  |   0m 47s |  master passed  |
   | +1 :green_heart: |  spotless  |   0m 41s |  branch has no errors when running spotless:check.  |
   | -1 :x: |  spotbugs  |   1m 17s |  hbase-server in master has 1 extant spotbugs warnings.  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 11s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 12s |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 45s |  the patch passed  |
   | -0 :warning: |  javac  |   0m 33s |  hbase-common generated 4 new + 1 unchanged - 1 fixed = 5 total (was 2)  |
   | -0 :warning: |  checkstyle  |   0m 15s |  hbase-common: The patch generated 49 new + 0 unchanged - 0 fixed = 49 total (was 0)  |
   | -0 :warning: |  checkstyle  |   0m 31s |  hbase-server: The patch generated 1 new + 12 unchanged - 0 fixed = 13 total (was 12)  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace issues.  |
   | +1 :green_heart: |  hadoopcheck  |   8m  2s |  Patch does not cause any errors with Hadoop 3.2.4 3.3.4.  |
   | -1 :x: |  spotless  |   0m 14s |  patch has 68 errors when running spotless:check, run spotless:apply to fix.  |
   | -1 :x: |  spotbugs  |   0m 37s |  hbase-common generated 6 new + 0 unchanged - 0 fixed = 6 total (was 0)  |
   | +1 :green_heart: |  spotbugs  |   1m 25s |  hbase-server generated 0 new + 0 unchanged - 1 fixed = 0 total (was 1)  |
   ||| _ Other Tests _ |
   | -1 :x: |  asflicense  |   0m 15s |  The patch generated 1 ASF License warnings.  |
   |  |   |  31m 56s |   |
   
   
   | Reason | Tests |
   |-------:|:------|
   | FindBugs | module:hbase-common |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 205] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 214] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 216] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 553] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 561] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 538] |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile |
   | uname | Linux 707a30a34f89 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / b3a04899a4 |
   | Default Java | AdoptOpenJDK-1.8.0_282-b08 |
   | spotbugs | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/branch-spotbugs-hbase-server-warnings.html |
   | javac | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/diff-compile-javac-hbase-common.txt |
   | checkstyle | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/diff-checkstyle-hbase-common.txt |
   | checkstyle | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/diff-checkstyle-hbase-server.txt |
   | spotless | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/patch-spotless.txt |
   | spotbugs | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/new-spotbugs-hbase-common.html |
   | asflicense | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-general-check/output/patch-asflicense-problems.txt |
   | Max. process+thread count | 69 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/console |
   | versions | git=2.17.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1246339613

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m  0s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  2s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 11s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 44s |  master passed  |
   | +1 :green_heart: |  compile  |   1m  5s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   3m 51s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 14s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 31s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  3s |  the patch passed  |
   | +1 :green_heart: |  javac  |   1m  3s |  the patch passed  |
   | -1 :x: |  shadedjars  |   0m 56s |  patch has 10 errors when building our shaded downstream artifacts.  |
   | -0 :warning: |  javadoc  |   0m 16s |  hbase-common generated 8 new + 4 unchanged - 0 fixed = 12 total (was 4)  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m  6s |  hbase-common in the patch passed.  |
   | -1 :x: |  unit  | 205m 17s |  hbase-server in the patch failed.  |
   |  |   | 224m 25s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux b5d791470132 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / b3a04899a4 |
   | Default Java | AdoptOpenJDK-11.0.10+9 |
   | shadedjars | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk11-hadoop3-check/output/patch-shadedjars.txt |
   | javadoc | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk11-hadoop3-check/output/diff-javadoc-javadoc-hbase-common.txt |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/testReport/ |
   | Max. process+thread count | 3030 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/console |
   | versions | git=2.17.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] binlijin commented on a diff in pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

binlijin commented on code in PR #4782:
URL: https://github.com/apache/hbase/pull/4782#discussion_r978205684


##########
hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixTreeUtil.java:
##########
@@ -0,0 +1,593 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.encoding;
+
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.ByteArrayOutputStream;
+import org.apache.hadoop.hbase.io.util.UFIntTool;
+import org.apache.hadoop.hbase.nio.ByteBuff;
+import org.apache.hadoop.hbase.util.ByteBufferUtils;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+@InterfaceAudience.Private
+public class PrefixTreeUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(PrefixTreeUtil.class);
+
+  /**
+   * Build tree from begin
+   * @return the tree
+   */
+  public static TokenizerNode buildPrefixTree(List<byte[]> rowKeys) {
+    // root node.
+    TokenizerNode node = new TokenizerNode();
+    int start = 0;
+    // Get max common prefix
+    int common = maxCommonPrefix(rowKeys, 0, rowKeys.size() - 1, 0);
+    if (common > 0) {
+      byte[] commonB = Bytes.copy(rowKeys.get(0), 0, common);
+      node.nodeData = commonB;
+      for (int i = 0; i < rowKeys.size(); i++) {
+        if (rowKeys.get(i).length == common) {
+          node.numOccurrences++;
+          if (node.index == null) {
+            node.index = new ArrayList<>(1);
+          }
+          node.index.add(i);
+          start = i + 1;
+        } else {
+          break;
+        }
+      }
+    } else {
+      // Only root node data can be empty.
+      node.nodeData = new byte[0];
+    }
+    constructAndSplitChild(node, rowKeys, start, rowKeys.size() - 1, common);
+    return node;
+  }
+
+  /**
+   * Calculate max common prefix
+   * @return the max common prefix num bytes
+   */
+  static int maxCommonPrefix(List<byte[]> rowKeys, int start, int end, int startPos) {
+    // only one entry.
+    if (start == end) {
+      return rowKeys.get(start).length - startPos;
+    }
+    int common = 0;
+    for (int round = 0; round <= rowKeys.get(start).length - startPos - 1; round++) {
+      boolean same = true;
+      for (int i = start + 1; i <= end; i++) {
+        if (startPos + common > rowKeys.get(i).length - 1) {
+          same = false;
+          break;
+        }
+        if (rowKeys.get(start)[startPos + common] != rowKeys.get(i)[startPos + common]) {
+          same = false;
+          break;
+        }
+      }
+      if (same) {
+        common++;
+      } else {
+        break;
+      }
+    }
+    return common;
+  }
+
+  /**
+   * No common prefix split it.
+   */
+  static void constructAndSplitChild(TokenizerNode node, List<byte[]> rowKeys, int start, int end,
+    int startPos) {
+    int middle = start;
+    for (int i = start + 1; i <= end; i++) {
+      if (startPos > rowKeys.get(i).length - 1) {
+        middle = i - 1;
+        break;
+      }
+      if (rowKeys.get(start)[startPos] != rowKeys.get(i)[startPos]) {
+        middle = i - 1;
+        break;
+      }
+    }
+    constructCommonNodeAndChild(node, rowKeys, start, middle, startPos);
+    if (middle + 1 <= end) {
+      // right
+      constructCommonNodeAndChild(node, rowKeys, middle + 1, end, startPos);
+    }
+  }
+
+  /**
+   * Get max common prefix as node and build children.
+   */
+  static TokenizerNode constructCommonNodeAndChild(TokenizerNode node, List<byte[]> rowKeys,
+    int start, int end, int startPos) {
+    int common = maxCommonPrefix(rowKeys, start, end, startPos);
+    if (common > 0) {
+      TokenizerNode child = new TokenizerNode();
+      child.parent = node;
+      node.children.add(child);
+      byte[] commonB = Bytes.copy(rowKeys.get(start), startPos, common);
+      child.nodeData = commonB;
+      int newStart = start;
+      for (int i = start; i <= end; i++) {
+        if (rowKeys.get(i).length == (startPos + common)) {
+          child.numOccurrences++;
+          if (child.index == null) {
+            child.index = new ArrayList<>(1);
+          }
+          child.index.add(i);
+          newStart = i + 1;
+        } else {
+          break;
+        }
+      }
+      if (start != end && newStart <= end) {
+        if (newStart == start) {
+          // no common prefix.
+          constructAndSplitChild(child, rowKeys, newStart, end, startPos + common);
+        } else {
+          // can have common prefix.
+          constructCommonNodeAndChild(child, rowKeys, newStart, end, startPos + common);
+        }
+      }
+    } else {
+      // no common prefix, split
+      constructAndSplitChild(node, rowKeys, start, end, startPos);
+    }
+    return node;
+  }
+
+  public static void getNodeMetaInfo(TokenizerNode node, TokenizerNodeMeta meta) {
+    if (node.nodeData.length > meta.maxNodeDataLength) {
+      meta.maxNodeDataLength = node.nodeData.length;
+    }
+    meta.totalNodeDataLength += node.nodeData.length;
+    meta.countNodeDataNum++;
+
+    if (node.children.size() > meta.maxChildNum) {
+      meta.maxChildNum = node.children.size();
+    }
+    meta.totalChildNum += node.children.size();
+    meta.countChildNum++;
+
+    if (node.numOccurrences > meta.maxNumOccurrences) {
+      meta.maxNumOccurrences = node.numOccurrences;
+    }
+    meta.countNumOccurrences++;
+    if (node.index != null) {
+      for (Integer entry : node.index) {
+        if (entry > meta.maxIndex) {
+          meta.maxIndex = entry;
+        }
+      }
+    }
+    if (node.children.isEmpty()) {
+      meta.leafNodes.add(node);
+      meta.countIndexNum++;
+    } else {
+      meta.nonLeafNodes.add(node);
+    }
+    for (TokenizerNode child : node.children) {
+      getNodeMetaInfo(child, meta);
+    }
+  }
+
+  public static void serializePrefixTree(TokenizerNode node, PrefixTreeDataWidth dataWidth,
+    ByteArrayOutputStream outputStream) throws IOException {
+    TokenizerNodeMeta meta = new TokenizerNodeMeta();
+    PrefixTreeUtil.getNodeMetaInfo(node, meta);
+    int totalLength = 0;
+    dataWidth.nodeDataLengthWidth = UFIntTool.numBytes(meta.maxNodeDataLength);
+    totalLength += meta.totalNodeDataLength;
+    totalLength += dataWidth.nodeDataLengthWidth * meta.countNodeDataNum;
+
+    dataWidth.fanOutWidth = UFIntTool.numBytes(meta.maxChildNum);
+    // fan Out
+    totalLength += dataWidth.fanOutWidth * meta.countChildNum;
+    // fan Byte
+    totalLength += meta.totalChildNum;
+
+    // nextnodeoffset
+    totalLength += 4 * meta.countChildNum;
+
+    dataWidth.occurrencesWidth = UFIntTool.numBytes(meta.maxNumOccurrences);
+    totalLength += dataWidth.occurrencesWidth * meta.countNumOccurrences;
+
+    dataWidth.indexWidth = UFIntTool.numBytes(meta.maxIndex);
+    totalLength += dataWidth.indexWidth * meta.countIndexNum;
+
+    dataWidth.childNodeOffsetWidth = UFIntTool.numBytes(totalLength);
+
+    // track the starting position of each node in final output
+    int negativeIndex = 0;
+    for (int i = meta.leafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode leaf = meta.leafNodes.get(i);
+      // no children
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + leaf.nodeData.length + dataWidth.fanOutWidth
+          + dataWidth.occurrencesWidth + leaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      leaf.nodeWidth = leafNodeWidth;
+      leaf.negativeIndex = negativeIndex;
+    }
+    for (int i = meta.nonLeafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode nonLeaf = meta.nonLeafNodes.get(i);
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + nonLeaf.nodeData.length + dataWidth.fanOutWidth
+          + nonLeaf.children.size() + nonLeaf.children.size() * dataWidth.childNodeOffsetWidth
+          + dataWidth.occurrencesWidth + nonLeaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      nonLeaf.nodeWidth = leafNodeWidth;
+      nonLeaf.negativeIndex = negativeIndex;
+    }
+
+    for (int i = 0; i < meta.nonLeafNodes.size(); i++) {
+      serialize(meta.nonLeafNodes.get(i), outputStream, dataWidth);
+    }
+    for (int i = 0; i < meta.leafNodes.size(); i++) {
+      serialize(meta.leafNodes.get(i), outputStream, dataWidth);
+    }
+  }
+
+  static void serialize(TokenizerNode node, ByteArrayOutputStream os, PrefixTreeDataWidth dataWidth)
+    throws IOException {
+    UFIntTool.writeBytes(dataWidth.nodeDataLengthWidth, node.nodeData.length, os);
+    os.write(node.nodeData, 0, node.nodeData.length);
+    UFIntTool.writeBytes(dataWidth.fanOutWidth, node.children.size(), os);
+    for (TokenizerNode child : node.children) {
+      // child's first byte.
+      os.write(child.nodeData[0]);
+    }
+    for (TokenizerNode child : node.children) {
+      UFIntTool.writeBytes(dataWidth.childNodeOffsetWidth, node.negativeIndex - child.negativeIndex,
+        os);
+    }
+    UFIntTool.writeBytes(dataWidth.occurrencesWidth, node.numOccurrences, os);
+    for (int i = 0; i < node.numOccurrences; i++) {
+      UFIntTool.writeBytes(dataWidth.indexWidth, node.index.get(i), os);
+    }
+  }
+
+  public static void serialize(DataOutput out, PrefixTreeDataWidth dataWidth) throws IOException {
+    out.writeByte(dataWidth.nodeDataLengthWidth);
+    out.writeByte(dataWidth.fanOutWidth);
+    out.writeByte(dataWidth.occurrencesWidth);
+    out.writeByte(dataWidth.indexWidth);
+    out.writeByte(dataWidth.childNodeOffsetWidth);
+  }
+
+  public static void deserialize(ByteBuff data, PrefixTreeDataWidth dataWidth) {
+    dataWidth.nodeDataLengthWidth = data.get();
+    dataWidth.fanOutWidth = data.get();
+    dataWidth.occurrencesWidth = data.get();
+    dataWidth.indexWidth = data.get();
+    dataWidth.childNodeOffsetWidth = data.get();
+  }
+
+  /**
+   * Get the node index, that search key >= index and search key < (index + 1)
+   */
+  public static int search(ByteBuffer data, int bbStartPos, byte[] skey, int keyStartPos,
+    PrefixTreeDataWidth meta) {
+    int nodeDataLength = getNodeDataLength(data, bbStartPos, meta);
+    int cs = ByteBufferUtils.compareTo(skey, keyStartPos,
+      Math.min(skey.length - keyStartPos, nodeDataLength), data,
+      bbStartPos + meta.nodeDataLengthWidth, nodeDataLength);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+
+    if (cs == 0) {
+      // continue search
+      if (fanOut == 0) {
+        // no children, should be numOccurrences > 0
+        int index = getNodeIndex(data, pos, 0, meta);
+        if (skey.length == keyStartPos + nodeDataLength) {
+          // == current node
+          return index;
+        } else {
+          // > current node.
+          return index;
+        }
+      }
+      if (skey.length > keyStartPos + nodeDataLength) {
+        int fanOffset = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength + meta.fanOutWidth;
+        byte searchForByte = skey[keyStartPos + nodeDataLength];
+
+        int fanIndexInBlock =
+          unsignedBinarySearch(data, fanOffset, fanOffset + fanOut, searchForByte);
+        int nodeOffsetStartPos = fanOffset + fanOut;
+        if (fanIndexInBlock >= 0) {
+          // found it, but need to adjust for position of fan in overall block
+          int fanIndex = fanIndexInBlock - fanOffset;
+          int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanIndex, meta);
+          return search(data, bbStartPos + nodeOffset, skey, keyStartPos + nodeDataLength, meta);
+        } else {
+          int fanIndex = fanIndexInBlock + fanOffset;// didn't find it, so compensate in reverse
+          int insertionPoint = (-fanIndex - 1) - 1;
+          if (insertionPoint < 0) {
+            // < first children
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+            return getFirstLeafNode(data, bbStartPos + nodeOffset, meta) - 1;
+          } else {
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, insertionPoint, meta);
+            return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+          }
+        }
+      } else {
+        // skey.length == keyStartPos + nodeDataLength
+        if (numOccurrences > 0) {
+          // == current node and current node is a leaf node.
+          return getNodeIndex(data, pos, 0, meta);
+        } else {
+          // need -1, == current node and current node not a leaf node.
+          return getFirstLeafNode(data, bbStartPos, meta) - 1;
+        }
+      }
+    } else if (cs > 0) {
+      // search key bigger than (>) current node, get biggest
+      if (fanOut == 0) {
+        if (numOccurrences > 0) {
+          if (numOccurrences == 1) {
+            return getNodeIndex(data, pos, 0, meta);
+          } else {
+            // TODO
+            throw new IllegalStateException(
+              "numOccurrences = " + numOccurrences + " > 1 not expected.");
+          }
+        } else {
+          throw new IllegalStateException(
+            "numOccurrences = " + numOccurrences + ", fanOut = " + fanOut + " not expected.");
+        }
+      } else {
+        return getLastLeafNode(data, bbStartPos, meta);
+      }
+    } else {
+      // search key small than (<) current node, get smallest.
+      if (numOccurrences > 0) {
+        return getNodeIndex(data, pos, 0, meta) - 1;
+      } else {
+        return getFirstLeafNode(data, bbStartPos, meta) - 1;
+      }
+    }
+  }
+
+  static int getNodeDataLength(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int dataLength = (int) UFIntTool.fromBytes(data, offset, meta.nodeDataLengthWidth);
+    return dataLength;
+  }
+
+  static int getNodeFanOut(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int fanOut = (int) UFIntTool.fromBytes(data, offset, meta.fanOutWidth);
+    return fanOut;
+  }
+
+  static int getNodeNumOccurrences(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int numOccurrences = (int) UFIntTool.fromBytes(data, offset, meta.occurrencesWidth);
+    return numOccurrences;
+  }
+
+  static int getNodeOffset(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeOffset = (int) UFIntTool.fromBytes(data, offset + (index * meta.childNodeOffsetWidth),
+      meta.childNodeOffsetWidth);
+    return nodeOffset;
+  }
+
+  static int getNodeIndex(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeIndex =
+      (int) UFIntTool.fromBytes(data, offset + (index * meta.indexWidth), meta.indexWidth);
+    return nodeIndex;
+  }
+
+  /**
+   * Get the node's first leaf node
+   */
+  static int getFirstLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0 || fanOut == 0) {
+      // return current node.
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+      return getFirstLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  /**
+   * Get the node's last leaf node
+   */
+  static int getLastLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    // int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (fanOut == 0) {
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanOut - 1, meta);
+      return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  public static int unsignedBinarySearch(ByteBuffer a, int fromIndex, int toIndex, byte key) {
+    int unsignedKey = key & 0xff;
+    int low = fromIndex;
+    int high = toIndex - 1;
+
+    while (low <= high) {
+      int mid = low + ((high - low) >> 1);
+      int midVal = a.get(mid) & 0xff;
+
+      if (midVal < unsignedKey) {
+        low = mid + 1;
+      } else if (midVal > unsignedKey) {
+        high = mid - 1;
+      } else {
+        return mid; // key found
+      }
+    }
+    return -(low + 1); // key not found.
+  }
+
+  public static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth dataWidth,
+    int index) {
+    return get(data, bbStartPos, dataWidth, index, new byte[0]);
+  }
+
+  static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    byte[] prefix) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    byte[] bdata = new byte[dataLength];
+    ByteBuffer dup = data.duplicate();
+    dup.position(bbStartPos + meta.nodeDataLengthWidth);
+    dup.get(bdata, 0, dataLength);
+    bdata = Bytes.add(prefix, bdata);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      }
+    }
+    if (fanOut == 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      } else {
+        throw new IllegalStateException(
+          "Unexpected, search index=" + index + ", but find to " + currentNodeIndex);
+      }
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int locateIndex = locateWhichChild(data, bbStartPos, meta, index, fanOut, nodeOffsetStartPos);
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, locateIndex, meta);
+      return get(data, bbStartPos + nodeOffset, meta, index, bdata);
+    }
+  }
+
+  static int locateWhichChild(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    int fanOut, int nodeOffsetStartPos) {
+    for (int i = 0; i < fanOut; i++) {
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, i, meta);
+      int lastLeafNode = getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+      if (lastLeafNode >= index) {
+        return i;
+      }
+    }
+    throw new IllegalStateException("Unexpected unable to find index=" + index);
+  }
+
+  public static class TokenizerNode {

Review Comment:
   ok.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by "Apache-HBase (via GitHub)" <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1423518074

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   0m 55s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  3s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 11s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   3m 39s |  master passed  |
   | +1 :green_heart: |  compile  |   1m  5s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   4m 28s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 43s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 13s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   3m 25s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  5s |  the patch passed  |
   | +1 :green_heart: |  javac  |   1m  5s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   4m 26s |  patch has no errors when building our shaded downstream artifacts.  |
   | -0 :warning: |  javadoc  |   0m 16s |  hbase-common generated 18 new + 4 unchanged - 0 fixed = 22 total (was 4)  |
   ||| _ Other Tests _ |
   | -1 :x: |  unit  |   1m 32s |  hbase-common in the patch failed.  |
   | +1 :green_heart: |  unit  | 211m  3s |  hbase-server in the patch passed.  |
   |  |   | 237m 49s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux c0b832a15149 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / 6a34aa8195 |
   | Default Java | Eclipse Adoptium-11.0.17+8 |
   | javadoc | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk11-hadoop3-check/output/diff-javadoc-javadoc-hbase-common.txt |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-common.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/testReport/ |
   | Max. process+thread count | 2533 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/console |
   | versions | git=2.34.1 maven=3.8.6 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1261913474

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m  6s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  4s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 18s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 13s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 57s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   3m 48s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 14s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 15s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 57s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 57s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   3m 48s |  patch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 40s |  the patch passed  |
   ||| _ Other Tests _ |
   | -1 :x: |  unit  |   1m 11s |  hbase-common in the patch failed.  |
   | +1 :green_heart: |  unit  | 218m 20s |  hbase-server in the patch passed.  |
   |  |   | 238m 14s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux a8610d78f4e6 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / 23a56331db |
   | Default Java | Temurin-1.8.0_345-b01 |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-common.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/testReport/ |
   | Max. process+thread count | 2561 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/console |
   | versions | git=2.17.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] binlijin commented on a diff in pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

binlijin commented on code in PR #4782:
URL: https://github.com/apache/hbase/pull/4782#discussion_r974197836


##########
hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixTreeUtil.java:
##########
@@ -0,0 +1,593 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.encoding;
+
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.ByteArrayOutputStream;
+import org.apache.hadoop.hbase.io.util.UFIntTool;
+import org.apache.hadoop.hbase.nio.ByteBuff;
+import org.apache.hadoop.hbase.util.ByteBufferUtils;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+@InterfaceAudience.Private
+public class PrefixTreeUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(PrefixTreeUtil.class);
+
+  /**
+   * Build tree from begin
+   * @return the tree
+   */
+  public static TokenizerNode buildPrefixTree(List<byte[]> rowKeys) {
+    // root node.
+    TokenizerNode node = new TokenizerNode();
+    int start = 0;
+    // Get max common prefix
+    int common = maxCommonPrefix(rowKeys, 0, rowKeys.size() - 1, 0);
+    if (common > 0) {
+      byte[] commonB = Bytes.copy(rowKeys.get(0), 0, common);
+      node.nodeData = commonB;
+      for (int i = 0; i < rowKeys.size(); i++) {
+        if (rowKeys.get(i).length == common) {
+          node.numOccurrences++;
+          if (node.index == null) {
+            node.index = new ArrayList<>(1);
+          }
+          node.index.add(i);
+          start = i + 1;
+        } else {
+          break;
+        }
+      }
+    } else {
+      // Only root node data can be empty.
+      node.nodeData = new byte[0];
+    }
+    constructAndSplitChild(node, rowKeys, start, rowKeys.size() - 1, common);
+    return node;
+  }
+
+  /**
+   * Calculate max common prefix
+   * @return the max common prefix num bytes
+   */
+  static int maxCommonPrefix(List<byte[]> rowKeys, int start, int end, int startPos) {
+    // only one entry.
+    if (start == end) {
+      return rowKeys.get(start).length - startPos;
+    }
+    int common = 0;
+    for (int round = 0; round <= rowKeys.get(start).length - startPos - 1; round++) {
+      boolean same = true;
+      for (int i = start + 1; i <= end; i++) {
+        if (startPos + common > rowKeys.get(i).length - 1) {
+          same = false;
+          break;
+        }
+        if (rowKeys.get(start)[startPos + common] != rowKeys.get(i)[startPos + common]) {
+          same = false;
+          break;
+        }
+      }
+      if (same) {
+        common++;
+      } else {
+        break;
+      }
+    }
+    return common;
+  }
+
+  /**
+   * No common prefix split it.
+   */
+  static void constructAndSplitChild(TokenizerNode node, List<byte[]> rowKeys, int start, int end,
+    int startPos) {
+    int middle = start;
+    for (int i = start + 1; i <= end; i++) {
+      if (startPos > rowKeys.get(i).length - 1) {
+        middle = i - 1;
+        break;
+      }
+      if (rowKeys.get(start)[startPos] != rowKeys.get(i)[startPos]) {
+        middle = i - 1;
+        break;
+      }
+    }
+    constructCommonNodeAndChild(node, rowKeys, start, middle, startPos);
+    if (middle + 1 <= end) {
+      // right
+      constructCommonNodeAndChild(node, rowKeys, middle + 1, end, startPos);
+    }
+  }
+
+  /**
+   * Get max common prefix as node and build children.
+   */
+  static TokenizerNode constructCommonNodeAndChild(TokenizerNode node, List<byte[]> rowKeys,
+    int start, int end, int startPos) {
+    int common = maxCommonPrefix(rowKeys, start, end, startPos);
+    if (common > 0) {
+      TokenizerNode child = new TokenizerNode();
+      child.parent = node;
+      node.children.add(child);
+      byte[] commonB = Bytes.copy(rowKeys.get(start), startPos, common);
+      child.nodeData = commonB;
+      int newStart = start;
+      for (int i = start; i <= end; i++) {
+        if (rowKeys.get(i).length == (startPos + common)) {
+          child.numOccurrences++;
+          if (child.index == null) {
+            child.index = new ArrayList<>(1);
+          }
+          child.index.add(i);
+          newStart = i + 1;
+        } else {
+          break;
+        }
+      }
+      if (start != end && newStart <= end) {
+        if (newStart == start) {
+          // no common prefix.
+          constructAndSplitChild(child, rowKeys, newStart, end, startPos + common);
+        } else {
+          // can have common prefix.
+          constructCommonNodeAndChild(child, rowKeys, newStart, end, startPos + common);
+        }
+      }
+    } else {
+      // no common prefix, split
+      constructAndSplitChild(node, rowKeys, start, end, startPos);
+    }
+    return node;
+  }
+
+  public static void getNodeMetaInfo(TokenizerNode node, TokenizerNodeMeta meta) {
+    if (node.nodeData.length > meta.maxNodeDataLength) {
+      meta.maxNodeDataLength = node.nodeData.length;
+    }
+    meta.totalNodeDataLength += node.nodeData.length;
+    meta.countNodeDataNum++;
+
+    if (node.children.size() > meta.maxChildNum) {
+      meta.maxChildNum = node.children.size();
+    }
+    meta.totalChildNum += node.children.size();
+    meta.countChildNum++;
+
+    if (node.numOccurrences > meta.maxNumOccurrences) {
+      meta.maxNumOccurrences = node.numOccurrences;
+    }
+    meta.countNumOccurrences++;
+    if (node.index != null) {
+      for (Integer entry : node.index) {
+        if (entry > meta.maxIndex) {
+          meta.maxIndex = entry;
+        }
+      }
+    }
+    if (node.children.isEmpty()) {
+      meta.leafNodes.add(node);
+      meta.countIndexNum++;
+    } else {
+      meta.nonLeafNodes.add(node);
+    }
+    for (TokenizerNode child : node.children) {
+      getNodeMetaInfo(child, meta);
+    }
+  }
+
+  public static void serializePrefixTree(TokenizerNode node, PrefixTreeDataWidth dataWidth,
+    ByteArrayOutputStream outputStream) throws IOException {
+    TokenizerNodeMeta meta = new TokenizerNodeMeta();
+    PrefixTreeUtil.getNodeMetaInfo(node, meta);
+    int totalLength = 0;
+    dataWidth.nodeDataLengthWidth = UFIntTool.numBytes(meta.maxNodeDataLength);
+    totalLength += meta.totalNodeDataLength;
+    totalLength += dataWidth.nodeDataLengthWidth * meta.countNodeDataNum;
+
+    dataWidth.fanOutWidth = UFIntTool.numBytes(meta.maxChildNum);
+    // fan Out
+    totalLength += dataWidth.fanOutWidth * meta.countChildNum;
+    // fan Byte
+    totalLength += meta.totalChildNum;
+
+    // nextnodeoffset
+    totalLength += 4 * meta.countChildNum;
+
+    dataWidth.occurrencesWidth = UFIntTool.numBytes(meta.maxNumOccurrences);
+    totalLength += dataWidth.occurrencesWidth * meta.countNumOccurrences;
+
+    dataWidth.indexWidth = UFIntTool.numBytes(meta.maxIndex);
+    totalLength += dataWidth.indexWidth * meta.countIndexNum;
+
+    dataWidth.childNodeOffsetWidth = UFIntTool.numBytes(totalLength);
+
+    // track the starting position of each node in final output
+    int negativeIndex = 0;
+    for (int i = meta.leafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode leaf = meta.leafNodes.get(i);
+      // no children
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + leaf.nodeData.length + dataWidth.fanOutWidth
+          + dataWidth.occurrencesWidth + leaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      leaf.nodeWidth = leafNodeWidth;
+      leaf.negativeIndex = negativeIndex;
+    }
+    for (int i = meta.nonLeafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode nonLeaf = meta.nonLeafNodes.get(i);
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + nonLeaf.nodeData.length + dataWidth.fanOutWidth
+          + nonLeaf.children.size() + nonLeaf.children.size() * dataWidth.childNodeOffsetWidth
+          + dataWidth.occurrencesWidth + nonLeaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      nonLeaf.nodeWidth = leafNodeWidth;
+      nonLeaf.negativeIndex = negativeIndex;
+    }
+
+    for (int i = 0; i < meta.nonLeafNodes.size(); i++) {
+      serialize(meta.nonLeafNodes.get(i), outputStream, dataWidth);
+    }
+    for (int i = 0; i < meta.leafNodes.size(); i++) {
+      serialize(meta.leafNodes.get(i), outputStream, dataWidth);
+    }
+  }
+
+  static void serialize(TokenizerNode node, ByteArrayOutputStream os, PrefixTreeDataWidth dataWidth)
+    throws IOException {
+    UFIntTool.writeBytes(dataWidth.nodeDataLengthWidth, node.nodeData.length, os);
+    os.write(node.nodeData, 0, node.nodeData.length);
+    UFIntTool.writeBytes(dataWidth.fanOutWidth, node.children.size(), os);
+    for (TokenizerNode child : node.children) {
+      // child's first byte.
+      os.write(child.nodeData[0]);
+    }
+    for (TokenizerNode child : node.children) {
+      UFIntTool.writeBytes(dataWidth.childNodeOffsetWidth, node.negativeIndex - child.negativeIndex,
+        os);
+    }
+    UFIntTool.writeBytes(dataWidth.occurrencesWidth, node.numOccurrences, os);
+    for (int i = 0; i < node.numOccurrences; i++) {
+      UFIntTool.writeBytes(dataWidth.indexWidth, node.index.get(i), os);
+    }
+  }
+
+  public static void serialize(DataOutput out, PrefixTreeDataWidth dataWidth) throws IOException {
+    out.writeByte(dataWidth.nodeDataLengthWidth);
+    out.writeByte(dataWidth.fanOutWidth);
+    out.writeByte(dataWidth.occurrencesWidth);
+    out.writeByte(dataWidth.indexWidth);
+    out.writeByte(dataWidth.childNodeOffsetWidth);
+  }
+
+  public static void deserialize(ByteBuff data, PrefixTreeDataWidth dataWidth) {
+    dataWidth.nodeDataLengthWidth = data.get();
+    dataWidth.fanOutWidth = data.get();
+    dataWidth.occurrencesWidth = data.get();
+    dataWidth.indexWidth = data.get();
+    dataWidth.childNodeOffsetWidth = data.get();
+  }
+
+  /**
+   * Get the node index, that search key >= index and search key < (index + 1)
+   */
+  public static int search(ByteBuffer data, int bbStartPos, byte[] skey, int keyStartPos,
+    PrefixTreeDataWidth meta) {
+    int nodeDataLength = getNodeDataLength(data, bbStartPos, meta);
+    int cs = ByteBufferUtils.compareTo(skey, keyStartPos,
+      Math.min(skey.length - keyStartPos, nodeDataLength), data,
+      bbStartPos + meta.nodeDataLengthWidth, nodeDataLength);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+
+    if (cs == 0) {
+      // continue search
+      if (fanOut == 0) {
+        // no children, should be numOccurrences > 0
+        int index = getNodeIndex(data, pos, 0, meta);
+        if (skey.length == keyStartPos + nodeDataLength) {
+          // == current node
+          return index;
+        } else {
+          // > current node.
+          return index;
+        }
+      }
+      if (skey.length > keyStartPos + nodeDataLength) {
+        int fanOffset = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength + meta.fanOutWidth;
+        byte searchForByte = skey[keyStartPos + nodeDataLength];
+
+        int fanIndexInBlock =
+          unsignedBinarySearch(data, fanOffset, fanOffset + fanOut, searchForByte);
+        int nodeOffsetStartPos = fanOffset + fanOut;
+        if (fanIndexInBlock >= 0) {
+          // found it, but need to adjust for position of fan in overall block
+          int fanIndex = fanIndexInBlock - fanOffset;
+          int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanIndex, meta);
+          return search(data, bbStartPos + nodeOffset, skey, keyStartPos + nodeDataLength, meta);
+        } else {
+          int fanIndex = fanIndexInBlock + fanOffset;// didn't find it, so compensate in reverse
+          int insertionPoint = (-fanIndex - 1) - 1;
+          if (insertionPoint < 0) {
+            // < first children
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+            return getFirstLeafNode(data, bbStartPos + nodeOffset, meta) - 1;
+          } else {
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, insertionPoint, meta);
+            return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+          }
+        }
+      } else {
+        // skey.length == keyStartPos + nodeDataLength
+        if (numOccurrences > 0) {
+          // == current node and current node is a leaf node.
+          return getNodeIndex(data, pos, 0, meta);
+        } else {
+          // need -1, == current node and current node not a leaf node.
+          return getFirstLeafNode(data, bbStartPos, meta) - 1;
+        }
+      }
+    } else if (cs > 0) {
+      // search key bigger than (>) current node, get biggest
+      if (fanOut == 0) {
+        if (numOccurrences > 0) {
+          if (numOccurrences == 1) {
+            return getNodeIndex(data, pos, 0, meta);
+          } else {
+            // TODO
+            throw new IllegalStateException(
+              "numOccurrences = " + numOccurrences + " > 1 not expected.");
+          }
+        } else {
+          throw new IllegalStateException(
+            "numOccurrences = " + numOccurrences + ", fanOut = " + fanOut + " not expected.");
+        }
+      } else {
+        return getLastLeafNode(data, bbStartPos, meta);
+      }
+    } else {
+      // search key small than (<) current node, get smallest.
+      if (numOccurrences > 0) {
+        return getNodeIndex(data, pos, 0, meta) - 1;
+      } else {
+        return getFirstLeafNode(data, bbStartPos, meta) - 1;
+      }
+    }
+  }
+
+  static int getNodeDataLength(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int dataLength = (int) UFIntTool.fromBytes(data, offset, meta.nodeDataLengthWidth);
+    return dataLength;
+  }
+
+  static int getNodeFanOut(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int fanOut = (int) UFIntTool.fromBytes(data, offset, meta.fanOutWidth);
+    return fanOut;
+  }
+
+  static int getNodeNumOccurrences(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int numOccurrences = (int) UFIntTool.fromBytes(data, offset, meta.occurrencesWidth);
+    return numOccurrences;
+  }
+
+  static int getNodeOffset(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeOffset = (int) UFIntTool.fromBytes(data, offset + (index * meta.childNodeOffsetWidth),
+      meta.childNodeOffsetWidth);
+    return nodeOffset;
+  }
+
+  static int getNodeIndex(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeIndex =
+      (int) UFIntTool.fromBytes(data, offset + (index * meta.indexWidth), meta.indexWidth);
+    return nodeIndex;
+  }
+
+  /**
+   * Get the node's first leaf node
+   */
+  static int getFirstLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0 || fanOut == 0) {
+      // return current node.
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+      return getFirstLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  /**
+   * Get the node's last leaf node
+   */
+  static int getLastLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    // int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (fanOut == 0) {
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanOut - 1, meta);
+      return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  public static int unsignedBinarySearch(ByteBuffer a, int fromIndex, int toIndex, byte key) {
+    int unsignedKey = key & 0xff;
+    int low = fromIndex;
+    int high = toIndex - 1;
+
+    while (low <= high) {
+      int mid = low + ((high - low) >> 1);
+      int midVal = a.get(mid) & 0xff;
+
+      if (midVal < unsignedKey) {
+        low = mid + 1;
+      } else if (midVal > unsignedKey) {
+        high = mid - 1;
+      } else {
+        return mid; // key found
+      }
+    }
+    return -(low + 1); // key not found.
+  }
+
+  public static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth dataWidth,
+    int index) {
+    return get(data, bbStartPos, dataWidth, index, new byte[0]);
+  }
+
+  static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    byte[] prefix) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    byte[] bdata = new byte[dataLength];
+    ByteBuffer dup = data.duplicate();
+    dup.position(bbStartPos + meta.nodeDataLengthWidth);
+    dup.get(bdata, 0, dataLength);
+    bdata = Bytes.add(prefix, bdata);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      }
+    }
+    if (fanOut == 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      } else {
+        throw new IllegalStateException(
+          "Unexpected, search index=" + index + ", but find to " + currentNodeIndex);
+      }
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int locateIndex = locateWhichChild(data, bbStartPos, meta, index, fanOut, nodeOffsetStartPos);
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, locateIndex, meta);
+      return get(data, bbStartPos + nodeOffset, meta, index, bdata);
+    }
+  }
+
+  static int locateWhichChild(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    int fanOut, int nodeOffsetStartPos) {
+    for (int i = 0; i < fanOut; i++) {
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, i, meta);
+      int lastLeafNode = getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+      if (lastLeafNode >= index) {
+        return i;
+      }
+    }
+    throw new IllegalStateException("Unexpected unable to find index=" + index);
+  }
+
+  public static class TokenizerNode {

Review Comment:
   This is only need when serialize, and will not use it when reading.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1246600047

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   0m 59s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  3s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 11s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 43s |  master passed  |
   | +1 :green_heart: |  compile  |   1m  5s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   3m 49s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 43s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 14s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 30s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  5s |  the patch passed  |
   | +1 :green_heart: |  javac  |   1m  5s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   3m 44s |  patch has no errors when building our shaded downstream artifacts.  |
   | -0 :warning: |  javadoc  |   0m 16s |  hbase-common generated 8 new + 4 unchanged - 0 fixed = 12 total (was 4)  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m  6s |  hbase-common in the patch passed.  |
   | +1 :green_heart: |  unit  | 203m 15s |  hbase-server in the patch passed.  |
   |  |   | 225m  5s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux 11c8f7680a3f 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / b3a04899a4 |
   | Default Java | AdoptOpenJDK-11.0.10+9 |
   | javadoc | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-jdk11-hadoop3-check/output/diff-javadoc-javadoc-hbase-common.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/testReport/ |
   | Max. process+thread count | 2488 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/console |
   | versions | git=2.17.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1246630706

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m 11s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  4s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m  7s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 13s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 56s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   3m 49s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 13s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 13s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 54s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 54s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   3m 49s |  patch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 40s |  the patch passed  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 57s |  hbase-common in the patch passed.  |
   | -1 :x: |  unit  | 233m  3s |  hbase-server in the patch failed.  |
   |  |   | 254m  7s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux 03d7e146ad75 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / b3a04899a4 |
   | Default Java | AdoptOpenJDK-1.8.0_282-b08 |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-server.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/testReport/ |
   | Max. process+thread count | 3217 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/console |
   | versions | git=2.17.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by "Apache-HBase (via GitHub)" <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1423521689

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   2m 23s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  4s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m  8s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 50s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 56s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   4m 18s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 40s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 13s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 49s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 56s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 56s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   4m 19s |  patch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 38s |  the patch passed  |
   ||| _ Other Tests _ |
   | -1 :x: |  unit  |   1m 12s |  hbase-common in the patch failed.  |
   | +1 :green_heart: |  unit  | 217m 35s |  hbase-server in the patch passed.  |
   |  |   | 243m  5s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux cbf78ba1f5f8 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / 6a34aa8195 |
   | Default Java | Temurin-1.8.0_352-b08 |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-common.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/testReport/ |
   | Max. process+thread count | 2569 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/1/console |
   | versions | git=2.34.1 maven=3.8.6 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1261906844

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   2m 47s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  2s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 15s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 37s |  master passed  |
   | +1 :green_heart: |  compile  |   1m  3s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   3m 44s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 43s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 14s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 35s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  2s |  the patch passed  |
   | +1 :green_heart: |  javac  |   1m  2s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   3m 47s |  patch has no errors when building our shaded downstream artifacts.  |
   | -0 :warning: |  javadoc  |   0m 16s |  hbase-common generated 18 new + 4 unchanged - 0 fixed = 22 total (was 4)  |
   ||| _ Other Tests _ |
   | -1 :x: |  unit  |   1m 30s |  hbase-common in the patch failed.  |
   | -1 :x: |  unit  | 209m 27s |  hbase-server in the patch failed.  |
   |  |   | 232m  1s |   |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux dc8410d9b731 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / 23a56331db |
   | Default Java | Eclipse Adoptium-11.0.16.1+1 |
   | javadoc | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-jdk11-hadoop3-check/output/diff-javadoc-javadoc-hbase-common.txt |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-common.txt |
   | unit | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt |
   |  Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/testReport/ |
   | Max. process+thread count | 2412 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/3/console |
   | versions | git=2.17.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] binlijin commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

binlijin commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1261726564

   The first version is simple, index block only contains cell's row, so one row's all cells need to store in one HFileBlock. The version 2 support row + qualifier + timestamp + type which keep consistent with the current situation, one row's all cells can store in multiple HFileBlock.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache-HBase commented on pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache-HBase commented on PR #4782:
URL: https://github.com/apache/hbase/pull/4782#issuecomment-1246378818

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m 12s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  No case conflicting files found.  |
   | +1 :green_heart: |  hbaseanti  |   0m  0s |  Patch does not have any anti-patterns.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any @author tags.  |
   ||| _ master Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 10s |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |   2m 10s |  master passed  |
   | +1 :green_heart: |  compile  |   2m 45s |  master passed  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  master passed  |
   | +1 :green_heart: |  spotless  |   0m 38s |  branch has no errors when running spotless:check.  |
   | -1 :x: |  spotbugs  |   1m 17s |  hbase-server in master has 1 extant spotbugs warnings.  |
   ||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 12s |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 10s |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 46s |  the patch passed  |
   | -0 :warning: |  javac  |   0m 32s |  hbase-common generated 4 new + 1 unchanged - 1 fixed = 5 total (was 2)  |
   | -0 :warning: |  checkstyle  |   0m 14s |  hbase-common: The patch generated 30 new + 0 unchanged - 0 fixed = 30 total (was 0)  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace issues.  |
   | +1 :green_heart: |  hadoopcheck  |   7m 55s |  Patch does not cause any errors with Hadoop 3.2.4 3.3.4.  |
   | +1 :green_heart: |  spotless  |   0m 39s |  patch has no errors when running spotless:check.  |
   | -1 :x: |  spotbugs  |   0m 37s |  hbase-common generated 6 new + 0 unchanged - 0 fixed = 6 total (was 0)  |
   | +1 :green_heart: |  spotbugs  |   1m 25s |  hbase-server generated 0 new + 0 unchanged - 1 fixed = 0 total (was 1)  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 19s |  The patch does not generate ASF License warnings.  |
   |  |   |  31m 42s |   |
   
   
   | Reason | Tests |
   |-------:|:------|
   | FindBugs | module:hbase-common |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 221] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 230] |
   |  |  Unread public/protected field:At PrefixTreeIndexBlockEncoder.java:[line 232] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 550] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 558] |
   |  |  Unread public/protected field:At PrefixTreeUtil.java:[line 535] |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-general-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/4782 |
   | Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile |
   | uname | Linux fe7381b8c86e 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / b3a04899a4 |
   | Default Java | AdoptOpenJDK-1.8.0_282-b08 |
   | spotbugs | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-general-check/output/branch-spotbugs-hbase-server-warnings.html |
   | javac | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-general-check/output/diff-compile-javac-hbase-common.txt |
   | checkstyle | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-general-check/output/diff-checkstyle-hbase-common.txt |
   | spotbugs | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/artifact/yetus-general-check/output/new-spotbugs-hbase-common.html |
   | Max. process+thread count | 64 (vs. ulimit of 30000) |
   | modules | C: hbase-common hbase-server U: . |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4782/2/console |
   | versions | git=2.17.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hbase] Apache9 commented on a diff in pull request #4782: HBASE-27329 Introduce prefix tree index block encoding use less space

Posted by GitBox <gi...@apache.org>.

Apache9 commented on code in PR #4782:
URL: https://github.com/apache/hbase/pull/4782#discussion_r974247593


##########
hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixTreeUtil.java:
##########
@@ -0,0 +1,593 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.encoding;
+
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.ByteArrayOutputStream;
+import org.apache.hadoop.hbase.io.util.UFIntTool;
+import org.apache.hadoop.hbase.nio.ByteBuff;
+import org.apache.hadoop.hbase.util.ByteBufferUtils;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.yetus.audience.InterfaceAudience;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+@InterfaceAudience.Private
+public class PrefixTreeUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(PrefixTreeUtil.class);
+
+  /**
+   * Build tree from begin
+   * @return the tree
+   */
+  public static TokenizerNode buildPrefixTree(List<byte[]> rowKeys) {
+    // root node.
+    TokenizerNode node = new TokenizerNode();
+    int start = 0;
+    // Get max common prefix
+    int common = maxCommonPrefix(rowKeys, 0, rowKeys.size() - 1, 0);
+    if (common > 0) {
+      byte[] commonB = Bytes.copy(rowKeys.get(0), 0, common);
+      node.nodeData = commonB;
+      for (int i = 0; i < rowKeys.size(); i++) {
+        if (rowKeys.get(i).length == common) {
+          node.numOccurrences++;
+          if (node.index == null) {
+            node.index = new ArrayList<>(1);
+          }
+          node.index.add(i);
+          start = i + 1;
+        } else {
+          break;
+        }
+      }
+    } else {
+      // Only root node data can be empty.
+      node.nodeData = new byte[0];
+    }
+    constructAndSplitChild(node, rowKeys, start, rowKeys.size() - 1, common);
+    return node;
+  }
+
+  /**
+   * Calculate max common prefix
+   * @return the max common prefix num bytes
+   */
+  static int maxCommonPrefix(List<byte[]> rowKeys, int start, int end, int startPos) {
+    // only one entry.
+    if (start == end) {
+      return rowKeys.get(start).length - startPos;
+    }
+    int common = 0;
+    for (int round = 0; round <= rowKeys.get(start).length - startPos - 1; round++) {
+      boolean same = true;
+      for (int i = start + 1; i <= end; i++) {
+        if (startPos + common > rowKeys.get(i).length - 1) {
+          same = false;
+          break;
+        }
+        if (rowKeys.get(start)[startPos + common] != rowKeys.get(i)[startPos + common]) {
+          same = false;
+          break;
+        }
+      }
+      if (same) {
+        common++;
+      } else {
+        break;
+      }
+    }
+    return common;
+  }
+
+  /**
+   * No common prefix split it.
+   */
+  static void constructAndSplitChild(TokenizerNode node, List<byte[]> rowKeys, int start, int end,
+    int startPos) {
+    int middle = start;
+    for (int i = start + 1; i <= end; i++) {
+      if (startPos > rowKeys.get(i).length - 1) {
+        middle = i - 1;
+        break;
+      }
+      if (rowKeys.get(start)[startPos] != rowKeys.get(i)[startPos]) {
+        middle = i - 1;
+        break;
+      }
+    }
+    constructCommonNodeAndChild(node, rowKeys, start, middle, startPos);
+    if (middle + 1 <= end) {
+      // right
+      constructCommonNodeAndChild(node, rowKeys, middle + 1, end, startPos);
+    }
+  }
+
+  /**
+   * Get max common prefix as node and build children.
+   */
+  static TokenizerNode constructCommonNodeAndChild(TokenizerNode node, List<byte[]> rowKeys,
+    int start, int end, int startPos) {
+    int common = maxCommonPrefix(rowKeys, start, end, startPos);
+    if (common > 0) {
+      TokenizerNode child = new TokenizerNode();
+      child.parent = node;
+      node.children.add(child);
+      byte[] commonB = Bytes.copy(rowKeys.get(start), startPos, common);
+      child.nodeData = commonB;
+      int newStart = start;
+      for (int i = start; i <= end; i++) {
+        if (rowKeys.get(i).length == (startPos + common)) {
+          child.numOccurrences++;
+          if (child.index == null) {
+            child.index = new ArrayList<>(1);
+          }
+          child.index.add(i);
+          newStart = i + 1;
+        } else {
+          break;
+        }
+      }
+      if (start != end && newStart <= end) {
+        if (newStart == start) {
+          // no common prefix.
+          constructAndSplitChild(child, rowKeys, newStart, end, startPos + common);
+        } else {
+          // can have common prefix.
+          constructCommonNodeAndChild(child, rowKeys, newStart, end, startPos + common);
+        }
+      }
+    } else {
+      // no common prefix, split
+      constructAndSplitChild(node, rowKeys, start, end, startPos);
+    }
+    return node;
+  }
+
+  public static void getNodeMetaInfo(TokenizerNode node, TokenizerNodeMeta meta) {
+    if (node.nodeData.length > meta.maxNodeDataLength) {
+      meta.maxNodeDataLength = node.nodeData.length;
+    }
+    meta.totalNodeDataLength += node.nodeData.length;
+    meta.countNodeDataNum++;
+
+    if (node.children.size() > meta.maxChildNum) {
+      meta.maxChildNum = node.children.size();
+    }
+    meta.totalChildNum += node.children.size();
+    meta.countChildNum++;
+
+    if (node.numOccurrences > meta.maxNumOccurrences) {
+      meta.maxNumOccurrences = node.numOccurrences;
+    }
+    meta.countNumOccurrences++;
+    if (node.index != null) {
+      for (Integer entry : node.index) {
+        if (entry > meta.maxIndex) {
+          meta.maxIndex = entry;
+        }
+      }
+    }
+    if (node.children.isEmpty()) {
+      meta.leafNodes.add(node);
+      meta.countIndexNum++;
+    } else {
+      meta.nonLeafNodes.add(node);
+    }
+    for (TokenizerNode child : node.children) {
+      getNodeMetaInfo(child, meta);
+    }
+  }
+
+  public static void serializePrefixTree(TokenizerNode node, PrefixTreeDataWidth dataWidth,
+    ByteArrayOutputStream outputStream) throws IOException {
+    TokenizerNodeMeta meta = new TokenizerNodeMeta();
+    PrefixTreeUtil.getNodeMetaInfo(node, meta);
+    int totalLength = 0;
+    dataWidth.nodeDataLengthWidth = UFIntTool.numBytes(meta.maxNodeDataLength);
+    totalLength += meta.totalNodeDataLength;
+    totalLength += dataWidth.nodeDataLengthWidth * meta.countNodeDataNum;
+
+    dataWidth.fanOutWidth = UFIntTool.numBytes(meta.maxChildNum);
+    // fan Out
+    totalLength += dataWidth.fanOutWidth * meta.countChildNum;
+    // fan Byte
+    totalLength += meta.totalChildNum;
+
+    // nextnodeoffset
+    totalLength += 4 * meta.countChildNum;
+
+    dataWidth.occurrencesWidth = UFIntTool.numBytes(meta.maxNumOccurrences);
+    totalLength += dataWidth.occurrencesWidth * meta.countNumOccurrences;
+
+    dataWidth.indexWidth = UFIntTool.numBytes(meta.maxIndex);
+    totalLength += dataWidth.indexWidth * meta.countIndexNum;
+
+    dataWidth.childNodeOffsetWidth = UFIntTool.numBytes(totalLength);
+
+    // track the starting position of each node in final output
+    int negativeIndex = 0;
+    for (int i = meta.leafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode leaf = meta.leafNodes.get(i);
+      // no children
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + leaf.nodeData.length + dataWidth.fanOutWidth
+          + dataWidth.occurrencesWidth + leaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      leaf.nodeWidth = leafNodeWidth;
+      leaf.negativeIndex = negativeIndex;
+    }
+    for (int i = meta.nonLeafNodes.size() - 1; i >= 0; i--) {
+      TokenizerNode nonLeaf = meta.nonLeafNodes.get(i);
+      int leafNodeWidth =
+        dataWidth.nodeDataLengthWidth + nonLeaf.nodeData.length + dataWidth.fanOutWidth
+          + nonLeaf.children.size() + nonLeaf.children.size() * dataWidth.childNodeOffsetWidth
+          + dataWidth.occurrencesWidth + nonLeaf.numOccurrences * dataWidth.indexWidth;
+      negativeIndex += leafNodeWidth;
+      nonLeaf.nodeWidth = leafNodeWidth;
+      nonLeaf.negativeIndex = negativeIndex;
+    }
+
+    for (int i = 0; i < meta.nonLeafNodes.size(); i++) {
+      serialize(meta.nonLeafNodes.get(i), outputStream, dataWidth);
+    }
+    for (int i = 0; i < meta.leafNodes.size(); i++) {
+      serialize(meta.leafNodes.get(i), outputStream, dataWidth);
+    }
+  }
+
+  static void serialize(TokenizerNode node, ByteArrayOutputStream os, PrefixTreeDataWidth dataWidth)
+    throws IOException {
+    UFIntTool.writeBytes(dataWidth.nodeDataLengthWidth, node.nodeData.length, os);
+    os.write(node.nodeData, 0, node.nodeData.length);
+    UFIntTool.writeBytes(dataWidth.fanOutWidth, node.children.size(), os);
+    for (TokenizerNode child : node.children) {
+      // child's first byte.
+      os.write(child.nodeData[0]);
+    }
+    for (TokenizerNode child : node.children) {
+      UFIntTool.writeBytes(dataWidth.childNodeOffsetWidth, node.negativeIndex - child.negativeIndex,
+        os);
+    }
+    UFIntTool.writeBytes(dataWidth.occurrencesWidth, node.numOccurrences, os);
+    for (int i = 0; i < node.numOccurrences; i++) {
+      UFIntTool.writeBytes(dataWidth.indexWidth, node.index.get(i), os);
+    }
+  }
+
+  public static void serialize(DataOutput out, PrefixTreeDataWidth dataWidth) throws IOException {
+    out.writeByte(dataWidth.nodeDataLengthWidth);
+    out.writeByte(dataWidth.fanOutWidth);
+    out.writeByte(dataWidth.occurrencesWidth);
+    out.writeByte(dataWidth.indexWidth);
+    out.writeByte(dataWidth.childNodeOffsetWidth);
+  }
+
+  public static void deserialize(ByteBuff data, PrefixTreeDataWidth dataWidth) {
+    dataWidth.nodeDataLengthWidth = data.get();
+    dataWidth.fanOutWidth = data.get();
+    dataWidth.occurrencesWidth = data.get();
+    dataWidth.indexWidth = data.get();
+    dataWidth.childNodeOffsetWidth = data.get();
+  }
+
+  /**
+   * Get the node index, that search key >= index and search key < (index + 1)
+   */
+  public static int search(ByteBuffer data, int bbStartPos, byte[] skey, int keyStartPos,
+    PrefixTreeDataWidth meta) {
+    int nodeDataLength = getNodeDataLength(data, bbStartPos, meta);
+    int cs = ByteBufferUtils.compareTo(skey, keyStartPos,
+      Math.min(skey.length - keyStartPos, nodeDataLength), data,
+      bbStartPos + meta.nodeDataLengthWidth, nodeDataLength);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+
+    if (cs == 0) {
+      // continue search
+      if (fanOut == 0) {
+        // no children, should be numOccurrences > 0
+        int index = getNodeIndex(data, pos, 0, meta);
+        if (skey.length == keyStartPos + nodeDataLength) {
+          // == current node
+          return index;
+        } else {
+          // > current node.
+          return index;
+        }
+      }
+      if (skey.length > keyStartPos + nodeDataLength) {
+        int fanOffset = bbStartPos + meta.nodeDataLengthWidth + nodeDataLength + meta.fanOutWidth;
+        byte searchForByte = skey[keyStartPos + nodeDataLength];
+
+        int fanIndexInBlock =
+          unsignedBinarySearch(data, fanOffset, fanOffset + fanOut, searchForByte);
+        int nodeOffsetStartPos = fanOffset + fanOut;
+        if (fanIndexInBlock >= 0) {
+          // found it, but need to adjust for position of fan in overall block
+          int fanIndex = fanIndexInBlock - fanOffset;
+          int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanIndex, meta);
+          return search(data, bbStartPos + nodeOffset, skey, keyStartPos + nodeDataLength, meta);
+        } else {
+          int fanIndex = fanIndexInBlock + fanOffset;// didn't find it, so compensate in reverse
+          int insertionPoint = (-fanIndex - 1) - 1;
+          if (insertionPoint < 0) {
+            // < first children
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+            return getFirstLeafNode(data, bbStartPos + nodeOffset, meta) - 1;
+          } else {
+            int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, insertionPoint, meta);
+            return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+          }
+        }
+      } else {
+        // skey.length == keyStartPos + nodeDataLength
+        if (numOccurrences > 0) {
+          // == current node and current node is a leaf node.
+          return getNodeIndex(data, pos, 0, meta);
+        } else {
+          // need -1, == current node and current node not a leaf node.
+          return getFirstLeafNode(data, bbStartPos, meta) - 1;
+        }
+      }
+    } else if (cs > 0) {
+      // search key bigger than (>) current node, get biggest
+      if (fanOut == 0) {
+        if (numOccurrences > 0) {
+          if (numOccurrences == 1) {
+            return getNodeIndex(data, pos, 0, meta);
+          } else {
+            // TODO
+            throw new IllegalStateException(
+              "numOccurrences = " + numOccurrences + " > 1 not expected.");
+          }
+        } else {
+          throw new IllegalStateException(
+            "numOccurrences = " + numOccurrences + ", fanOut = " + fanOut + " not expected.");
+        }
+      } else {
+        return getLastLeafNode(data, bbStartPos, meta);
+      }
+    } else {
+      // search key small than (<) current node, get smallest.
+      if (numOccurrences > 0) {
+        return getNodeIndex(data, pos, 0, meta) - 1;
+      } else {
+        return getFirstLeafNode(data, bbStartPos, meta) - 1;
+      }
+    }
+  }
+
+  static int getNodeDataLength(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int dataLength = (int) UFIntTool.fromBytes(data, offset, meta.nodeDataLengthWidth);
+    return dataLength;
+  }
+
+  static int getNodeFanOut(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int fanOut = (int) UFIntTool.fromBytes(data, offset, meta.fanOutWidth);
+    return fanOut;
+  }
+
+  static int getNodeNumOccurrences(ByteBuffer data, int offset, PrefixTreeDataWidth meta) {
+    int numOccurrences = (int) UFIntTool.fromBytes(data, offset, meta.occurrencesWidth);
+    return numOccurrences;
+  }
+
+  static int getNodeOffset(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeOffset = (int) UFIntTool.fromBytes(data, offset + (index * meta.childNodeOffsetWidth),
+      meta.childNodeOffsetWidth);
+    return nodeOffset;
+  }
+
+  static int getNodeIndex(ByteBuffer data, int offset, int index, PrefixTreeDataWidth meta) {
+    int nodeIndex =
+      (int) UFIntTool.fromBytes(data, offset + (index * meta.indexWidth), meta.indexWidth);
+    return nodeIndex;
+  }
+
+  /**
+   * Get the node's first leaf node
+   */
+  static int getFirstLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0 || fanOut == 0) {
+      // return current node.
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, 0, meta);
+      return getFirstLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  /**
+   * Get the node's last leaf node
+   */
+  static int getLastLeafNode(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    // int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (fanOut == 0) {
+      return getNodeIndex(data, pos, 0, meta);
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, fanOut - 1, meta);
+      return getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+    }
+  }
+
+  public static int unsignedBinarySearch(ByteBuffer a, int fromIndex, int toIndex, byte key) {
+    int unsignedKey = key & 0xff;
+    int low = fromIndex;
+    int high = toIndex - 1;
+
+    while (low <= high) {
+      int mid = low + ((high - low) >> 1);
+      int midVal = a.get(mid) & 0xff;
+
+      if (midVal < unsignedKey) {
+        low = mid + 1;
+      } else if (midVal > unsignedKey) {
+        high = mid - 1;
+      } else {
+        return mid; // key found
+      }
+    }
+    return -(low + 1); // key not found.
+  }
+
+  public static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth dataWidth,
+    int index) {
+    return get(data, bbStartPos, dataWidth, index, new byte[0]);
+  }
+
+  static byte[] get(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    byte[] prefix) {
+    int dataLength = getNodeDataLength(data, bbStartPos, meta);
+    byte[] bdata = new byte[dataLength];
+    ByteBuffer dup = data.duplicate();
+    dup.position(bbStartPos + meta.nodeDataLengthWidth);
+    dup.get(bdata, 0, dataLength);
+    bdata = Bytes.add(prefix, bdata);
+
+    int pos = bbStartPos + meta.nodeDataLengthWidth + dataLength;
+    int fanOut = getNodeFanOut(data, pos, meta);
+    pos += meta.fanOutWidth + fanOut + fanOut * meta.childNodeOffsetWidth;
+    int numOccurrences = getNodeNumOccurrences(data, pos, meta);
+    pos += meta.occurrencesWidth;
+    if (numOccurrences > 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      }
+    }
+    if (fanOut == 0) {
+      int currentNodeIndex = getNodeIndex(data, pos, 0, meta);
+      if (currentNodeIndex == index) {
+        return bdata;
+      } else {
+        throw new IllegalStateException(
+          "Unexpected, search index=" + index + ", but find to " + currentNodeIndex);
+      }
+    } else {
+      int nodeOffsetStartPos =
+        bbStartPos + meta.nodeDataLengthWidth + dataLength + meta.fanOutWidth + fanOut;
+      int locateIndex = locateWhichChild(data, bbStartPos, meta, index, fanOut, nodeOffsetStartPos);
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, locateIndex, meta);
+      return get(data, bbStartPos + nodeOffset, meta, index, bdata);
+    }
+  }
+
+  static int locateWhichChild(ByteBuffer data, int bbStartPos, PrefixTreeDataWidth meta, int index,
+    int fanOut, int nodeOffsetStartPos) {
+    for (int i = 0; i < fanOut; i++) {
+      int nodeOffset = getNodeOffset(data, nodeOffsetStartPos, i, meta);
+      int lastLeafNode = getLastLeafNode(data, bbStartPos + nodeOffset, meta);
+      if (lastLeafNode >= index) {
+        return i;
+      }
+    }
+    throw new IllegalStateException("Unexpected unable to find index=" + index);
+  }
+
+  public static class TokenizerNode {

Review Comment:
   Then what is the IndexEncodedSeeker used for?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] HBASE-27329 Introduce prefix tree index block encoding use less space [hbase]

Posted by "bbeaudreault (via GitHub)" <gi...@apache.org>.

bbeaudreault commented on code in PR #4782:
URL: https://github.com/apache/hbase/pull/4782#discussion_r1459886922


##########
hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixTreeIndexBlockEncoderV2.java:
##########
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.io.encoding;
+
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.CellComparator;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.PrivateCellUtil;
+import org.apache.hadoop.hbase.io.ByteArrayOutputStream;
+import org.apache.hadoop.hbase.io.encoding.PrefixTreeUtil.PrefixTreeDataWidth;
+import org.apache.hadoop.hbase.io.encoding.PrefixTreeUtil.TokenizerNode;
+import org.apache.hadoop.hbase.io.util.UFIntTool;
+import org.apache.hadoop.hbase.nio.ByteBuff;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hadoop.hbase.util.ClassSize;
+import org.apache.hadoop.hbase.util.ObjectIntPair;
+import org.apache.yetus.audience.InterfaceAudience;
+
+@InterfaceAudience.Private
+public class PrefixTreeIndexBlockEncoderV2 implements IndexBlockEncoder {

Review Comment:
   Why are there 2 versions of this encoder? What's the difference? When to use one vs the other?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org