Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/03/09 23:28:56 UTC

[GitHub] [orc] autumnust opened a new pull request #651: [ORC-757] HashTable dictionary

autumnust opened a new pull request #651:
URL: https://github.com/apache/orc/pull/651


   This PR: 
   - Adds a straightforward hash-table-based implementation of the `Dictionary` interface. 
   - Refactors the RB-Tree code for reuse (e.g. `VisitorContextImpl`), moving it into `DictionaryUtils.java` so it can be shared between different implementations of the `Dictionary` interface. 
   - Enables the hash-based dictionary in existing tests that enable dictionary encoding. 
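   For illustration, the chained-hash-table approach described above can be sketched with standard Java types (this is a hypothetical toy, not the PR's `StringHashTableDictionary`; the real class stores UTF-8 bytes in a `DynamicByteArray` with a `DynamicIntArray` of offsets):

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Toy chained-hash dictionary: assigns a stable integer id to each
   // distinct string; collisions are resolved by walking a per-bucket chain.
   public class ChainedStringDictionary {
     private final List<byte[]> keys = new ArrayList<>();  // id -> UTF-8 bytes
     private final List<Integer>[] buckets;                // chains of ids

     @SuppressWarnings("unchecked")
     public ChainedStringDictionary(int bucketCount) {
       buckets = new List[bucketCount];
       for (int i = 0; i < bucketCount; i++) {
         buckets[i] = new ArrayList<>();
       }
     }

     /** Returns the id of the key, adding it if not yet present. */
     public int add(String key) {
       byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
       int bucket = (key.hashCode() & 0x7fffffff) % buckets.length;
       for (int id : buckets[bucket]) {          // walk the chain
         if (Arrays.equals(keys.get(id), bytes)) {
           return id;                            // already in the dictionary
         }
       }
       int id = keys.size();                     // new distinct key
       keys.add(bytes);
       buckets[bucket].add(id);
       return id;
     }

     public int size() {
       return keys.size();
     }
   }
   ```

   A hit costs one hash plus a chain walk, which is what makes bucket sizing and load factor matter in the benchmarks below.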
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   We found the RB-Tree-based dictionary implementation to be slow in our production workloads. The performance comparison for the new hash-table-based implementation will be done as part of ORC-50. 
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Mostly tested with newly added unit tests for the hash table, and by enabling the hash-table-based dictionary in some of the existing tests.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832850321


   > > > Updated the PR. @omalley @pgaref please take another look, thanks.
   > > 
   > > 
   > > Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
   > > Left some mostly minor comments, let me know what you think!
   > 
   > PR seems in pretty good shape already, @autumnust anything to polish/benchmark before going in? Whats the plan?
   
   @pgaref thanks for getting back. I am trying to finalize the bucketSize choice and tune things a bit more. I will publish a new benchmark later today. 





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613585262



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();

Review comment:
       Thanks for catching, removed. 







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620451284



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  // containing all keys every seen in bytes.
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // containing starting offset of the key (in byte) in the byte array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashBuckets;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int BUCKET_SIZE = 20;

Review comment:
       How did we come up with 20 here?







[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-833094987


   Also tried putting the init dict size as a parameter in the benchmark: 
   ```
   Benchmark                                    (dictImpl)  (initSize)  (upperBound)  Mode  Cnt       Score       Error  Units
   ORCWriterBenchMark.dictBench                     RBTREE        4096         10000  avgt    5   40363.137 ± 28516.301  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        4096          2500  avgt    5   24392.270 ±  8096.477  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        4096           500  avgt    5   20316.419 ±  2917.176  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        8192         10000  avgt    5   33112.917 ±  4040.524  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        8192          2500  avgt    5   27269.018 ±  1593.083  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        8192           500  avgt    5   21326.469 ±  1373.140  us/op
   ORCWriterBenchMark.dictBench                     RBTREE       10240         10000  avgt    5   33538.635 ±  5781.321  us/op
   ORCWriterBenchMark.dictBench                     RBTREE       10240          2500  avgt    5   27548.238 ±  3307.047  us/op
   ORCWriterBenchMark.dictBench                     RBTREE       10240           500  avgt    5   21053.961 ±  3118.891  us/op
   **ORCWriterBenchMark.dictBench                       HASH        4096         10000  avgt    5   19614.326 ±  1304.425  us/op**
   ORCWriterBenchMark.dictBench                       HASH        4096          2500  avgt    5   11529.653 ±  1851.182  us/op
   ORCWriterBenchMark.dictBench                       HASH        4096           500  avgt    5    9108.816 ±   837.123  us/op
   **ORCWriterBenchMark.dictBench                       HASH        8192         10000  avgt    5   15611.967 ±   583.613  us/op**
   ORCWriterBenchMark.dictBench                       HASH        8192          2500  avgt    5   13396.318 ±  3114.460  us/op
   ORCWriterBenchMark.dictBench                       HASH        8192           500  avgt    5   10742.425 ±  1031.070  us/op
   **ORCWriterBenchMark.dictBench                       HASH       10240         10000  avgt    5   17044.101 ±  1671.182  us/op**
   ORCWriterBenchMark.dictBench                       HASH       10240          2500  avgt    5   13767.572 ±   196.728  us/op
   ORCWriterBenchMark.dictBench                       HASH       10240           500  avgt    5   11120.604 ±   305.075  us/op
   ORCWriterBenchMark.dictBench                       NONE        4096         10000  avgt    5    4327.766 ±  1747.765  us/op
   ORCWriterBenchMark.dictBench                       NONE        4096          2500  avgt    5    4390.480 ±  2545.236  us/op
   ORCWriterBenchMark.dictBench                       NONE        4096           500  avgt    5   12315.912 ±  8071.684  us/op
   ORCWriterBenchMark.dictBench                       NONE        8192         10000  avgt    5    5529.683 ±  4187.802  us/op
   ORCWriterBenchMark.dictBench                       NONE        8192          2500  avgt    5    5461.490 ±   914.698  us/op
   ORCWriterBenchMark.dictBench                       NONE        8192           500  avgt    5    4745.401 ±  1097.454  us/op
   ORCWriterBenchMark.dictBench                       NONE       10240         10000  avgt    5    4734.983 ±   257.299  us/op
   ORCWriterBenchMark.dictBench                       NONE       10240          2500  avgt    5    4776.043 ±   690.286  us/op
   ORCWriterBenchMark.dictBench                       NONE       10240           500  avgt    5    4750.625 ±   440.191  us/op
   ```
   
   Obviously it is not the case that the larger the init size, the better the performance (considering the cost of resizing and the metadata-size overhead it brings). Again, the right size depends on the typical record size seen by the writer (which, together with the stripe size, determines the number of entries within a stripe). From the benchmark above, enlarging the size is only beneficial when collisions are common (e.g. in the `HASH` case when the upper bound is larger, 10k), but the benefit isn't consistent across different values of `upperBound`. 
   
   I am therefore exposing `initSize` as a config in `OrcConf` and keeping the default as the original. 
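   As a rough illustration of the collision point above (assuming uniform hashing; these numbers are arithmetic, not measurements from the benchmark): with chaining, the expected probes per lookup scale with distinct keys divided by bucket count, so a larger initial size only helps when the distinct-key count approaches the bucket count:

   ```java
   // Toy load-factor arithmetic: the expected chain length under chaining
   // is distinctKeys / bucketCount (assuming the hash spreads keys evenly).
   public class ChainLength {
     public static double expectedChainLength(int distinctKeys, int bucketCount) {
       return (double) distinctKeys / bucketCount;
     }

     public static void main(String[] args) {
       // 10k distinct keys over 4096 buckets: chains average ~2.4 entries,
       // so collisions are common and a larger table can help.
       System.out.println(expectedChainLength(10_000, 4096));
       // 500 distinct keys over 4096 buckets: the table is mostly empty,
       // so enlarging it buys nothing.
       System.out.println(expectedChainLength(500, 4096));
     }
   }
   ```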





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613637335



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;

Review comment:
    yeah, given Owen's recommendation on using linear probing, I will see if I can abstract this class a bit so that adding another collision-resolution impl. will be simpler. 







[GitHub] [orc] autumnust edited a comment on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832850321


   > > > Updated the PR. @omalley @pgaref please take another look, thanks.
   > > 
   > > 
   > > Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
   > > Left some mostly minor comments, let me know what you think!
   > 
   > PR seems in pretty good shape already, @autumnust anything to polish/benchmark before going in? Whats the plan?
   
   @pgaref thanks for getting back. I am trying to finalize the bucketSize choice and tune things a bit more. I will publish a new benchmark later today. Beyond this PR, I plan to try linear probing as the collision-resolution method and benchmark it against the current implementation to see if there is any room for improvement. 
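   For reference, the linear-probing alternative mentioned here could look roughly like the following toy sketch (hypothetical class and names, not the PR's code; a real implementation would keep the byte-array/offsets layout rather than `String` slots):

   ```java
   // Toy open-addressing dictionary: one entry per slot, collisions resolved
   // by scanning forward to the next free slot (linear probing), which tends
   // to be more cache-friendly than walking per-bucket chains.
   public class LinearProbingDictionary {
     private String[] slots;   // null = empty slot
     private int[] ids;        // id stored alongside each key
     private int size;

     public LinearProbingDictionary(int capacity) {
       slots = new String[capacity];
       ids = new int[capacity];
     }

     /** Returns the id of the key, adding it if absent. */
     public int add(String key) {
       if (size * 2 >= slots.length) {     // keep load factor under 0.5
         rehash(slots.length * 2);
       }
       int i = (key.hashCode() & 0x7fffffff) % slots.length;
       while (slots[i] != null) {
         if (slots[i].equals(key)) {
           return ids[i];                  // found: reuse the existing id
         }
         i = (i + 1) % slots.length;       // probe the next slot
       }
       slots[i] = key;
       ids[i] = size;
       return size++;
     }

     private void rehash(int newCapacity) {
       String[] oldSlots = slots;
       int[] oldIds = ids;
       slots = new String[newCapacity];
       ids = new int[newCapacity];
       for (int j = 0; j < oldSlots.length; j++) {
         if (oldSlots[j] != null) {        // re-insert at its new position
           int i = (oldSlots[j].hashCode() & 0x7fffffff) % slots.length;
           while (slots[i] != null) {
             i = (i + 1) % slots.length;
           }
           slots[i] = oldSlots[j];
           ids[i] = oldIds[j];
         }
       }
     }

     public int size() {
       return size;
     }
   }
   ```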





[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620449979



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table

Review comment:
       Not sure I understand the size reduction comment







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613598496



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;

Review comment:
    Let me wrap up the benchmark for this version, and we can then compare it against the linear-probing approach. But yeah, I agree that linear probing seems more promising. 







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613636689



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();

Review comment:
       Shrink it to a smaller bucket in the current chaining implementation. 







[GitHub] [orc] pgaref edited a comment on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-842585252


   > Gentle ping ... @pgaref
   
   Hey @autumnust thanks for pinging me, got distracted by various tasks last week!
   Also thanks for updating the benchmark, this helps a lot. Some comments:
   
   - I would expect NONE dictImpl  bench results to be identical across runs, why is 4096  X 500  run 3x the others?
   - Seems like an init size of 4k could be a good default conf in most cases -- would it make sense to remove the extra conf completely if we can avoid it? 
   - Minor: I would rename upperBound to distinctCount
   
   Let me know what you think





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626174029



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  // containing all keys every seen in bytes.
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // containing starting offset of the key (in byte) in the byte array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashBuckets;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int BUCKET_SIZE = 20;

Review comment:
    Can you elaborate a bit more on how 5/6 is deduced? The particular point I am confused about is its relation to the batch size, since the dictionary size is more closely associated with the stripe size, right? 







[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-824554472


   Updated the PR. @omalley  @pgaref  please take another look, thanks. 





[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-820145038


   Thanks @pgaref and @omalley both for your comments! Please kindly take a second look at some of the responses. For the conversations already resolved, I have addressed them locally and will post the changes together once all the other comments are settled.
   
   I am aiming to add the benchmark after that as part of this PR, and to add a version with linear probing as a comparison (probably in another PR).





[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-848350035


   > > Gentle ping ... @pgaref
   > 
   > Hey @autumnust thanks for pinging me, got distracted by various tasks last week!
   > Also thanks for updating the benchmark, this helps a lot. Some comments:
   > 
   > * I would expect NONE dictImpl  bench results to be identical across runs, why is 4096  X 500  run 3x the others?
   > * Seems like an init size of 4k could be a good default conf in most of the cases -- would it make sense remove the extra conf completely if we can avoid it?
   > * Minor: I would rename upperBound to distinctCount
   > 
   > Let me know what you think
   
   Hi @pgaref, thanks for getting back to me. Yeah, the 3x number in the 4096 X 500 case is weird, as I am no longer able to reproduce it. It may just be noise, since I am running this benchmark locally.
   I updated the PR for the rest of the comments; they all make sense to me, especially the point about keeping the configuration minimal (one of the other projects I am working on has this problem, and it results in a lot of confusion for beginner users).





[GitHub] [orc] dongjoon-hyun commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854805204


   Hi, @pgaref. Is this merged? Could you check the GitHub Actions status?
   - https://github.com/apache/orc/actions/workflows/build_and_test.yml?query=branch%3Amain





[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r598714046



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;

Review comment:
       As a broader note, would it make sense to move the base functionality to an abstract HT class as we do for RedBlackTree? 
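   For readers following along, the chaining scheme under discussion can be sketched as below. The class and method names are illustrative stand-ins, not the actual PR code: a `List<String>` replaces the `DynamicByteArray` plus offset array, and each bucket holds the insertion-order positions of the keys hashing to it.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a chaining hash dictionary: keys live in one
// append-only list (position = insertion order), and each bucket stores
// the positions of the keys whose hash maps to that bucket.
class ChainedDictionarySketch {
  private final List<String> keys = new ArrayList<>();        // position -> key
  private final List<List<Integer>> buckets = new ArrayList<>();

  ChainedDictionarySketch(int bucketCount) {
    for (int i = 0; i < bucketCount; i++) {
      buckets.add(new ArrayList<>());
    }
  }

  /** Returns the position of the key, adding it if it is new. */
  int add(String key) {
    int bucket = Math.floorMod(key.hashCode(), buckets.size());
    for (int position : buckets.get(bucket)) {
      if (keys.get(position).equals(key)) {
        return position;          // already present: reuse its position
      }
    }
    int position = keys.size();   // new key: next insertion-order position
    keys.add(key);
    buckets.get(bucket).add(position);
    return position;
  }

  String get(int position) {
    return keys.get(position);
  }
}
```

   Duplicate adds return the original position, which is the property a dictionary encoder relies on; the base-class question above is essentially about where this bucket bookkeeping should live.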







[GitHub] [orc] omalley commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r600844787



##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {
+    // Utility class does nothing in constructor
+  }
+
+  public static void getTextInternal(Text result, int position, DynamicIntArray keyOffsets, DynamicByteArray byteArray) {
+    int offset = keyOffsets.get(position);
+    int length;
+    if (position + 1 == keyOffsets.size()) {
+      length = byteArray.size() - offset;
+    } else {
+      length = keyOffsets.get(position + 1) - offset;
+    }
+    byteArray.setText(result, offset, length);
+  }
+
+  static class VisitorContextImpl implements Dictionary.VisitorContext {

Review comment:
       Sorry, I left off the context of why. We could pull the common code into a single class between the red-black tree and hash table.
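   The common lookup logic referred to here can be illustrated with a self-contained sketch; plain arrays stand in for `DynamicByteArray` and `DynamicIntArray`, and the class and method names are assumptions of this illustration, not the PR's API.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the offset-based lookup that getTextInternal performs:
// all keys are concatenated in one byte array, and keyOffsets[i] is
// where the i-th key starts. The i-th key's length is the distance to
// the next offset (or to the end of the byte array for the last key).
class OffsetLookupSketch {
  static String keyAt(byte[] bytes, int[] keyOffsets, int position) {
    int offset = keyOffsets[position];
    int length = (position + 1 == keyOffsets.length)
        ? bytes.length - offset
        : keyOffsets[position + 1] - offset;
    return new String(bytes, offset, length, StandardCharsets.UTF_8);
  }
}
```

   Since the lookup depends only on the two arrays and not on how positions were assigned, it is the same for the red-black tree and the hash table, which is what makes pulling it into a shared class attractive.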







[GitHub] [orc] dongjoon-hyun closed pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #651:
URL: https://github.com/apache/orc/pull/651


   





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613780163



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);

Review comment:
       Just to ensure, are you asking the line `newKey.set(bytes, offset, length);` in `add(byte[] bytes, int offset, int length)` can be removed since `newKey` will anyway be set here? 







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621720755



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table

Review comment:
       Removed this comment as it was meant for something that is no longer valid.







[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832534534


   > > Updated the PR. @omalley @pgaref please take another look, thanks.
   > 
   > Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
   > Left some mostly minor comments, let me know what you think!
   
   The PR seems in pretty good shape already, @autumnust. Anything to polish/benchmark before going in? What's the plan?





[GitHub] [orc] dongjoon-hyun commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854906883


   Thanks, @pgaref !





[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620453553



##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes that overrides the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Obtain the prefix of each string as its hash value.
+     * Every string used in this test suite contains its hash value as a prefix,
+     * so the traversal order of the traverse() method is known.
+     */
+    @Override
+    int getIndex(Text text) {
+      String s = text.toString();
+      int underscore = s.indexOf("_");
+      return Integer.parseInt(text.toString().substring(0, underscore));
+    }
+  }
+
+  @Test
+  public void test1()
+      throws Exception {
+    SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+    // Non-resize trivial cases
+    Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+    Text text = new Text();
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+    // entering the fourth and fifth element which triggers rehash
+    Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+    hashTableDictionary.getText(text, 3);
+    Assert.assertEquals("0_David", text.toString());
+    Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+    hashTableDictionary.getText(text, 4);
+    Assert.assertEquals("4_Eason", text.toString());
+
+    // Re-check that all previously inserted strings still map to the correct encoded values
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+
+    // The order of the words is based on each string's prefix, since their index in the hashArray is derived from it.
+    TestStringRedBlackTree
+        .checkContents(hashTableDictionary, new int[]{3, 2, 0, 1, 4}, "0_David", "1_Cindy", "2_Alice", "3_Bob",
+            "4_Eason");
+

Review comment:
       Just for correctness, making sure that both dictionaries write the same strings -- you are doing the same thing above, so it is not that important.







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620443005



##########
File path: java/bench/core/src/java/org/apache/orc/bench/core/IOCounters.java
##########
@@ -48,23 +50,26 @@ public void print() {
     if (recordCounters != null) {
       recordCounters.print();
     }
-    System.out.println("Reads: " + reads);
-    System.out.println("Bytes: " + bytesRead);
+    System.out.println("io: " + io);
+    System.out.println("Bytes: " + bytesIO);
   }
 
   public double bytesPerRecord() {
     return recordCounters == null || recordCounters.records == 0 ?
-        0 : ((double) bytesRead) / recordCounters.records;
+        0 : ((double) bytesIO) / recordCounters.records;
   }
 
   public long records() {
     return recordCounters == null || recordCounters.invocations == 0 ?
         0 : recordCounters.records / recordCounters.invocations;
   }
 
-  public long reads() {
+  /**
+   * Capture the number of I/O on average in each invocation.
+   */
+  public long iOs() {

Review comment:
       IOps?







[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-842536951


   Gentle ping ... @pgaref  





[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-837051398


   @pgaref @omalley Gentle ping for the review. Please let me know if there are more comments, thanks!





[GitHub] [orc] omalley commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
omalley commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-806242027


   Of course, it would be really good to have a benchmark in the benchmark module that lets us test the two implementations.





[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r622051809



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review comment:
       Let's add another comment here, please.
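   For illustration, a comment of the kind requested might look like the sketch below. The stated rationale (mirroring the JDK collections convention that some VMs reserve a few header words in arrays) is an assumption of this sketch, not text from the PR.

```java
// Sketch of the capped-threshold computation with an explanatory comment.
class ThresholdSketch {
  // Maximum array size to request. Some JVMs reserve a few header words
  // in arrays, so sizing requests above Integer.MAX_VALUE - 8 elements
  // may fail with OutOfMemoryError (same convention as JDK collections).
  static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  static int threshold(int capacity, float loadFactor) {
    // Resize once the number of keys crosses capacity * loadFactor,
    // but never plan for more entries than an array can hold.
    return (int) Math.min(capacity * loadFactor, MAX_ARRAY_SIZE + 1);
  }
}
```

   With the defaults in the PR (load factor 0.75), an initial capacity of 4096 would give a resize threshold of 3072 under this formula.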







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613611182



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // if making it here, it means no match.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and find the corresponding index.
+   *
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;

Review comment:
       @pgaref  it was meant to mask off the sign bit so the dividend is always non-negative. But yeah, I will take Owen's suggestion to use the Math library instead for better readability. 
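For readers following this thread, the two indexing approaches under discussion can be compared in a small self-contained sketch (plain JDK, no ORC classes; the `maskIndex`/`floorModIndex` names are just for illustration, not names in the patch):

```java
public class IndexDemo {
    // Mask off the sign bit, as in the current patch: the masked hash is
    // always non-negative, so the remainder is a valid bucket index.
    static int maskIndex(int hash, int capacity) {
        return (hash & 0x7FFFFFFF) % capacity;
    }

    // Owen's suggestion: Math.floorMod is always non-negative when the
    // divisor is positive, regardless of the sign of the hash.
    static int floorModIndex(int hash, int capacity) {
        return Math.floorMod(hash, capacity);
    }

    public static void main(String[] args) {
        int capacity = 5;
        // Both stay within [0, capacity) even for negative hash codes,
        // but they are not bit-identical for every input:
        // Integer.MIN_VALUE & 0x7FFFFFFF == 0, while floorMod keeps the
        // mathematical remainder.
        System.out.println(maskIndex(Integer.MIN_VALUE, capacity));     // 0
        System.out.println(floorModIndex(Integer.MIN_VALUE, capacity)); // 2
    }
}
```

Either variant yields a valid bucket index; floorMod just makes the non-negativity explicit without bit tricks.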







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626906494



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review comment:
       Will address. 







[GitHub] [orc] autumnust edited a comment on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-833094987


   Also tried putting the init dict size as a parameter in the benchmark: 
   <pre>Benchmark                                    (dictImpl)  (initSize)  (upperBound)  Mode  Cnt       Score       Error  Units
   ORCWriterBenchMark.dictBench                     RBTREE        4096         10000  avgt    5   40363.137 ± 28516.301  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        4096          2500  avgt    5   24392.270 ±  8096.477  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        4096           500  avgt    5   20316.419 ±  2917.176  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        8192         10000  avgt    5   33112.917 ±  4040.524  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        8192          2500  avgt    5   27269.018 ±  1593.083  us/op
   ORCWriterBenchMark.dictBench                     RBTREE        8192           500  avgt    5   21326.469 ±  1373.140  us/op
   ORCWriterBenchMark.dictBench                     RBTREE       10240         10000  avgt    5   33538.635 ±  5781.321  us/op
   ORCWriterBenchMark.dictBench                     RBTREE       10240          2500  avgt    5   27548.238 ±  3307.047  us/op
   ORCWriterBenchMark.dictBench                     RBTREE       10240           500  avgt    5   21053.961 ±  3118.891  us/op
   <b>ORCWriterBenchMark.dictBench                       HASH        4096         10000  avgt    5   19614.326 ±  1304.425  us/op</b>
   ORCWriterBenchMark.dictBench                       HASH        4096          2500  avgt    5   11529.653 ±  1851.182  us/op
   ORCWriterBenchMark.dictBench                       HASH        4096           500  avgt    5    9108.816 ±   837.123  us/op
   <b>ORCWriterBenchMark.dictBench                       HASH        8192         10000  avgt    5   15611.967 ±   583.613  us/op</b>
   ORCWriterBenchMark.dictBench                       HASH        8192          2500  avgt    5   13396.318 ±  3114.460  us/op
   ORCWriterBenchMark.dictBench                       HASH        8192           500  avgt    5   10742.425 ±  1031.070  us/op
   <b>ORCWriterBenchMark.dictBench                       HASH       10240         10000  avgt    5   17044.101 ±  1671.182  us/op</b>
   ORCWriterBenchMark.dictBench                       HASH       10240          2500  avgt    5   13767.572 ±   196.728  us/op
   ORCWriterBenchMark.dictBench                       HASH       10240           500  avgt    5   11120.604 ±   305.075  us/op
   ORCWriterBenchMark.dictBench                       NONE        4096         10000  avgt    5    4327.766 ±  1747.765  us/op
   ORCWriterBenchMark.dictBench                       NONE        4096          2500  avgt    5    4390.480 ±  2545.236  us/op
   ORCWriterBenchMark.dictBench                       NONE        4096           500  avgt    5   12315.912 ±  8071.684  us/op
   ORCWriterBenchMark.dictBench                       NONE        8192         10000  avgt    5    5529.683 ±  4187.802  us/op
   ORCWriterBenchMark.dictBench                       NONE        8192          2500  avgt    5    5461.490 ±   914.698  us/op
   ORCWriterBenchMark.dictBench                       NONE        8192           500  avgt    5    4745.401 ±  1097.454  us/op
   ORCWriterBenchMark.dictBench                       NONE       10240         10000  avgt    5    4734.983 ±   257.299  us/op
   ORCWriterBenchMark.dictBench                       NONE       10240          2500  avgt    5    4776.043 ±   690.286  us/op
   ORCWriterBenchMark.dictBench                       NONE       10240           500  avgt    5    4750.625 ±   440.191  us/op</pre>
   
   Obviously it is not the case that a larger init size always gives better performance (considering the cost of resizing and the metadata-size overhead it brings). The right size depends on the typical record size seen by the writer (which, together with the stripe size, determines the number of entries within a stripe). From the benchmark above, enlarging the size only helps when collisions are common (e.g. in the `HASH` case with the larger upper bound of 10k), and even then the benefit isn't consistent across values of `upperBound`. 
   
   I am therefore putting `initSize` as a config in `OrcConf` and keeping the default at the original value. 





[GitHub] [orc] omalley commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r600838359



##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {
+    // Utility class does nothing in constructor
+  }
+
+  public static void getTextInternal(Text result, int position, DynamicIntArray keyOffsets, DynamicByteArray byteArray) {
+    int offset = keyOffsets.get(position);
+    int length;
+    if (position + 1 == keyOffsets.size()) {
+      length = byteArray.size() - offset;
+    } else {
+      length = keyOffsets.get(position + 1) - offset;
+    }
+    byteArray.setText(result, offset, length);
+  }
+
+  static class VisitorContextImpl implements Dictionary.VisitorContext {

Review comment:
       I'd think that a common class (named DictionaryStorage?) that implemented VisitorContext and had the byteArray and keyOffsets would make sense. Each of the dictionaries could have a field that held the reference.
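A rough sketch of that shared-storage idea, using plain JDK types (`ByteArrayOutputStream` and `List<Integer>`) as stand-ins for ORC's `DynamicByteArray` and `DynamicIntArray`; the class name follows Owen's proposed `DictionaryStorage` but is not committed API:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// One class owning the UTF-8 byte buffer plus per-key start offsets, so
// each dictionary implementation holds a reference instead of duplicating
// the getTextInternal offset arithmetic.
class DictionaryStorageSketch {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final List<Integer> keyOffsets = new ArrayList<>();

    // Append a key's UTF-8 bytes; return its dictionary id.
    int add(byte[] utf8) {
        keyOffsets.add(bytes.size());
        bytes.write(utf8, 0, utf8.length);
        return keyOffsets.size() - 1;
    }

    // Same offset/length arithmetic as DictionaryUtils.getTextInternal:
    // a key's length is the gap to the next offset (or to the buffer end).
    String get(int position) {
        int offset = keyOffsets.get(position);
        int end = position + 1 == keyOffsets.size()
            ? bytes.size() : keyOffsets.get(position + 1);
        return new String(bytes.toByteArray(), offset, end - offset,
                          StandardCharsets.UTF_8);
    }

    int size() { return keyOffsets.size(); }
}
```

With this shape, both the red-black-tree and hash-table dictionaries would delegate `getText` and `size` to the shared storage and keep only their lookup structure private.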

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;

Review comment:
       I suspect that we'd get better performance from using linear probing into a single DynamicIntArray, although we might want to drop the max load factor.
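A minimal sketch of the linear-probing alternative being suggested here, with `String` keys standing in for the UTF-8 byte storage; resizing and load-factor handling are omitted, so it assumes fewer entries than capacity:

```java
import java.util.Arrays;

// Open addressing with linear probing: a single flat slot array instead
// of one DynamicIntArray chain per bucket. Each slot holds a key id, or
// -1 when empty.
class LinearProbingSketch {
    private final String[] keys;  // key id -> key (stand-in for byteArray+offsets)
    private final int[] slots;    // probe table: key id or -1
    private int size;

    LinearProbingSketch(int capacity) {
        keys = new String[capacity];
        slots = new int[capacity];
        Arrays.fill(slots, -1);
    }

    // Returns the existing id on a match, or assigns the next id.
    int add(String key) {
        int i = Math.floorMod(key.hashCode(), slots.length);
        while (slots[i] != -1) {            // probe until an empty slot or a match
            if (keys[slots[i]].equals(key)) {
                return slots[i];
            }
            i = (i + 1) % slots.length;
        }
        keys[size] = key;
        slots[i] = size;
        return size++;
    }
}
```

The appeal is cache locality: probes walk one contiguous int array rather than chasing a per-bucket object, at the cost of needing a lower maximum load factor, as noted above.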

##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {

Review comment:
       Actually, I don't mind this style for utility classes, since it prevents anyone from accidentally creating a useless instance. That said, I suspect that this should probably be refactored.

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();

Review comment:
       Yeah, that is too large. If we have more than a handful of collisions the table is too small or the function isn't good.

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // if making it here, it means no match.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and find the corresponding index.
+   *
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;

Review comment:
       I'd be tempted to use Math.floorMod here, which would always be positive.







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613763763



##########
File path: java/core/src/test/org/apache/orc/TestStringDictionary.java
##########
@@ -114,10 +117,8 @@ public void testTooManyDistinct() throws Exception {
     }
   }
 
-  @Test
-  public void testHalfDistinct() throws Exception {
+  public void testHalfDistinctHelper(Configuration conf) throws Exception {

Review comment:
       Yes, that makes sense, will address.







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621720363



##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/ORCWriterBenchMark.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench.hive;
+
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.core.IOCounters;
+import org.apache.orc.bench.core.OrcBenchmark;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.runner.Runner;
+
+import com.google.auto.service.AutoService;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@State(Scope.Thread)
+@AutoService(OrcBenchmark.class)
+public class ORCWriterBenchMark implements OrcBenchmark {
+  private static final Path root = Utilities.getBenchmarkRoot();
+  private Path dumpDir = new Path(root, "dumpDir");
+  private List<VectorizedRowBatch> batches = new ArrayList<>();
+
+  @Override
+  public String getName() {
+    return "write";
+  }
+
+  @Override
+  public String getDescription() {
+    return "Benchmark ORC Writer with different options";
+  }
+
+  @Setup(Level.Trial)
+  public void prepareData() {
+    TypeDescription schema = TypeDescription.fromString("struct<str:string>");
+    VectorizedRowBatch batch = schema.createRowBatch();
+    BytesColumnVector strVector = (BytesColumnVector) batch.cols[0];
+
+    Random rand = new Random();
+    for (int i = 0; i < 32 * 1024; i++) {
+      if (batch.size == batch.getMaxSize()) {
+        batches.add(batch);
+        batch = schema.createRowBatch();
+        strVector = (BytesColumnVector) batch.cols[0];
+      }
+      int randomNum = rand.nextInt(10000 - 10 + 1) + 10;

Review comment:
       Addressed.







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613781297



##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes that overrides the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Obtain the prefix for each string as the hash value.
+     * Every string used in this test suite contains its hash value as a prefix of the string content;
+     * this way we know the order of the traverse() method.
+     */
+    @Override
+    int getIndex(Text text) {
+      String s = text.toString();
+      int underscore = s.indexOf("_");
+      return Integer.parseInt(text.toString().substring(0, underscore));
+    }
+  }
+
+  @Test
+  public void test1()
+      throws Exception {
+    SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+    // Non-resize trivial cases
+    Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+

Review comment:
       Hmm, maybe I am missing something, but why is a byte check needed, since it is the content of `Text` we care about? Isn't add-then-read enough for correctness purposes?







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620451724



##########
File path: java/core/src/java/org/apache/orc/impl/VisitorContextImpl.java
##########
@@ -0,0 +1,74 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ *

Review comment:
       Missing doc?







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r622051672



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of the
+ * hash table, as that shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  // contains all keys ever seen, in bytes.
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // containing starting offset of the key (in byte) in the byte array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashBuckets;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int BUCKET_SIZE = 20;

Review comment:
       Was just thinking: for the default batch size of 1024, we use Dict encoding if fewer than 20% of the values are distinct, so we would need around 0.6 - 0.7 of the batch size?
   In that sense a bucketSize of 5 or 6 could be enough, right?
   
   Might be missing something, but let's add a comment here explaining the logic







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626177117



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of the
+ * hash table, as that shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  // contains all keys ever seen, in bytes.
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // containing starting offset of the key (in byte) in the byte array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashBuckets;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int BUCKET_SIZE = 20;

Review comment:
       My rough estimate: 
   Given the default stripe size of 64MB, and assuming each row is around 500B, 
   
   `4096 * 0.75 (capacity without resize) * avgBucketSize * 5 (20% distinct) = 64 * 1024 * 1024 / 500`
   
   avgBucketSize ~8 in this case. Let me know if it makes sense / converges with your thoughts 
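The back-of-envelope estimate above can be checked directly. The constants below (64MB stripe, ~500B rows, initial capacity 4096, load factor 0.75, 20% distinct keys) are the assumptions made in this comment thread, not values read from the ORC code:

```java
// Sanity-check of the avgBucketSize estimate in the comment above.
// All constants are the comment's assumptions, not ORC defaults.
public class BucketSizeEstimate {
    static double avgBucketSize() {
        double keysPerStripe = 64.0 * 1024 * 1024 / 500; // ~134k rows in a 64MB stripe at ~500B/row
        double buckets = 4096 * 0.75;                    // usable capacity before a resize is triggered
        double distinctFactor = 5;                       // 20% distinct => 1 / 0.2
        return keysPerStripe / (buckets * distinctFactor);
    }

    public static void main(String[] args) {
        System.out.println(avgBucketSize()); // ~8.7, consistent with the "~8" above
    }
}
```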







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620448557



##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/ORCWriterBenchMark.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench.hive;
+
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.core.IOCounters;
+import org.apache.orc.bench.core.OrcBenchmark;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.runner.Runner;
+
+import com.google.auto.service.AutoService;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@State(Scope.Thread)
+@AutoService(OrcBenchmark.class)
+public class ORCWriterBenchMark implements OrcBenchmark {
+  private static final Path root = Utilities.getBenchmarkRoot();
+  private Path dumpDir = new Path(root, "dumpDir");
+  private List<VectorizedRowBatch> batches = new ArrayList<>();
+
+  @Override
+  public String getName() {
+    return "write";
+  }
+
+  @Override
+  public String getDescription() {
+    return "Benchmark ORC Writer with different options";
+  }
+
+  @Setup(Level.Trial)
+  public void prepareData() {
+    TypeDescription schema = TypeDescription.fromString("struct<str:string>");
+    VectorizedRowBatch batch = schema.createRowBatch();
+    BytesColumnVector strVector = (BytesColumnVector) batch.cols[0];
+
+    Random rand = new Random();
+    for (int i = 0; i < 32 * 1024; i++) {
+      if (batch.size == batch.getMaxSize()) {
+        batches.add(batch);
+        batch = schema.createRowBatch();
+        strVector = (BytesColumnVector) batch.cols[0];
+      }
+      int randomNum = rand.nextInt(10000 - 10 + 1) + 10;

Review comment:
       Ideally I would like to see how the number of distinct dict keys affects performance per data structure.
   Can we make the above configurable during data generation?
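One hedged sketch of the suggestion above: parameterize the number of distinct keys during data generation. In the JMH benchmark this knob would naturally be a `@Param` field; the plain method below only illustrates the idea, and the names (`DataGen`, `generate`) are made up for the sketch:

```java
import java.util.Random;

// Illustrative sketch: data generation with a configurable distinct-key count.
// In ORCWriterBenchMark this would be a JMH @Param so each run measures one
// cardinality; here it is just a method argument.
public class DataGen {
    static String[] generate(int rows, int distinctKeys, long seed) {
        Random rand = new Random(seed);
        String[] out = new String[rows];
        for (int i = 0; i < rows; i++) {
            // Drawing from a bounded key space caps the dictionary cardinality.
            out[i] = "key-" + rand.nextInt(distinctKeys);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] data = generate(32 * 1024, 100, 42L);
        System.out.println(data.length); // 32768 rows, at most 100 distinct values
    }
}
```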







[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854855642


   > Hi, @pgaref . Is this merged? Could you check GitHub Action status?
   > 
   > * https://github.com/apache/orc/actions/workflows/build_and_test.yml?query=branch%3Amain
   > 
   > <img alt="Screen Shot 2021-06-04 at 8 44 43 AM" width="490" src="https://user-images.githubusercontent.com/9700541/120828249-27eea800-c511-11eb-9bd8-eecde81a61a8.png">
   
   Hey @dongjoon-hyun  yes, this is merged, but it broke some tests when I manually set HASH as the default dictionary.
   This should be resolved now with ORC-810





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613779065



##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes that overrides the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Obtain the prefix for each string as the hash value.
+     * Every string used in this test suite contains its hash value as a prefix of the string content;
+     * this way we know the order of the traverse() method.
+     */
+    @Override
+    int getIndex(Text text) {
+      String s = text.toString();
+      int underscore = s.indexOf("_");
+      return Integer.parseInt(text.toString().substring(0, underscore));
+    }
+  }
+
+  @Test
+  public void test1()
+      throws Exception {
+    SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+    // Non-resize trivial cases
+    Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+    Text text = new Text();
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+    // entering the fourth and fifth element which triggers rehash
+    Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+    hashTableDictionary.getText(text, 3);
+    Assert.assertEquals("0_David", text.toString());
+    Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+    hashTableDictionary.getText(text, 4);
+    Assert.assertEquals("4_Eason", text.toString());
+
+    // Re-ensure all previously existing strings still have correct encoded values
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+
+    // The order of words is based on each string's prefix, since their index in the hashArray is derived from it.
+    TestStringRedBlackTree
+        .checkContents(hashTableDictionary, new int[]{3, 2, 0, 1, 4}, "0_David", "1_Cindy", "2_Alice", "3_Bob",
+            "4_Eason");
+

Review comment:
       Not sure if I understand it. Why do we need to compare different implementations here?  







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613780510



##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes that overrides the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Obtain the prefix for each string as the hash value.
+     * Every string used in this test suite contains its hash value as a prefix of the string content;
+     * this way we know the order of the traverse() method.
+     */
+    @Override
+    int getIndex(Text text) {

Review comment:
       Will address. 







[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613430970



##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {

Review comment:
       +1. I got feedback to add a private constructor for utility classes at some point and inherited this "convention" afterwards. Here is the SO page on this topic: https://stackoverflow.com/questions/14398747/hide-utility-class-constructor-utility-classes-should-not-have-a-public-or-def 







[GitHub] [orc] autumnust edited a comment on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832850321


   > > > Updated the PR. @omalley @pgaref please take another look, thanks.
   > > 
   > > 
   > > Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
   > > Left some mostly minor comments, let me know what you think!
   > 
   > PR seems in pretty good shape already, @autumnust anything to polish/benchmark before going in? Whats the plan?
   
   @pgaref  thanks for getting back. I was trying to finalize the bucketSize choice and tune things a bit more (for the dict size). Will publish a new benchmark later today. Beyond this PR, I am planning to try linear probing as the collision-resolution method and benchmark it against the current impl. to see if there's any room for improvement. 
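Since linear probing comes up here as a follow-up experiment, a minimal sketch of that alternative for a string-to-id dictionary is shown below. This is not the PR's implementation — it skips resizing and the DynamicByteArray storage entirely — it only illustrates the probing loop that would replace the per-bucket chains:

```java
// Illustrative open-addressing dictionary: maps each String to a dense id the
// way Dictionary#add does, but resolves collisions by probing the next slot
// instead of chaining. Resizing is omitted for brevity; a real implementation
// must grow before the table fills, or the probe loop never terminates.
public class LinearProbingSketch {
    private final String[] keys;
    private final int[] ids;
    private int size;

    public LinearProbingSketch(int capacity) {
        keys = new String[capacity];
        ids = new int[capacity];
    }

    /** Returns the existing id for key, or assigns the next id. */
    public int add(String key) {
        int i = (key.hashCode() & 0x7FFFFFFF) % keys.length;
        while (keys[i] != null) {
            if (keys[i].equals(key)) {
                return ids[i];           // already present
            }
            i = (i + 1) % keys.length;   // linear probe: try the next slot
        }
        keys[i] = key;
        ids[i] = size;
        return size++;
    }
}
```

The trade-off to benchmark is cache locality of the flat arrays versus the clustering that linear probing suffers at high load factors.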





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613635755



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of the
+ * hash table, as that shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of the i-th key (in bytes) in the byte array.
+  // The two combined store the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // if making it here, it means no match.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and find the corresponding index.
+   *
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;
+  }
+
+  // Resize the hash table, re-hash all the existing keys.
+  // byteArray and keyOffsetsArray don't have to be re-filled.
+  private void doResize(int newSize) {
+    DynamicIntArray[] resizedHashArray = new DynamicIntArray[newSize];
+    for (int i = 0; i < newSize; i++) {
+      resizedHashArray[i] = new DynamicIntArray();
+    }
+
+    Text tmpText = new Text();
+    for (int i = 0; i < capacity; i++) {
+      DynamicIntArray intArray = hashArray[i];
+      int bucketSize = intArray.size();
+      if (bucketSize > 0) {
+        for (int j = 0; j < bucketSize; j++) {
+          getText(tmpText, intArray.get(j));
+          int newIndex = getIndex(tmpText);
+          resizedHashArray[newIndex].add(intArray.get(j));
+        }
+      }
+    }
+
+    Arrays.fill(hashArray, null);

Review comment:
       I was thinking to set everything to null in case some elements in the array are still referenced, which would keep the space from being garbage-collected. Do you think that's fair? I don't have a strong opinion about it though. 







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626395478



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  // containing all keys ever seen, in bytes.
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // containing starting offset of the key (in byte) in the byte array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashBuckets;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int BUCKET_SIZE = 20;

Review comment:
       Hey @autumnust  you are right, it is more associated with the stripe size so your estimation makes sense. Thanks for clarifying!  







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620439426



##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/ORCWriterBenchMark.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench.hive;
+
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.core.IOCounters;
+import org.apache.orc.bench.core.OrcBenchmark;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.runner.Runner;
+
+import com.google.auto.service.AutoService;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@State(Scope.Thread)
+@AutoService(OrcBenchmark.class)
+public class ORCWriterBenchMark implements OrcBenchmark {
+  private static final Path root = Utilities.getBenchmarkRoot();
+  private Path dumpDir = new Path(root, "dumpDir");
+  private List<VectorizedRowBatch> batches = new ArrayList<>();
+
+  @Override
+  public String getName() {
+    return "write";
+  }
+
+  @Override
+  public String getDescription() {
+    return "Benchmark ORC Writer with different options";

Review comment:
       DICT options for now







[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-828170668


   Here is the new jmh result: 
   
   ```# Run complete. Total time: 00:10:35
   Benchmark                                    (dictImpl)  (upperBound)  Mode  Cnt       Score      Error  Units
   ORCWriterBenchMark.dictBench                     RBTREE         10000  avgt    5   28939.068 ± 3080.947  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE         10000  avgt    5      49.963                 #
   ORCWriterBenchMark.dictBench:ops                 RBTREE         10000  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord           RBTREE         10000  avgt    5       0.883 ±    0.094  us/op
   ORCWriterBenchMark.dictBench:records             RBTREE         10000  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                     RBTREE          2500  avgt    5   21998.781 ± 1448.300  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE          2500  avgt    5      23.532                 #
   ORCWriterBenchMark.dictBench:ops                 RBTREE          2500  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord           RBTREE          2500  avgt    5       0.671 ±    0.044  us/op
   ORCWriterBenchMark.dictBench:records             RBTREE          2500  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                     RBTREE           500  avgt    5   17730.281 ± 4574.132  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE           500  avgt    5      13.156                 #
   ORCWriterBenchMark.dictBench:ops                 RBTREE           500  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord           RBTREE           500  avgt    5       0.541 ±    0.140  us/op
   ORCWriterBenchMark.dictBench:records             RBTREE           500  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                       HASH         10000  avgt    5   21269.613 ± 4137.763  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord        HASH         10000  avgt    5      42.268                 #
   ORCWriterBenchMark.dictBench:ops                   HASH         10000  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord             HASH         10000  avgt    5       0.649 ±    0.126  us/op
   ORCWriterBenchMark.dictBench:records               HASH         10000  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                       HASH          2500  avgt    5   11586.898 ± 4075.783  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord        HASH          2500  avgt    5      17.692                 #
   ORCWriterBenchMark.dictBench:ops                   HASH          2500  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord             HASH          2500  avgt    5       0.354 ±    0.124  us/op
   ORCWriterBenchMark.dictBench:records               HASH          2500  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                       HASH           500  avgt    5    9646.080 ± 2279.530  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord        HASH           500  avgt    5      11.613                 #
   ORCWriterBenchMark.dictBench:ops                   HASH           500  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord             HASH           500  avgt    5       0.294 ±    0.070  us/op
   ORCWriterBenchMark.dictBench:records               HASH           500  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                       NONE         10000  avgt    5    4077.675 ±  117.606  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord        NONE         10000  avgt    5      50.146                 #
   ORCWriterBenchMark.dictBench:ops                   NONE         10000  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord             NONE         10000  avgt    5       0.124 ±    0.004  us/op
   ORCWriterBenchMark.dictBench:records               NONE         10000  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                       NONE          2500  avgt    5    4607.634 ± 1163.084  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord        NONE          2500  avgt    5      50.146                 #
   ORCWriterBenchMark.dictBench:ops                   NONE          2500  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord             NONE          2500  avgt    5       0.141 ±    0.035  us/op
   ORCWriterBenchMark.dictBench:records               NONE          2500  avgt    5  163840.000                 #
   ORCWriterBenchMark.dictBench                       NONE           500  avgt    5    3783.059 ±  367.511  us/op
   ORCWriterBenchMark.dictBench:bytesPerRecord        NONE           500  avgt    5      50.146                 #
   ORCWriterBenchMark.dictBench:ops                   NONE           500  avgt    5         ≈ 0                 #
   ORCWriterBenchMark.dictBench:perRecord             NONE           500  avgt    5       0.115 ±    0.011  us/op
   ORCWriterBenchMark.dictBench:records               NONE           500  avgt    5  163840.000                 #
   ```
   
   Unfortunately the previous implementation had a bug that ended up with great locality (but incorrect results). HASH is still much better than RB-Tree, but obviously we need to iterate further to improve it. 





[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-842585252


   > 10000
   
   Hey @autumnust thanks for pinging me, got distracted by various tasks last week!
   Also thanks for updating the benchmark, this helps a lot. Some comments:
   
   - I would expect the NONE dictImpl bench results to be identical across runs; why is the 4096 X 500 run 3x the others?
   - Seems like an init size of 4k could be a good default conf in most of the cases -- would it make sense to remove the extra conf completely if we can avoid it? 
   - Minor: I would rename upperBound to distinctCount
   
   Let me know what you think





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621639554



##########
File path: java/bench/core/src/java/org/apache/orc/bench/core/IOCounters.java
##########
@@ -48,23 +50,26 @@ public void print() {
     if (recordCounters != null) {
       recordCounters.print();
     }
-    System.out.println("Reads: " + reads);
-    System.out.println("Bytes: " + bytesRead);
+    System.out.println("io: " + io);
+    System.out.println("Bytes: " + bytesIO);
   }
 
   public double bytesPerRecord() {
     return recordCounters == null || recordCounters.records == 0 ?
-        0 : ((double) bytesRead) / recordCounters.records;
+        0 : ((double) bytesIO) / recordCounters.records;
   }
 
   public long records() {
     return recordCounters == null || recordCounters.invocations == 0 ?
         0 : recordCounters.records / recordCounters.invocations;
   }
 
-  public long reads() {
+  /**
+   * Capture the number of I/O on average in each invocation.
+   */
+  public long iOs() {

Review comment:
       Thanks, that's a much better name! 

##########
File path: java/bench/core/src/java/org/apache/orc/bench/core/IOCounters.java
##########
@@ -48,23 +50,26 @@ public void print() {
     if (recordCounters != null) {
       recordCounters.print();
     }
-    System.out.println("Reads: " + reads);
-    System.out.println("Bytes: " + bytesRead);
+    System.out.println("io: " + io);
+    System.out.println("Bytes: " + bytesIO);
   }
 
   public double bytesPerRecord() {
     return recordCounters == null || recordCounters.records == 0 ?
-        0 : ((double) bytesRead) / recordCounters.records;
+        0 : ((double) bytesIO) / recordCounters.records;
   }
 
   public long records() {
     return recordCounters == null || recordCounters.invocations == 0 ?
         0 : recordCounters.records / recordCounters.invocations;
   }
 
-  public long reads() {
+  /**
+   * Capture the number of I/O on average in each invocation.
+   */
+  public long iOs() {

Review comment:
       Second thoughts: there's no "per second", so I may just call it `ops`, wdyt? 







[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r598636479



##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {

Review comment:
       Remove empty constructor and add comment as class doc?

##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {
+    // Utility class does nothing in constructor
+  }
+
+  public static void getTextInternal(Text result, int position, DynamicIntArray keyOffsets, DynamicByteArray byteArray) {

Review comment:
       doc method please?

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();

Review comment:
       resizeIfNeeded is called twice here -- I believe it's safe to remove this call
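       As a sketch of the suggestion, the outer overload can delegate without its own resize check so that resizeIfNeeded runs exactly once per add. The classes below are a hypothetical simplification using plain java.util collections, not the PR's DynamicIntArray-based implementation.

       ```java
       import java.util.ArrayList;
       import java.util.List;

       // Hypothetical sketch: the byte[] overload delegates without its own
       // resize check, so resizeIfNeeded runs exactly once per add.
       class ResizableTable {
           private final List<String> keys = new ArrayList<>(); // id -> key
           private List<Integer>[] buckets;
           private int threshold;

           ResizableTable(int capacity) {
               buckets = newBuckets(capacity);
               threshold = (int) (capacity * 0.75f);
           }

           @SuppressWarnings("unchecked")
           private static List<Integer>[] newBuckets(int n) {
               List<Integer>[] b = new List[n];
               for (int i = 0; i < n; i++) b[i] = new ArrayList<>();
               return b;
           }

           int add(byte[] bytes) {
               return add(new String(bytes)); // delegate; no resize check here
           }

           int add(String key) {
               resizeIfNeeded();              // single resize check per add
               int idx = (key.hashCode() & 0x7FFFFFFF) % buckets.length;
               for (int id : buckets[idx]) {
                   if (keys.get(id).equals(key)) return id; // existing entry
               }
               int id = keys.size();
               keys.add(key);
               buckets[idx].add(id);
               return id;
           }

           private void resizeIfNeeded() {
               if (keys.size() < threshold) return;
               List<Integer>[] resized = newBuckets(buckets.length * 2 + 1);
               for (List<Integer> bucket : buckets) {
                   for (int id : bucket) {    // rehash every stored id
                       int idx = (keys.get(id).hashCode() & 0x7FFFFFFF)
                           % resized.length;
                       resized[idx].add(id);
                   }
               }
               buckets = resized;
               threshold = (int) (buckets.length * 0.75f);
           }
       }
       ```

       Note the ids stay stable across a resize because only the bucket placement is recomputed, not the id-to-key mapping.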

##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes that overrides the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Obtain the prefix for each string as the hash value.
+     * Every string used in this test suite contains its hash value as a prefix of the string content;
+     * this way we know the traversal order of the traverse() method.
+     */
+    @Override
+    int getIndex(Text text) {

Review comment:
       Shall we have a test where the actual HashFunction is used as well?

##########
File path: java/core/src/test/org/apache/orc/TestStringDictionary.java
##########
@@ -114,10 +117,8 @@ public void testTooManyDistinct() throws Exception {
     }
   }
 
-  @Test
-  public void testHalfDistinct() throws Exception {
+  public void testHalfDistinctHelper(Configuration conf) throws Exception {

Review comment:
       Shall we make this a Parameterized test where @Parameter is going to be the DICT implementations -- other tests with dict encoding could also benefit from this

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review comment:
       explain MAX_ARRAY_SIZE limit and how this relates to Hash MASK below ignoring the first bits
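       For background on the mask in getIndex: Java's hashCode() can be negative and the % operator preserves the sign, so the 0x7FFFFFFF mask clears the sign bit to keep the bucket index in range. A small illustrative snippet (the class and method names are made up for this sketch):

       ```java
       // Demonstrates why the sign bit must be cleared before taking the
       // modulo: a negative hashCode would otherwise yield a negative index.
       public class HashMask {
           static int bucketIndex(int hashCode, int capacity) {
               // always in [0, capacity)
               return (hashCode & 0x7FFFFFFF) % capacity;
           }
       }
       ```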

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();

Review comment:
       Not sure each bucket should be initialized to the default 8K ints -- are we expecting that many collisions?
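       One hypothetical way to address this, sketched here with plain java.util collections rather than the PR's DynamicIntArray: allocate a chain only when its bucket is first hit, so empty buckets cost a single null slot.

       ```java
       import java.util.ArrayList;
       import java.util.List;

       // Hypothetical sketch: lazily allocate a bucket's chain on first use
       // instead of pre-sizing every bucket up front.
       class LazyBucketTable {
           private final List<Integer>[] buckets;

           @SuppressWarnings("unchecked")
           LazyBucketTable(int capacity) {
               buckets = new List[capacity]; // all slots start null: no chains yet
           }

           void put(int slot, int id) {
               if (buckets[slot] == null) {
                   // allocate a small chain only when the slot is first hit
                   buckets[slot] = new ArrayList<>(4);
               }
               buckets[slot].add(id);
           }

           int chainLength(int slot) {
               return buckets[slot] == null ? 0 : buckets[slot].size();
           }
       }
       ```

       With a decent hash function most chains stay short, so a small initial chain size (or none at all) tends to be enough.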

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // No match found in the bucket; append as a new entry.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and map it to a bucket index.
+   * The sign bit is masked off so the modulo result is always non-negative.
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;
+  }
+
+  // Resize the hash table and re-hash all the existing keys;
+  // byteArray and keyOffsets do not have to be rebuilt.
+  private void doResize(int newSize) {
+    DynamicIntArray[] resizedHashArray = new DynamicIntArray[newSize];
+    for (int i = 0; i < newSize; i++) {
+      resizedHashArray[i] = new DynamicIntArray();
+    }
+
+    Text tmpText = new Text();
+    for (int i = 0; i < capacity; i++) {
+      DynamicIntArray intArray = hashArray[i];
+      int bucketSize = intArray.size();
+      if (bucketSize > 0) {
+        for (int j = 0; j < bucketSize; j++) {
+          getText(tmpText, intArray.get(j));
+          int newIndex = getIndex(tmpText);
+          resizedHashArray[newIndex].add(intArray.get(j));
+        }
+      }
+    }
+
+    Arrays.fill(hashArray, null);

Review comment:
       we reassign hashArray below, is this really needed?
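
The doResize path quoted above re-buckets the existing entry ids while leaving the byte storage and the id-to-offset table untouched, so ids handed out before the resize stay valid. A minimal standalone sketch of that idea (hypothetical names; plain Java collections stand in for ORC's DynamicIntArray):

```java
import java.util.ArrayList;
import java.util.List;

public class RehashSketch {
    // Re-bucket every stored id into a table of newSize buckets.
    // The id -> key storage (here a simple List<String>) is not rebuilt,
    // so ids handed out before the resize remain valid afterwards.
    static List<List<Integer>> rehash(List<List<Integer>> oldBuckets,
                                      List<String> keysById, int newSize) {
        List<List<Integer>> fresh = new ArrayList<>();
        for (int i = 0; i < newSize; i++) {
            fresh.add(new ArrayList<>());
        }
        for (List<Integer> bucket : oldBuckets) {
            for (int id : bucket) {
                // Recompute the bucket index with the new capacity.
                int idx = (keysById.get(id).hashCode() & 0x7FFFFFFF) % newSize;
                fresh.get(idx).add(id);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("alice", "bob", "cindy");
        List<List<Integer>> old = new ArrayList<>();
        old.add(new ArrayList<>(List.of(0, 2)));
        old.add(new ArrayList<>(List.of(1)));
        List<List<Integer>> resized = rehash(old, keys, 7);
        int count = 0;
        for (List<Integer> b : resized) {
            count += b.size();
        }
        System.out.println(count); // all three ids survive the resize
    }
}
```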

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A dictionary backed by a hash table. The strings are stored as UTF-8 bytes
+ * with an offset recorded for each entry, and collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never
+ * shrinks, which should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // The starting offset in byteArray for the i-th key; together with
+  // byteArray this stores all of the dictionary's keys.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static final float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size(); i++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // No match found in the bucket; append as a new entry.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and map it to a bucket index.
+   * The sign bit is masked off so the modulo result is always non-negative.
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;
+  }
+
+  // Resize the hash table and re-hash all the existing keys;
+  // byteArray and keyOffsets do not have to be rebuilt.
+  private void doResize(int newSize) {
+    DynamicIntArray[] resizedHashArray = new DynamicIntArray[newSize];
+    for (int i = 0; i < newSize; i++) {
+      resizedHashArray[i] = new DynamicIntArray();
+    }
+
+    Text tmpText = new Text();
+    for (int i = 0; i < capacity; i++) {
+      DynamicIntArray intArray = hashArray[i];
+      int bucketSize = intArray.size();
+      if (bucketSize > 0) {

Review comment:
       unnecessary condition -- already handled by the loop

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A dictionary backed by a hash table. The strings are stored as UTF-8 bytes
+ * with an offset recorded for each entry, and collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never
+ * shrinks, which should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // The starting offset in byteArray for the i-th key; together with
+  // byteArray this stores all of the dictionary's keys.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;

Review comment:
       Let's clarify what each of hashArray, keyOffsets, and byteArray represents.
   It seems hashArray maps a string hash (int) to a list of key offsets -- let's document that and make the variable names more descriptive.
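
As a reading aid for the layout discussed above, here is a hypothetical miniature of the three structures (the names and the plain-Java collections are assumptions; ORC's DynamicByteArray and DynamicIntArray are the real counterparts):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MiniDictionary {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream(); // ~ byteArray: all keys' UTF-8 bytes
    private final List<Integer> offsets = new ArrayList<>();                 // ~ keyOffsets: id -> start offset in bytes
    private final List<List<Integer>> buckets = new ArrayList<>();           // ~ hashArray: bucket -> ids (chaining)

    public MiniDictionary(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    /** Returns the id of the key, inserting it if absent (ids follow insertion order). */
    public int add(String key) {
        int bucket = (key.hashCode() & 0x7FFFFFFF) % buckets.size();
        for (int id : buckets.get(bucket)) {     // walk the chain looking for a match
            if (get(id).equals(key)) {
                return id;
            }
        }
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        int id = offsets.size();
        offsets.add(bytes.size());               // the new key starts where the byte store currently ends
        bytes.write(utf8, 0, utf8.length);
        buckets.get(bucket).add(id);
        return id;
    }

    /** Reads a key back out of the flat byte store using the offsets table. */
    public String get(int id) {
        byte[] all = bytes.toByteArray();
        int start = offsets.get(id);
        int end = (id + 1 < offsets.size()) ? offsets.get(id + 1) : all.length;
        return new String(all, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MiniDictionary d = new MiniDictionary(4);
        System.out.println(d.add("alice")); // first insert gets id 0
        System.out.println(d.add("bob"));   // second insert gets id 1
        System.out.println(d.get(1));       // reads "bob" back out
    }
}
```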

##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes, overriding the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Use each string's numeric prefix as the hash value.
+     * Every string used in this test suite contains its hash value as a prefix,
+     * so we know the visitation order of the traverse() method.
+     */
+    @Override
+    int getIndex(Text text) {
+      String s = text.toString();
+      int underscore = s.indexOf("_");
+      return Integer.parseInt(text.toString().substring(0, underscore));
+    }
+  }
+
+  @Test
+  public void test1()
+      throws Exception {
+    SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+    // Non-resize trivial cases
+    Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+    Text text = new Text();
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+    // entering the fourth and fifth element which triggers rehash
+    Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+    hashTableDictionary.getText(text, 3);
+    Assert.assertEquals("0_David", text.toString());
+    Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+    hashTableDictionary.getText(text, 4);
+    Assert.assertEquals("4_Eason", text.toString());
+
+    // Re-check that all previously inserted strings still map to the correct encoded values.
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+
+    // The order of words follows each string's numeric prefix, since that prefix determines its index in the hashArray.
+    TestStringRedBlackTree
+        .checkContents(hashTableDictionary, new int[]{3, 2, 0, 1, 4}, "0_David", "1_Cindy", "2_Alice", "3_Bob",
+            "4_Eason");
+

Review comment:
       Should we also ensure that the sizes of both the hash-table and red-black-tree implementations are the same?
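
The checkContents expectation in the test above relies on traverse() visiting entries bucket by bucket. A small standalone sketch of that visitation order (names assumed; the numeric-prefix hashing mirrors the test's overridden getIndex):

```java
import java.util.ArrayList;
import java.util.List;

public class TraversalOrderSketch {
    /** Buckets strings by their numeric prefix, then flattens bucket by bucket. */
    static List<String> visitOrder(String[] inserts, int capacity) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String s : inserts) {
            int idx = Integer.parseInt(s.substring(0, s.indexOf('_')));
            buckets.get(idx).add(s);   // chaining: append within the bucket
        }
        List<String> order = new ArrayList<>();
        for (List<String> bucket : buckets) {
            order.addAll(bucket);      // traverse() walks buckets in index order
        }
        return order;
    }

    public static void main(String[] args) {
        String[] inserts = {"2_Alice", "3_Bob", "1_Cindy", "0_David", "4_Eason"};
        // Prints prefix order, not insertion order:
        System.out.println(visitOrder(inserts, 5));
        // -> [0_David, 1_Cindy, 2_Alice, 3_Bob, 4_Eason]
    }
}
```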

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A dictionary backed by a hash table. The strings are stored as UTF-8 bytes
+ * with an offset recorded for each entry, and collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never
+ * shrinks, which should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // The starting offset in byteArray for the i-th key; together with
+  // byteArray this stores all of the dictionary's keys.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static final float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size(); i++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // No match found in the bucket; append as a new entry.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and map it to a bucket index.
+   * The sign bit is masked off so the modulo result is always non-negative.
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;

Review comment:
       Let's add a comment explaining the masking.
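
For context on the mask: Java's % operator keeps the sign of the dividend, so a negative hashCode would yield a negative (invalid) bucket index. A quick standalone illustration (hypothetical names):

```java
public class MaskingDemo {
    // Clear the sign bit before the modulo so the bucket index is always in [0, capacity).
    static int bucket(int hashCode, int capacity) {
        return (hashCode & 0x7FFFFFFF) % capacity;
    }

    public static void main(String[] args) {
        int negativeHash = Integer.MIN_VALUE + 7;      // 0x80000007, definitely negative
        System.out.println(negativeHash % 16);         // -9: % follows the dividend's sign
        System.out.println(bucket(negativeHash, 16));  // 7: the mask keeps only the low 31 bits
    }
}
```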

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A dictionary backed by a hash table. The strings are stored as UTF-8 bytes
+ * with an offset recorded for each entry, and collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never
+ * shrinks, which should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // The starting offset in byteArray for the i-th key; together with
+  // byteArray this stores all of the dictionary's keys.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static final float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size(); i++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);

Review comment:
       why are we moving text to front here?

##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes, overriding the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Use each string's numeric prefix as the hash value.
+     * Every string used in this test suite contains its hash value as a prefix,
+     * so we know the visitation order of the traverse() method.
+     */
+    @Override
+    int getIndex(Text text) {
+      String s = text.toString();
+      int underscore = s.indexOf("_");
+      return Integer.parseInt(text.toString().substring(0, underscore));
+    }
+  }
+
+  @Test
+  public void test1()
+      throws Exception {
+    SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+    // Non-resize trivial cases
+    Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+

Review comment:
       check bytes after additions?

##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+  /**
+   * An extension of {@link StringHashTableDictionary} for testing purposes, overriding the hash function.
+   *
+   */
+  private static class SimpleHashDictionary extends StringHashTableDictionary {
+    public SimpleHashDictionary(int initialCapacity) {
+      super(initialCapacity);
+    }
+
+    /**
+     * Use each string's numeric prefix as the hash value.
+     * Every string used in this test suite contains its hash value as a prefix,
+     * so we know the visitation order of the traverse() method.
+     */
+    @Override
+    int getIndex(Text text) {
+      String s = text.toString();
+      int underscore = s.indexOf("_");
+      return Integer.parseInt(text.toString().substring(0, underscore));
+    }
+  }
+
+  @Test
+  public void test1()
+      throws Exception {
+    SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+    // Non-resize trivial cases
+    Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+    Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+    Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+    Text text = new Text();
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+    // entering the fourth and fifth element which triggers rehash
+    Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+    hashTableDictionary.getText(text, 3);
+    Assert.assertEquals("0_David", text.toString());
+    Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+    hashTableDictionary.getText(text, 4);
+    Assert.assertEquals("4_Eason", text.toString());
+
+    // Re-check that all previously inserted strings still map to the correct encoded values.
+    hashTableDictionary.getText(text, 0);
+    Assert.assertEquals("2_Alice", text.toString());
+    hashTableDictionary.getText(text, 1);
+    Assert.assertEquals("3_Bob", text.toString());
+    hashTableDictionary.getText(text, 2);
+    Assert.assertEquals("1_Cindy", text.toString());
+
+
+    // The order of words follows each string's numeric prefix, since that prefix determines its index in the hashArray.
+    TestStringRedBlackTree

Review comment:
       This is a bit confusing -- I would just copy the checkContents method here, or create a utility class if we want to reuse it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-826976373


   > Updated the PR. @omalley @pgaref please take another look, thanks.
   
   Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
   Left some mostly minor comments, let me know what you think! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun edited a comment on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854805204


   Hi, @pgaref . Is this merged? Could you check GitHub Action status?
   - https://github.com/apache/orc/actions/workflows/build_and_test.yml?query=branch%3Amain
   
   <img width="490" alt="Screen Shot 2021-06-04 at 8 44 43 AM" src="https://user-images.githubusercontent.com/9700541/120828249-27eea800-c511-11eb-9bd8-eecde81a61a8.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-828353933


   > Here is the new jmh result:
   > 
   > ```
   > Benchmark                                    (dictImpl)  (upperBound)  Mode  Cnt       Score      Error  Units
   > ORCWriterBenchMark.dictBench                     RBTREE         10000  avgt    5   28939.068 ± 3080.947  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE         10000  avgt    5      49.963                 #
   > ORCWriterBenchMark.dictBench:ops                 RBTREE         10000  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord           RBTREE         10000  avgt    5       0.883 ±    0.094  us/op
   > ORCWriterBenchMark.dictBench:records             RBTREE         10000  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                     RBTREE          2500  avgt    5   21998.781 ± 1448.300  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE          2500  avgt    5      23.532                 #
   > ORCWriterBenchMark.dictBench:ops                 RBTREE          2500  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord           RBTREE          2500  avgt    5       0.671 ±    0.044  us/op
   > ORCWriterBenchMark.dictBench:records             RBTREE          2500  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                     RBTREE           500  avgt    5   17730.281 ± 4574.132  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE           500  avgt    5      13.156                 #
   > ORCWriterBenchMark.dictBench:ops                 RBTREE           500  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord           RBTREE           500  avgt    5       0.541 ±    0.140  us/op
   > ORCWriterBenchMark.dictBench:records             RBTREE           500  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                       HASH         10000  avgt    5   21269.613 ± 4137.763  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord        HASH         10000  avgt    5      42.268                 #
   > ORCWriterBenchMark.dictBench:ops                   HASH         10000  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord             HASH         10000  avgt    5       0.649 ±    0.126  us/op
   > ORCWriterBenchMark.dictBench:records               HASH         10000  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                       HASH          2500  avgt    5   11586.898 ± 4075.783  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord        HASH          2500  avgt    5      17.692                 #
   > ORCWriterBenchMark.dictBench:ops                   HASH          2500  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord             HASH          2500  avgt    5       0.354 ±    0.124  us/op
   > ORCWriterBenchMark.dictBench:records               HASH          2500  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                       HASH           500  avgt    5    9646.080 ± 2279.530  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord        HASH           500  avgt    5      11.613                 #
   > ORCWriterBenchMark.dictBench:ops                   HASH           500  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord             HASH           500  avgt    5       0.294 ±    0.070  us/op
   > ORCWriterBenchMark.dictBench:records               HASH           500  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                       NONE         10000  avgt    5    4077.675 ±  117.606  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord        NONE         10000  avgt    5      50.146                 #
   > ORCWriterBenchMark.dictBench:ops                   NONE         10000  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord             NONE         10000  avgt    5       0.124 ±    0.004  us/op
   > ORCWriterBenchMark.dictBench:records               NONE         10000  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                       NONE          2500  avgt    5    4607.634 ± 1163.084  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord        NONE          2500  avgt    5      50.146                 #
   > ORCWriterBenchMark.dictBench:ops                   NONE          2500  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord             NONE          2500  avgt    5       0.141 ±    0.035  us/op
   > ORCWriterBenchMark.dictBench:records               NONE          2500  avgt    5  163840.000                 #
   > ORCWriterBenchMark.dictBench                       NONE           500  avgt    5    3783.059 ±  367.511  us/op
   > ORCWriterBenchMark.dictBench:bytesPerRecord        NONE           500  avgt    5      50.146                 #
   > ORCWriterBenchMark.dictBench:ops                   NONE           500  avgt    5         ≈ 0                 #
   > ORCWriterBenchMark.dictBench:perRecord             NONE           500  avgt    5       0.115 ±    0.011  us/op
   > ORCWriterBenchMark.dictBench:records               NONE           500  avgt    5  163840.000                 #
   > ```
   > 
   > Unfortunately the previous implementation had a bug which ended up with great locality (but incorrect results). HASH is still much better than RB-Tree, but obviously we need to iterate further to improve it.
   
   Sounds like this can be improved by reducing collisions, right? Since we have an upper bound on the number of entries per batch, would it make sense to experiment a bit more with HT sizes to reduce collisions as much as possible?
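
   The sizing idea suggested here can be sketched roughly as follows (a minimal illustration, not code from the PR; the helper name and the power-of-two rounding are assumptions). Given a known upper bound on distinct entries, choosing the initial capacity as the next power of two at or above `upperBound / loadFactor` keeps the table below its load factor and the expected chain length near one:

   ```java
   // Sketch: derive an initial hash-table capacity from a known upper bound
   // on distinct entries, so the table never crosses its load factor and
   // chains stay short. Names here are illustrative only.
   public final class CapacitySizing {
     static final float DEFAULT_LOAD_FACTOR = 0.75f;

     /** Smallest power of two >= upperBound / loadFactor. */
     static int initialCapacity(int upperBound, float loadFactor) {
       int needed = (int) Math.ceil(upperBound / loadFactor);
       int cap = 1;
       while (cap < needed) {
         cap <<= 1;
       }
       return cap;
     }

     public static void main(String[] args) {
       // With 10000 distinct keys and a 0.75 load factor, at least
       // ceil(10000 / 0.75) = 13334 slots are needed.
       System.out.println(initialCapacity(10000, DEFAULT_LOAD_FACTOR)); // 16384
       System.out.println(initialCapacity(500, DEFAULT_LOAD_FACTOR));   // 1024
     }
   }
   ```

   Sized this way, the benchmark's `upperBound` parameter (500, 2500, 10000) would translate directly into a per-batch table size, avoiding both resizes and long chains.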





[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621799552



##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A dictionary implementation backed by a hash table. The strings are stored as UTF-8 bytes
+ * along with an offset for each entry, and collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never shrinks,
+ * since that should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  // Contains the UTF-8 bytes of every key ever seen.
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // Contains the starting offset (in bytes) of each key in the byte array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashBuckets;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static final float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int BUCKET_SIZE = 20;

Review comment:
       It is just randomly picked ... 
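
   For context, the chaining scheme the class javadoc describes can be sketched as below (a simplified standalone illustration with hypothetical names; it uses a plain `List<String>` in place of the PR's `DynamicByteArray`/`DynamicIntArray` pair, and omits resizing):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Simplified sketch of a chained hash dictionary over strings: each bucket
   // holds indices into a shared key list, mirroring how the PR's buckets hold
   // offsets into a shared byte array. Not the actual ORC implementation.
   public final class ChainedDictSketch {
     private final List<String> keys = new ArrayList<>();
     private final List<Integer>[] buckets;

     @SuppressWarnings("unchecked")
     public ChainedDictSketch(int capacity) {
       buckets = new List[capacity];
       for (int i = 0; i < capacity; i++) {
         buckets[i] = new ArrayList<>();
       }
     }

     /** Returns the dictionary id for key, adding it if unseen. */
     public int add(String key) {
       int bucket = (key.hashCode() & 0x7fffffff) % buckets.length;
       for (int id : buckets[bucket]) {          // walk the collision chain
         if (keys.get(id).equals(key)) {
           return id;                            // key already in dictionary
         }
       }
       int id = keys.size();                     // next dictionary id
       keys.add(key);
       buckets[bucket].add(id);
       return id;
     }

     public int size() {
       return keys.size();
     }
   }
   ```

   In this sketch `add` returns a stable id per distinct string, so repeated adds of the same key return the same id; the fixed bucket count plays the role the `BUCKET_SIZE`/capacity choice discussed above plays in the PR.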



