Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/03/09 23:28:56 UTC
[GitHub] [orc] autumnust opened a new pull request #651: [ORC-757] HashTable dictionary
autumnust opened a new pull request #651:
URL: https://github.com/apache/orc/pull/651
This PR:
- Adds a straightforward hash-table implementation of `Dictionary`.
- Refactors the RB-Tree to make code reusable, e.g. `VisitorContextImpl`, moving it into `DictionaryUtils.java` so it is sharable between different implementations of the `Dictionary` interface.
- Enables the hash-based dictionary in existing tests that use dictionary encoding.
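The chaining approach described above can be sketched roughly as follows. This is a hypothetical, simplified stand-in: plain Java collections replace the PR's `DynamicByteArray`/`DynamicIntArray`, and the class name is invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a chained hash-table dictionary (not the PR's code):
// keys are kept in insertion order so their list index doubles as the
// dictionary id; each bucket holds a chain of key ids.
public class SimpleChainedDictionary {
    private final List<String> keys = new ArrayList<>();   // entries in insertion order
    private final List<Integer>[] buckets;                 // chain of key ids per bucket

    @SuppressWarnings("unchecked")
    public SimpleChainedDictionary(int bucketCount) {
        buckets = new List[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            buckets[i] = new ArrayList<>();
        }
    }

    /** Returns the dictionary id of the key, adding it on first sight. */
    public int add(byte[] bytes) {
        String key = new String(bytes, StandardCharsets.UTF_8);
        int bucket = (key.hashCode() & Integer.MAX_VALUE) % buckets.length;
        for (int id : buckets[bucket]) {          // walk the chain looking for a match
            if (keys.get(id).equals(key)) {
                return id;                        // already present: reuse its id
            }
        }
        int id = keys.size();                     // new entry: append and index it
        keys.add(key);
        buckets[bucket].add(id);
        return id;
    }

    public int size() {
        return keys.size();
    }
}
```

The real implementation additionally tracks a load factor and resizes, which the diff hunks below show.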
<!--
Thanks for sending a pull request! Here are some tips for you:
1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
2. Use your PR title to summarize what this PR proposes instead of describing the problem.
3. Make PR title and description complete because these will be the permanent commit log.
4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
5. If the PR is unfinished, use GitHub PR Draft feature.
-->
### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
2. If there is a discussion in the mailing list, please add the link.
-->
### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, you can clarify why it is a bug.
-->
We found the RB-Tree based dictionary implementation to be slow in our production workload. A performance comparison for the new hash-table based implementation will be done as part of ORC-50.
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
Mostly tested with added unit tests for the hash table, and by enabling the hash-table based dictionary in some of the existing tests.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832850321
> > > Updated the PR. @omalley @pgaref please take another look, thanks.
> >
> >
> > Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
> > Left some mostly minor comments, let me know what you think!
>
> PR seems in pretty good shape already, @autumnust anything to polish/benchmark before going in? What's the plan?
@pgaref thanks for getting back. I was trying to finalize the bucketSize setting and tune things a bit more. Will publish a new benchmark later today.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613585262
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size() ; i ++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ Arrays.fill(hashArray, null);
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
Review comment:
Thanks for catching, removed.
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620451284
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ // containing all keys every seen in bytes.
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // containing starting offset of the key (in byte) in the byte array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashBuckets;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int BUCKET_SIZE = 20;
Review comment:
How did we come up with 20 here?
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-833094987
Also tried putting the init dict size as a parameter in the benchmark:
```
Benchmark (dictImpl) (initSize) (upperBound) Mode Cnt Score Error Units
ORCWriterBenchMark.dictBench RBTREE 4096 10000 avgt 5 40363.137 ± 28516.301 us/op
ORCWriterBenchMark.dictBench RBTREE 4096 2500 avgt 5 24392.270 ± 8096.477 us/op
ORCWriterBenchMark.dictBench RBTREE 4096 500 avgt 5 20316.419 ± 2917.176 us/op
ORCWriterBenchMark.dictBench RBTREE 8192 10000 avgt 5 33112.917 ± 4040.524 us/op
ORCWriterBenchMark.dictBench RBTREE 8192 2500 avgt 5 27269.018 ± 1593.083 us/op
ORCWriterBenchMark.dictBench RBTREE 8192 500 avgt 5 21326.469 ± 1373.140 us/op
ORCWriterBenchMark.dictBench RBTREE 10240 10000 avgt 5 33538.635 ± 5781.321 us/op
ORCWriterBenchMark.dictBench RBTREE 10240 2500 avgt 5 27548.238 ± 3307.047 us/op
ORCWriterBenchMark.dictBench RBTREE 10240 500 avgt 5 21053.961 ± 3118.891 us/op
**ORCWriterBenchMark.dictBench HASH 4096 10000 avgt 5 19614.326 ± 1304.425 us/op**
ORCWriterBenchMark.dictBench HASH 4096 2500 avgt 5 11529.653 ± 1851.182 us/op
ORCWriterBenchMark.dictBench HASH 4096 500 avgt 5 9108.816 ± 837.123 us/op
**ORCWriterBenchMark.dictBench HASH 8192 10000 avgt 5 15611.967 ± 583.613 us/op**
ORCWriterBenchMark.dictBench HASH 8192 2500 avgt 5 13396.318 ± 3114.460 us/op
ORCWriterBenchMark.dictBench HASH 8192 500 avgt 5 10742.425 ± 1031.070 us/op
**ORCWriterBenchMark.dictBench HASH 10240 10000 avgt 5 17044.101 ± 1671.182 us/op**
ORCWriterBenchMark.dictBench HASH 10240 2500 avgt 5 13767.572 ± 196.728 us/op
ORCWriterBenchMark.dictBench HASH 10240 500 avgt 5 11120.604 ± 305.075 us/op
ORCWriterBenchMark.dictBench NONE 4096 10000 avgt 5 4327.766 ± 1747.765 us/op
ORCWriterBenchMark.dictBench NONE 4096 2500 avgt 5 4390.480 ± 2545.236 us/op
ORCWriterBenchMark.dictBench NONE 4096 500 avgt 5 12315.912 ± 8071.684 us/op
ORCWriterBenchMark.dictBench NONE 8192 10000 avgt 5 5529.683 ± 4187.802 us/op
ORCWriterBenchMark.dictBench NONE 8192 2500 avgt 5 5461.490 ± 914.698 us/op
ORCWriterBenchMark.dictBench NONE 8192 500 avgt 5 4745.401 ± 1097.454 us/op
ORCWriterBenchMark.dictBench NONE 10240 10000 avgt 5 4734.983 ± 257.299 us/op
ORCWriterBenchMark.dictBench NONE 10240 2500 avgt 5 4776.043 ± 690.286 us/op
ORCWriterBenchMark.dictBench NONE 10240 500 avgt 5 4750.625 ± 440.191 us/op
```
Obviously it is not the case that a larger init size always means better performance (considering the cost of resizing and the metadata-size overhead it brings). Again, the right size depends on the typical record size seen by the writer (which, together with the stripe size, determines the number of entries within a stripe). From the benchmark above, enlarging the size is only beneficial when collisions are common (e.g. in the `HASH` case with the larger upper bound of 10k), and even then the benefit isn't consistent across different values of `upperBound`.
I am therefore adding `initSize` as a config in `OrcConf` and keeping the default as the original.
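The resize trigger discussed here (grow once the entry count crosses capacity × load factor) can be sketched as follows. This is a hypothetical helper written for illustration, not the PR's actual code, though the `MAX_ARRAY_SIZE` clamp mirrors the constant in the diff.

```java
// Hypothetical sketch of the load-factor/threshold interplay:
// the table doubles once the entry count reaches capacity * loadFactor.
public class ResizePolicy {
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    final float loadFactor;
    int capacity;
    int threshold;

    ResizePolicy(int initialCapacity, float loadFactor) {
        this.capacity = initialCapacity;
        this.loadFactor = loadFactor;
        this.threshold = (int) Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1L);
    }

    /** Doubles the capacity and returns true when size reaches the threshold. */
    boolean resizeIfNeeded(int size) {
        if (size >= threshold) {
            capacity = capacity * 2;
            threshold = (int) Math.min(capacity * loadFactor, MAX_ARRAY_SIZE + 1L);
            return true;
        }
        return false;
    }
}
```

With the default 0.75 load factor and a 4096 init size, the first resize would land at 3072 entries, which is why the init size matters mostly when the distinct-value count (the `upperBound` parameter above) is large.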
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613637335
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
Review comment:
yeah given Owen's recommendation on using linear-probing, I will see if I could abstract this class a bit so that adding another collision-resolution impl. will be simpler.
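The linear-probing alternative mentioned here differs from chaining by storing a single key id per slot and scanning forward on collision. A hypothetical sketch (invented class name, simplified types, and no resizing, so the table must stay under-full):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of linear probing as a collision-resolution strategy:
// one key id per slot; on collision, probe the next slot until an empty
// slot or a matching key is found.
public class LinearProbingDictionary {
    private final List<String> keys = new ArrayList<>();
    private final int[] slots;                    // -1 = empty, otherwise a key id

    public LinearProbingDictionary(int capacity) {
        slots = new int[capacity];
        Arrays.fill(slots, -1);
    }

    /** Returns the id of the key, adding it on first sight. Assumes the table never fills. */
    public int add(String key) {
        int i = (key.hashCode() & Integer.MAX_VALUE) % slots.length;
        while (slots[i] != -1) {                  // probe forward until empty or match
            if (keys.get(slots[i]).equals(key)) {
                return slots[i];
            }
            i = (i + 1) % slots.length;
        }
        int id = keys.size();
        keys.add(key);
        slots[i] = id;
        return id;
    }
}
```

Compared with the per-bucket `DynamicIntArray` chains in this PR, probing avoids allocating an array per bucket at the cost of clustering under high load.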
[GitHub] [orc] autumnust edited a comment on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832850321
> > > Updated the PR. @omalley @pgaref please take another look, thanks.
> >
> >
> > Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
> > Left some mostly minor comments, let me know what you think!
>
> PR seems in pretty good shape already, @autumnust anything to polish/benchmark before going in? What's the plan?
@pgaref thanks for getting back. I was trying to finalize the bucketSize setting and tune things a bit more. Will publish a new benchmark later today. Beyond this PR, I am planning to try linear probing as the collision-resolution method and benchmark it against the current implementation to see if there's room for improvement.
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620449979
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
Review comment:
Not sure I understand the size reduction comment
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613598496
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
Review comment:
Let me wrap up the benchmark for this version and we could then compare the linear-probing approach. But yeah I agree that linear probing seems more promising.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613636689
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
Review comment:
Shrink it to a smaller bucket in the current chaining implementation.
[GitHub] [orc] pgaref edited a comment on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-842585252
> Gentle ping ... @pgaref
Hey @autumnust thanks for pinging me, got distracted by various tasks last week!
Also thanks for updating the benchmark, this helps a lot. Some comments:
- I would expect the NONE dictImpl bench results to be identical across runs, so why is the 4096 x 500 run 3x the others?
- Seems like an init size of 4k could be a good default conf in most cases -- would it make sense to remove the extra conf completely if we can avoid it?
- Minor: I would rename upperBound to distinctCount
Let me know what you think
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626174029
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ // containing all keys every seen in bytes.
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // containing starting offset of the key (in byte) in the byte array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashBuckets;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int BUCKET_SIZE = 20;
Review comment:
Can you elaborate a bit more on how 5/6 is deduced? The particular point I am confused about is its relation to the batch size, since dictionary size is more associated with stripe size, right?
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-824554472
Updated the PR. @omalley @pgaref please take another look, thanks.
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-820145038
Thanks @pgaref @omalley both for your comments! Please kindly take a second look at some of the responses. The resolved conversations I've already addressed locally; I will post those changes together with all the other comments once they are settled.
After that I am planning to add the benchmark as part of this PR, and add a linear-probing version for comparison (probably in another PR).
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-848350035
> > Gentle ping ... @pgaref
>
> Hey @autumnust thanks for pinging me, got distracted by various tasks last week!
> Also thanks for updating the benchmark, this helps a lot. Some comments:
>
> * I would expect the NONE dictImpl bench results to be identical across runs, so why is the 4096 x 500 run 3x the others?
> * Seems like an init size of 4k could be a good default conf in most cases -- would it make sense to remove the extra conf completely if we can avoid it?
> * Minor: I would rename upperBound to distinctCount
>
> Let me know what you think
Hi @pgaref, thanks for getting back to me. Yeah, the 3x number in the 4096 x 500 case is weird, as I am no longer able to reproduce it. Not sure if it is just noise, since I am running this benchmark locally.
I updated the PR for the rest of the comments; they all make sense to me, especially the point about keeping the configuration minimal (one of the other projects I am working on has this problem, and it results in a lot of confusion for beginner users).
[GitHub] [orc] dongjoon-hyun commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854805204
Hi, @pgaref. Is this merged? Could you check the GitHub Actions status?
- https://github.com/apache/orc/actions/workflows/build_and_test.yml?query=branch%3Amain
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r598714046
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
Review comment:
As a broader note, would it make sense to move the base functionality to an abstract HT class as we do for RedBlackTree?
[GitHub] [orc] omalley commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r600844787
##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+ private DictionaryUtils() {
+ // Utility class does nothing in constructor
+ }
+
+ public static void getTextInternal(Text result, int position, DynamicIntArray keyOffsets, DynamicByteArray byteArray) {
+ int offset = keyOffsets.get(position);
+ int length;
+ if (position + 1 == keyOffsets.size()) {
+ length = byteArray.size() - offset;
+ } else {
+ length = keyOffsets.get(position + 1) - offset;
+ }
+ byteArray.setText(result, offset, length);
+ }
+
+ static class VisitorContextImpl implements Dictionary.VisitorContext {
Review comment:
Sorry, I left off the context of why. We could pull the common code into a single class between the red-black tree and hash table.
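As a rough sketch of the shared piece (class and method names here are hypothetical, not the ORC API): the common code both the red-black tree and the hash table rely on is an append-only byte buffer plus per-entry start offsets, which is exactly what `DictionaryUtils.getTextInternal` reads from — the i-th entry spans from its own offset to the next entry's offset, or to the end of the buffer for the last entry:

```java
// Hypothetical sketch of the shared byte storage for dictionary entries.
// Strings are appended to one byte buffer and located by start offsets.
public class ByteStore {
    private byte[] bytes = new byte[64];
    private int used = 0;
    private int[] offsets = new int[8];
    private int count = 0;

    // Appends an entry and returns its id (its position in insert order).
    public int add(byte[] data) {
        if (count == offsets.length) {
            offsets = java.util.Arrays.copyOf(offsets, count * 2);
        }
        while (used + data.length > bytes.length) {
            bytes = java.util.Arrays.copyOf(bytes, bytes.length * 2);
        }
        offsets[count] = used;
        System.arraycopy(data, 0, bytes, used, data.length);
        used += data.length;
        return count++;
    }

    // Mirrors the getTextInternal logic: length is the gap to the next
    // offset, or to the end of the buffer for the final entry.
    public String get(int i) {
        int start = offsets[i];
        int end = (i + 1 == count) ? used : offsets[i + 1];
        return new String(bytes, start, end - start,
            java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteStore store = new ByteStore();
        int a = store.add("Alice".getBytes(java.nio.charset.StandardCharsets.UTF_8));
        int b = store.add("Bob".getBytes(java.nio.charset.StandardCharsets.UTF_8));
        System.out.println(store.get(a) + ", " + store.get(b)); // prints "Alice, Bob"
    }
}
```

In the PR, this role is played by `DynamicByteArray` plus the `keyOffsets` `DynamicIntArray`; only the index structure (tree vs. hash buckets) differs between the two implementations.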
[GitHub] [orc] dongjoon-hyun closed pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #651:
URL: https://github.com/apache/orc/pull/651
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613780163
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size() ; i ++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ Arrays.fill(hashArray, null);
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
+ newKey.set(bytes, offset, length);
+ return add(newKey);
+ }
+
+ public int add(Text text) {
+ resizeIfNeeded();
+
+ int index = getIndex(text);
+ DynamicIntArray candidateArray = hashArray[index];
+
+ newKey.set(text);
Review comment:
Just to confirm: are you asking whether the line `newKey.set(bytes, offset, length);` in `add(byte[] bytes, int offset, int length)` can be removed, since `newKey` will be set here anyway?
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621720755
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
Review comment:
Removed this comment, as it was meant for something that is no longer valid.
[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832534534
> > Updated the PR. @omalley @pgaref please take another look, thanks.
>
> Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
> Left some mostly minor comments, let me know what you think!
The PR seems in pretty good shape already. @autumnust, anything to polish/benchmark before it goes in? What's the plan?
[GitHub] [orc] dongjoon-hyun commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854906883
Thanks, @pgaref !
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620453553
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+ /**
+ * An extension of {@link StringHashTableDictionary} for testing purposes that overrides the hash function.
+ *
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Obtain the prefix for each string as the hash value.
+ * All the string being used in this test suite will contains its hash value as the prefix for the string content.
+ * this way we know the order of the traverse() method.
+ */
+ @Override
+ int getIndex(Text text) {
+ String s = text.toString();
+ int underscore = s.indexOf("_");
+ return Integer.parseInt(text.toString().substring(0, underscore));
+ }
+ }
+
+ @Test
+ public void test1()
+ throws Exception {
+ SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+ // Non-resize trivial cases
+ Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+ Text text = new Text();
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+ // entering the fourth and fifth element which triggers rehash
+ Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+ hashTableDictionary.getText(text, 3);
+ Assert.assertEquals("0_David", text.toString());
+ Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+ hashTableDictionary.getText(text, 4);
+ Assert.assertEquals("4_Eason", text.toString());
+
+ // Re-check that all previously inserted strings still have the correct encoded values
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+
+ // The order of the words is based on each string's prefix, since their index in the hashArray is derived from it.
+ TestStringRedBlackTree
+ .checkContents(hashTableDictionary, new int[]{3, 2, 0, 1, 4}, "0_David", "1_Cindy", "2_Alice", "3_Bob",
+ "4_Eason");
+
Review comment:
Just for correctness: making sure that both dictionaries write the same strings -- you are doing the same thing above, so it's not that important.
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620443005
##########
File path: java/bench/core/src/java/org/apache/orc/bench/core/IOCounters.java
##########
@@ -48,23 +50,26 @@ public void print() {
if (recordCounters != null) {
recordCounters.print();
}
- System.out.println("Reads: " + reads);
- System.out.println("Bytes: " + bytesRead);
+ System.out.println("io: " + io);
+ System.out.println("Bytes: " + bytesIO);
}
public double bytesPerRecord() {
return recordCounters == null || recordCounters.records == 0 ?
- 0 : ((double) bytesRead) / recordCounters.records;
+ 0 : ((double) bytesIO) / recordCounters.records;
}
public long records() {
return recordCounters == null || recordCounters.invocations == 0 ?
0 : recordCounters.records / recordCounters.invocations;
}
- public long reads() {
+ /**
+ * Capture the number of I/O on average in each invocation.
+ */
+ public long iOs() {
Review comment:
IOps?
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-842536951
Gentle ping ... @pgaref
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-837051398
@pgaref @omalley Gentle ping for the review. Please let me know if there's more comments, thanks !
[GitHub] [orc] omalley commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
omalley commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-806242027
Of course, it would be really good to have a benchmark in the benchmark module that lets us test the two implementations.
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r622051809
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
Review comment:
Let's add another comment here, please.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613611182
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size() ; i ++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ Arrays.fill(hashArray, null);
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
+ newKey.set(bytes, offset, length);
+ return add(newKey);
+ }
+
+ public int add(Text text) {
+ resizeIfNeeded();
+
+ int index = getIndex(text);
+ DynamicIntArray candidateArray = hashArray[index];
+
+ newKey.set(text);
+
+ Text tmpText = new Text();
+ for (int i = 0; i < candidateArray.size(); i++) {
+ getText(tmpText, candidateArray.get(i));
+ if (tmpText.equals(newKey)) {
+ return candidateArray.get(i);
+ }
+ }
+
+ // if making it here, it means no match.
+ int len = newKey.getLength();
+ int currIdx = keyOffsets.size();
+ keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+ candidateArray.add(currIdx);
+ return currIdx;
+ }
+
+ private void resizeIfNeeded() {
+ if (keyOffsets.size() >= threshold) {
+ int oldCapacity = keyOffsets.size();
+ int newCapacity = (oldCapacity << 1) + 1;
+ doResize(newCapacity);
+ this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+ }
+
+ @Override
+ public int size() {
+ return keyOffsets.size();
+ }
+
+ /**
+ * Compute the hash value and find the corresponding index.
+ *
+ */
+ int getIndex(Text text) {
+ return (text.hashCode() & 0x7FFFFFFF) % capacity;
Review comment:
@pgaref it was meant to clear the sign bit so that the dividend is always non-negative. But yeah, I will take Owen's suggestion and use the Math library instead for better readability.
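For reference, a small sketch contrasting the two approaches being discussed here: the sign-bit mask currently in the PR and the `Math.floorMod` alternative. This is illustrative code only, not the ORC implementation:

```java
public class IndexDemo {
    // Mask trick from the PR: clear the sign bit, then take the remainder.
    static int maskIndex(int hash, int capacity) {
        return (hash & 0x7FFFFFFF) % capacity;
    }

    // Suggested alternative: floorMod is always non-negative for a positive divisor.
    static int floorModIndex(int hash, int capacity) {
        return Math.floorMod(hash, capacity);
    }

    public static void main(String[] args) {
        int capacity = 7;
        for (int h : new int[]{42, -42, Integer.MIN_VALUE, Integer.MAX_VALUE}) {
            System.out.println(h + " -> mask=" + maskIndex(h, capacity)
                + ", floorMod=" + floorModIndex(h, capacity));
        }
    }
}
```

Both map any `hashCode` into `[0, capacity)`; for negative hash codes and a non-power-of-two capacity the two can pick different (but equally valid) buckets.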
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626906494
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
Review comment:
Will address.
[GitHub] [orc] autumnust edited a comment on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-833094987
Also tried putting the init dict size as a parameter in the benchmark:
<pre>Benchmark (dictImpl) (initSize) (upperBound) Mode Cnt Score Error Units
ORCWriterBenchMark.dictBench RBTREE 4096 10000 avgt 5 40363.137 ± 28516.301 us/op
ORCWriterBenchMark.dictBench RBTREE 4096 2500 avgt 5 24392.270 ± 8096.477 us/op
ORCWriterBenchMark.dictBench RBTREE 4096 500 avgt 5 20316.419 ± 2917.176 us/op
ORCWriterBenchMark.dictBench RBTREE 8192 10000 avgt 5 33112.917 ± 4040.524 us/op
ORCWriterBenchMark.dictBench RBTREE 8192 2500 avgt 5 27269.018 ± 1593.083 us/op
ORCWriterBenchMark.dictBench RBTREE 8192 500 avgt 5 21326.469 ± 1373.140 us/op
ORCWriterBenchMark.dictBench RBTREE 10240 10000 avgt 5 33538.635 ± 5781.321 us/op
ORCWriterBenchMark.dictBench RBTREE 10240 2500 avgt 5 27548.238 ± 3307.047 us/op
ORCWriterBenchMark.dictBench RBTREE 10240 500 avgt 5 21053.961 ± 3118.891 us/op
<b>ORCWriterBenchMark.dictBench HASH 4096 10000 avgt 5 19614.326 ± 1304.425 us/op</b>
ORCWriterBenchMark.dictBench HASH 4096 2500 avgt 5 11529.653 ± 1851.182 us/op
ORCWriterBenchMark.dictBench HASH 4096 500 avgt 5 9108.816 ± 837.123 us/op
<b>ORCWriterBenchMark.dictBench HASH 8192 10000 avgt 5 15611.967 ± 583.613 us/op</b>
ORCWriterBenchMark.dictBench HASH 8192 2500 avgt 5 13396.318 ± 3114.460 us/op
ORCWriterBenchMark.dictBench HASH 8192 500 avgt 5 10742.425 ± 1031.070 us/op
<b>ORCWriterBenchMark.dictBench HASH 10240 10000 avgt 5 17044.101 ± 1671.182 us/op</b>
ORCWriterBenchMark.dictBench HASH 10240 2500 avgt 5 13767.572 ± 196.728 us/op
ORCWriterBenchMark.dictBench HASH 10240 500 avgt 5 11120.604 ± 305.075 us/op
ORCWriterBenchMark.dictBench NONE 4096 10000 avgt 5 4327.766 ± 1747.765 us/op
ORCWriterBenchMark.dictBench NONE 4096 2500 avgt 5 4390.480 ± 2545.236 us/op
ORCWriterBenchMark.dictBench NONE 4096 500 avgt 5 12315.912 ± 8071.684 us/op
ORCWriterBenchMark.dictBench NONE 8192 10000 avgt 5 5529.683 ± 4187.802 us/op
ORCWriterBenchMark.dictBench NONE 8192 2500 avgt 5 5461.490 ± 914.698 us/op
ORCWriterBenchMark.dictBench NONE 8192 500 avgt 5 4745.401 ± 1097.454 us/op
ORCWriterBenchMark.dictBench NONE 10240 10000 avgt 5 4734.983 ± 257.299 us/op
ORCWriterBenchMark.dictBench NONE 10240 2500 avgt 5 4776.043 ± 690.286 us/op
ORCWriterBenchMark.dictBench NONE 10240 500 avgt 5 4750.625 ± 440.191 us/op</pre>
Obviously it is not the case that the larger the init size, the better the performance (considering the cost of resizing and the overhead this brings in terms of metadata size). Again, the right size depends on the typical record size seen by the writer (which, together with the stripe size, determines the number of entries within a stripe). From the benchmark above, enlarging the size only helps when collisions are common (e.g. in the `HASH` case when the upper bound is larger (10k)), but even then the benefit isn't consistent across different values of `upperBound`.
I am therefore exposing `initSize` as a config in `OrcConf` and keeping the default as the original.
[GitHub] [orc] omalley commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r600838359
##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+ private DictionaryUtils() {
+ // Utility class does nothing in constructor
+ }
+
+ public static void getTextInternal(Text result, int position, DynamicIntArray keyOffsets, DynamicByteArray byteArray) {
+ int offset = keyOffsets.get(position);
+ int length;
+ if (position + 1 == keyOffsets.size()) {
+ length = byteArray.size() - offset;
+ } else {
+ length = keyOffsets.get(position + 1) - offset;
+ }
+ byteArray.setText(result, offset, length);
+ }
+
+ static class VisitorContextImpl implements Dictionary.VisitorContext {
Review comment:
I'd think that a common class (named DictionaryStorage?) that implemented VisitorContext and had the byteArray and keyOffsets would make sense. Each of the dictionaries could have a field that held the reference.
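The storage scheme such a shared class would wrap is simple to illustrate. The sketch below mimics the offset/length logic of `getTextInternal` using plain JDK types (`ByteArrayOutputStream`, `ArrayList`) as stand-ins for ORC's `DynamicByteArray` and `DynamicIntArray`; the class and method names are invented for the example:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// All keys live in one byte buffer; offsets.get(i) is where key i starts,
// and its length is the gap to the next offset (or to the end of the buffer
// for the last key) -- the same scheme getTextInternal reads from.
public class OffsetStore {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final List<Integer> offsets = new ArrayList<>();

    /** Appends a key and returns its position. */
    public int add(String key) {
        offsets.add(bytes.size());
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        bytes.write(utf8, 0, utf8.length);
        return offsets.size() - 1;
    }

    /** Recovers key i, mirroring the offset/length logic of getTextInternal. */
    public String get(int position) {
        int offset = offsets.get(position);
        int end = (position + 1 == offsets.size())
            ? bytes.size() : offsets.get(position + 1);
        return new String(bytes.toByteArray(), offset, end - offset,
            StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        OffsetStore store = new OffsetStore();
        store.add("alice");
        store.add("bob");
        System.out.println(store.get(0) + ", " + store.get(1));
    }
}
```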
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
Review comment:
I suspect that we'd get better performance from using linear probing into a single DynamicIntArray, although we might want to drop the max load factor.
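To make the suggestion concrete, here is a hedged sketch of what linear probing into a single int slot table could look like. It uses plain `String` and `int[]` as stand-ins for ORC's `Text` and `DynamicIntArray`, with a lower max load factor as suggested; the class and field names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Open addressing with linear probing: one flat slot table instead of a
// DynamicIntArray per bucket. A slot holds 0 for empty, otherwise key id + 1.
public class ProbingDictionary {
    private int[] slots;
    private final List<String> keys = new ArrayList<>();
    private static final float MAX_LOAD = 0.5f;  // probing wants spare slots

    public ProbingDictionary(int capacity) {
        slots = new int[capacity];
    }

    /** Returns the id of the key, adding it if not present. */
    public int add(String key) {
        if (keys.size() + 1 > slots.length * MAX_LOAD) {
            rehash(slots.length * 2 + 1);
        }
        int i = Math.floorMod(key.hashCode(), slots.length);
        while (slots[i] != 0) {                  // probe until an empty slot
            int id = slots[i] - 1;
            if (keys.get(id).equals(key)) {
                return id;                       // already in the dictionary
            }
            i = (i + 1) % slots.length;
        }
        keys.add(key);
        slots[i] = keys.size();                  // store id + 1
        return keys.size() - 1;
    }

    public int size() {
        return keys.size();
    }

    private void rehash(int newCapacity) {
        int[] old = slots;
        slots = new int[newCapacity];
        for (int slot : old) {
            if (slot != 0) {
                int i = Math.floorMod(keys.get(slot - 1).hashCode(), newCapacity);
                while (slots[i] != 0) {
                    i = (i + 1) % newCapacity;
                }
                slots[i] = slot;
            }
        }
    }

    public static void main(String[] args) {
        ProbingDictionary dict = new ProbingDictionary(4);
        System.out.println(dict.add("alice"));   // new entry
        System.out.println(dict.add("bob"));     // new entry
        System.out.println(dict.add("alice"));   // existing entry, same id
    }
}
```

Compared to chaining, this avoids allocating a `DynamicIntArray` per bucket and keeps probes in one contiguous array, which is the cache-friendliness being alluded to.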
##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+ private DictionaryUtils() {
Review comment:
Actually, I don't mind this style for utility classes, since it prevents anyone from accidentally creating a useless instance. That said, I suspect that this should probably be refactored.
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
Review comment:
Yeah, that is too large. If we have more than a handful of collisions, either the table is too small or the hash function isn't good.
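A back-of-the-envelope check of that intuition: with a uniform hash at load factor 0.75, long chains should be vanishingly rare. A small simulation (illustrative only; the numbers and names are for the example, not from the PR):

```java
import java.util.Random;

// Throws `entries` uniformly hashed keys into `capacity` buckets and
// reports the longest resulting chain.
public class ChainLengthSim {
    static int longestChain(int capacity, int entries, long seed) {
        int[] counts = new int[capacity];
        Random rand = new Random(seed);
        for (int i = 0; i < entries; i++) {
            counts[Math.floorMod(rand.nextInt(), capacity)]++;
        }
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // 4096 buckets at load factor 0.75 -> 3072 entries.
        System.out.println("longest chain: " + longestChain(4096, 3072, 42L));
    }
}
```

The longest chain stays in the single digits, which supports sizing buckets far below 20 when the hash is good.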
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size() ; i ++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ Arrays.fill(hashArray, null);
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
+ newKey.set(bytes, offset, length);
+ return add(newKey);
+ }
+
+ public int add(Text text) {
+ resizeIfNeeded();
+
+ int index = getIndex(text);
+ DynamicIntArray candidateArray = hashArray[index];
+
+ newKey.set(text);
+
+ Text tmpText = new Text();
+ for (int i = 0; i < candidateArray.size(); i++) {
+ getText(tmpText, candidateArray.get(i));
+ if (tmpText.equals(newKey)) {
+ return candidateArray.get(i);
+ }
+ }
+
+ // if making it here, it means no match.
+ int len = newKey.getLength();
+ int currIdx = keyOffsets.size();
+ keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+ candidateArray.add(currIdx);
+ return currIdx;
+ }
+
+ private void resizeIfNeeded() {
+ if (keyOffsets.size() >= threshold) {
+ int oldCapacity = keyOffsets.size();
+ int newCapacity = (oldCapacity << 1) + 1;
+ doResize(newCapacity);
+ this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+ }
+
+ @Override
+ public int size() {
+ return keyOffsets.size();
+ }
+
+ /**
+ * Compute the hash value and find the corresponding index.
+ *
+ */
+ int getIndex(Text text) {
+ return (text.hashCode() & 0x7FFFFFFF) % capacity;
Review comment:
I'd be tempted to use Math.floorMod here, which would always be positive.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613763763
##########
File path: java/core/src/test/org/apache/orc/TestStringDictionary.java
##########
@@ -114,10 +117,8 @@ public void testTooManyDistinct() throws Exception {
}
}
- @Test
- public void testHalfDistinct() throws Exception {
+ public void testHalfDistinctHelper(Configuration conf) throws Exception {
Review comment:
Yes, that makes sense, will address.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621720363
##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/ORCWriterBenchMark.java
##########
@@ -0,0 +1,149 @@
+
+package org.apache.orc.bench.hive;
+
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.core.IOCounters;
+import org.apache.orc.bench.core.OrcBenchmark;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.runner.Runner;
+
+import com.google.auto.service.AutoService;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@State(Scope.Thread)
+@AutoService(OrcBenchmark.class)
+public class ORCWriterBenchMark implements OrcBenchmark {
+ private static final Path root = Utilities.getBenchmarkRoot();
+ private Path dumpDir = new Path(root, "dumpDir");
+ private List<VectorizedRowBatch> batches = new ArrayList<>();
+
+ @Override
+ public String getName() {
+ return "write";
+ }
+
+ @Override
+ public String getDescription() {
+ return "Benchmark ORC Writer with different options";
+ }
+
+ @Setup(Level.Trial)
+ public void prepareData() {
+ TypeDescription schema = TypeDescription.fromString("struct<str:string>");
+ VectorizedRowBatch batch = schema.createRowBatch();
+ BytesColumnVector strVector = (BytesColumnVector) batch.cols[0];
+
+ Random rand = new Random();
+ for (int i = 0; i < 32 * 1024; i++) {
+ if (batch.size == batch.getMaxSize()) {
+ batches.add(batch);
+ batch = schema.createRowBatch();
+ strVector = (BytesColumnVector) batch.cols[0];
+ }
+ int randomNum = rand.nextInt(10000 - 10 + 1) + 10;
Review comment:
Addressed.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613781297
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+ /**
+ * A extension for {@link StringHashTableDictionary} for testing purpose by overwriting the hash function.
+ *
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Obtain the prefix for each string as the hash value.
+ * All the string being used in this test suite will contains its hash value as the prefix for the string content.
+ * this way we know the order of the traverse() method.
+ */
+ @Override
+ int getIndex(Text text) {
+ String s = text.toString();
+ int underscore = s.indexOf("_");
+ return Integer.parseInt(text.toString().substring(0, underscore));
+ }
+ }
+
+ @Test
+ public void test1()
+ throws Exception {
+ SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+ // Non-resize trivial cases
+ Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
Review comment:
Hmm, maybe I am missing something, but why is a byte-level check needed, since it is the content of `Text` we care about? Isn't add-then-read enough for correctness purposes?
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620451724
##########
File path: java/core/src/java/org/apache/orc/impl/VisitorContextImpl.java
##########
@@ -0,0 +1,74 @@
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ *
Review comment:
Missing doc?
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r622051672
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ // containing all keys every seen in bytes.
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // containing starting offset of the key (in byte) in the byte array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashBuckets;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int BUCKET_SIZE = 20;
Review comment:
Was just thinking that for the default batch size of 1024, since we use dictionary encoding only when less than 20% of the values are distinct, we would need around 0.6 - 0.7 of the batch size?
In that sense a bucketSize of 5 or 6 could be enough, right?
Might be missing something, but let's add a comment here explaining the logic.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626177117
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ // containing all keys ever seen, in bytes.
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // containing starting offset of the key (in byte) in the byte array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashBuckets;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int BUCKET_SIZE = 20;
Review comment:
My rough estimate:
Given the default stripe size of 64MB, and assuming each row is around 500B:
`4096 * 0.75 (capacity without resize) * avgBucketSize * 5 (20% distinct) = 64 * 1024 * 1024 / 500`
avgBucketSize is ~8 in this case. Let me know if it makes sense/converges with your thoughts.
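The arithmetic above can be sketched as a quick back-of-the-envelope calculation. The stripe size, row width, distinct ratio, capacity, and load factor are the assumed numbers from this thread, not ORC constants:

```java
public class BucketSizeEstimate {
    // Rough estimate of the average chain length per hash bucket, using the
    // assumed numbers from the thread: 64MB stripe, ~500B rows, 20% distinct
    // values, initial capacity 4096, load factor 0.75 (no resize).
    static double avgBucketSize(long stripeBytes, int rowBytes,
                                double distinctRatio, int capacity, double loadFactor) {
        double rows = (double) stripeBytes / rowBytes;   // rows per stripe
        double distinctKeys = rows * distinctRatio;      // keys the dictionary must hold
        double usedBuckets = capacity * loadFactor;      // capacity usable before a resize
        return distinctKeys / usedBuckets;
    }

    public static void main(String[] args) {
        double avg = avgBucketSize(64L * 1024 * 1024, 500, 0.20, 4096, 0.75);
        System.out.printf("avgBucketSize ~ %.1f%n", avg);  // ~8.7, close to the ~8 quoted above
    }
}
```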
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620448557
##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/ORCWriterBenchMark.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench.hive;
+
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.core.IOCounters;
+import org.apache.orc.bench.core.OrcBenchmark;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.runner.Runner;
+
+import com.google.auto.service.AutoService;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@State(Scope.Thread)
+@AutoService(OrcBenchmark.class)
+public class ORCWriterBenchMark implements OrcBenchmark {
+ private static final Path root = Utilities.getBenchmarkRoot();
+ private Path dumpDir = new Path(root, "dumpDir");
+ private List<VectorizedRowBatch> batches = new ArrayList<>();
+
+ @Override
+ public String getName() {
+ return "write";
+ }
+
+ @Override
+ public String getDescription() {
+ return "Benchmark ORC Writer with different options";
+ }
+
+ @Setup(Level.Trial)
+ public void prepareData() {
+ TypeDescription schema = TypeDescription.fromString("struct<str:string>");
+ VectorizedRowBatch batch = schema.createRowBatch();
+ BytesColumnVector strVector = (BytesColumnVector) batch.cols[0];
+
+ Random rand = new Random();
+ for (int i = 0; i < 32 * 1024; i++) {
+ if (batch.size == batch.getMaxSize()) {
+ batches.add(batch);
+ batch = schema.createRowBatch();
+ strVector = (BytesColumnVector) batch.cols[0];
+ }
+ int randomNum = rand.nextInt(10000 - 10 + 1) + 10;
Review comment:
Ideally I would like to see how the number of distinct dict keys affects performance for each data structure.
Can we make the above configurable during data generation?
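One way to make the cardinality configurable during data generation, as the review suggests, is a bounded random-key generator (a sketch; `distinctUpperBound` is a hypothetical parameter, not part of the benchmark as posted):

```java
import java.util.Random;

public class DictDataGen {
    // Sketch: generate string values whose distinct-key count is bounded by a
    // configurable parameter, so the benchmark can vary dictionary cardinality.
    static String[] generate(int rows, int distinctUpperBound, long seed) {
        Random rand = new Random(seed);   // fixed seed keeps benchmark runs reproducible
        String[] out = new String[rows];
        for (int i = 0; i < rows; i++) {
            out[i] = "key-" + rand.nextInt(distinctUpperBound);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] data = generate(1024, 500, 42L);
        long distinct = java.util.Arrays.stream(data).distinct().count();
        System.out.println("rows=" + data.length + ", distinct=" + distinct);
    }
}
```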
[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854855642
> Hi, @pgaref . Is this merged? Could you check GitHub Action status?
>
> * https://github.com/apache/orc/actions/workflows/build_and_test.yml?query=branch%3Amain
>
> <img alt="Screen Shot 2021-06-04 at 8 44 43 AM" width="490" src="https://user-images.githubusercontent.com/9700541/120828249-27eea800-c511-11eb-9bd8-eecde81a61a8.png">
Hey @dongjoon-hyun, yes, this is merged, but it broke some tests when I manually set HASH as the default dictionary.
This should be resolved now with ORC-810.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613779065
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+ /**
+ * An extension of {@link StringHashTableDictionary} for testing purposes, overriding the hash function.
+ *
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Obtain the numeric prefix of each string as the hash value.
+ * Every string used in this test suite contains its hash value as the prefix of the string content;
+ * this way we know the traversal order of the traverse() method.
+ */
+ @Override
+ int getIndex(Text text) {
+ String s = text.toString();
+ int underscore = s.indexOf("_");
+ return Integer.parseInt(text.toString().substring(0, underscore));
+ }
+ }
+
+ @Test
+ public void test1()
+ throws Exception {
+ SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+ // Non-resize trivial cases
+ Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+ Text text = new Text();
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+ // entering the fourth and fifth element which triggers rehash
+ Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+ hashTableDictionary.getText(text, 3);
+ Assert.assertEquals("0_David", text.toString());
+ Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+ hashTableDictionary.getText(text, 4);
+ Assert.assertEquals("4_Eason", text.toString());
+
+ // Re-check that all previously inserted strings still have the correct encoded values
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+
+ // The order of the words is based on each string's prefix, since their index in the hashArray is derived from it.
+ TestStringRedBlackTree
+ .checkContents(hashTableDictionary, new int[]{3, 2, 0, 1, 4}, "0_David", "1_Cindy", "2_Alice", "3_Bob",
+ "4_Eason");
+
Review comment:
Not sure if I understand it. Why do we need to compare different implementations here?
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613780510
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+ /**
+ * An extension of {@link StringHashTableDictionary} for testing purposes, overriding the hash function.
+ *
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Obtain the numeric prefix of each string as the hash value.
+ * Every string used in this test suite contains its hash value as the prefix of the string content;
+ * this way we know the traversal order of the traverse() method.
+ */
+ @Override
+ int getIndex(Text text) {
Review comment:
Will address.
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613430970
##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+ private DictionaryUtils() {
Review comment:
+1. At some point I got feedback to add a private constructor for utility classes and have followed this convention since. Here is the SO page on this topic: https://stackoverflow.com/questions/14398747/hide-utility-class-constructor-utility-classes-should-not-have-a-public-or-def
[GitHub] [orc] autumnust edited a comment on pull request #651: ORC-757: HashTable dictionary
autumnust edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-832850321
> > > Updated the PR. @omalley @pgaref please take another look, thanks.
> >
> >
> > Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
> > Left some mostly minor comments, let me know what you think!
>
> PR seems in pretty good shape already, @autumnust anything to polish/benchmark before going in? Whats the plan?
@pgaref thanks for getting back. I was trying to finalize the bucketSize value and tune things a bit more (for the dict size). Will publish a new benchmark later today. Beyond this PR, I am planning to try linear probing as the collision-resolution method and benchmark it against the current implementation to see if there's any room for improvement.
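For readers following the thread, the linear-probing alternative mentioned here can be sketched in miniature (a hypothetical stand-in, not code from this PR): instead of keeping a chain per bucket, a collision advances to the next slot until the key or an empty slot is found.

```java
public class LinearProbeSketch {
    private final String[] slots;

    public LinearProbeSketch(int capacity) {
        slots = new String[capacity];
    }

    // Returns the slot index for key, inserting it if absent. Assumes the
    // table is never full (a real implementation resizes before that point).
    int put(String key) {
        int i = (key.hashCode() & 0x7FFFFFFF) % slots.length;
        while (slots[i] != null && !slots[i].equals(key)) {
            i = (i + 1) % slots.length;  // collision: probe the next slot
        }
        slots[i] = key;
        return i;
    }

    public static void main(String[] args) {
        LinearProbeSketch table = new LinearProbeSketch(8);
        int first = table.put("Alice");
        int second = table.put("Alice");
        System.out.println(first == second);  // true: duplicates resolve to the same slot
    }
}
```

Open addressing trades the per-bucket array allocations of chaining for better cache locality, which is presumably the improvement the benchmark would measure.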
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613635755
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting byte offset in the byte array for the i-th key.
+ // The two combined store the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size() ; i ++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ Arrays.fill(hashArray, null);
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
+ newKey.set(bytes, offset, length);
+ return add(newKey);
+ }
+
+ public int add(Text text) {
+ resizeIfNeeded();
+
+ int index = getIndex(text);
+ DynamicIntArray candidateArray = hashArray[index];
+
+ newKey.set(text);
+
+ Text tmpText = new Text();
+ for (int i = 0; i < candidateArray.size(); i++) {
+ getText(tmpText, candidateArray.get(i));
+ if (tmpText.equals(newKey)) {
+ return candidateArray.get(i);
+ }
+ }
+
+ // if making it here, it means no match.
+ int len = newKey.getLength();
+ int currIdx = keyOffsets.size();
+ keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+ candidateArray.add(currIdx);
+ return currIdx;
+ }
+
+ private void resizeIfNeeded() {
+ if (keyOffsets.size() >= threshold) {
+ int oldCapacity = keyOffsets.size();
+ int newCapacity = (oldCapacity << 1) + 1;
+ doResize(newCapacity);
+ this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+ }
+
+ @Override
+ public int size() {
+ return keyOffsets.size();
+ }
+
+ /**
+ * Compute the hash value and find the corresponding index.
+ *
+ */
+ int getIndex(Text text) {
+ return (text.hashCode() & 0x7FFFFFFF) % capacity;
+ }
+
+ // Resize the hash table, re-hash all the existing keys.
+ // byteArray and keyOffsetsArray don't have to be re-filled.
+ private void doResize(int newSize) {
+ DynamicIntArray[] resizedHashArray = new DynamicIntArray[newSize];
+ for (int i = 0; i < newSize; i++) {
+ resizedHashArray[i] = new DynamicIntArray();
+ }
+
+ Text tmpText = new Text();
+ for (int i = 0; i < capacity; i++) {
+ DynamicIntArray intArray = hashArray[i];
+ int bucketSize = intArray.size();
+ if (bucketSize > 0) {
+ for (int j = 0; j < bucketSize; j++) {
+ getText(tmpText, intArray.get(j));
+ int newIndex = getIndex(tmpText);
+ resizedHashArray[newIndex].add(intArray.get(j));
+ }
+ }
+ }
+
+ Arrays.fill(hashArray, null);
Review comment:
I was thinking of setting everything to null in case some elements in the array are still referenced, which would keep the space from being garbage-collected. Do you think that's fair? I don't have a strong opinion about it though.
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r626395478
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ // containing all keys ever seen, in bytes.
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // containing starting offset of the key (in byte) in the byte array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashBuckets;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int BUCKET_SIZE = 20;
Review comment:
Hey @autumnust you are right, it is more associated with the stripe size so your estimation makes sense. Thanks for clarifying!
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r620439426
##########
File path: java/bench/hive/src/java/org/apache/orc/bench/hive/ORCWriterBenchMark.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.bench.hive;
+
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.TrackingLocalFileSystem;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.apache.orc.bench.core.IOCounters;
+import org.apache.orc.bench.core.OrcBenchmark;
+import org.apache.orc.bench.core.Utilities;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.runner.Runner;
+
+import com.google.auto.service.AutoService;
+
+
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@State(Scope.Thread)
+@AutoService(OrcBenchmark.class)
+public class ORCWriterBenchMark implements OrcBenchmark {
+ private static final Path root = Utilities.getBenchmarkRoot();
+ private Path dumpDir = new Path(root, "dumpDir");
+ private List<VectorizedRowBatch> batches = new ArrayList<>();
+
+ @Override
+ public String getName() {
+ return "write";
+ }
+
+ @Override
+ public String getDescription() {
+ return "Benchmark ORC Writer with different options";
Review comment:
DICT options for now
[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary
autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-828170668
Here is the new JMH result:
```
# Run complete. Total time: 00:10:35
Benchmark (dictImpl) (upperBound) Mode Cnt Score Error Units
ORCWriterBenchMark.dictBench RBTREE 10000 avgt 5 28939.068 ± 3080.947 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord RBTREE 10000 avgt 5 49.963 #
ORCWriterBenchMark.dictBench:ops RBTREE 10000 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord RBTREE 10000 avgt 5 0.883 ± 0.094 us/op
ORCWriterBenchMark.dictBench:records RBTREE 10000 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench RBTREE 2500 avgt 5 21998.781 ± 1448.300 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord RBTREE 2500 avgt 5 23.532 #
ORCWriterBenchMark.dictBench:ops RBTREE 2500 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord RBTREE 2500 avgt 5 0.671 ± 0.044 us/op
ORCWriterBenchMark.dictBench:records RBTREE 2500 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench RBTREE 500 avgt 5 17730.281 ± 4574.132 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord RBTREE 500 avgt 5 13.156 #
ORCWriterBenchMark.dictBench:ops RBTREE 500 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord RBTREE 500 avgt 5 0.541 ± 0.140 us/op
ORCWriterBenchMark.dictBench:records RBTREE 500 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench HASH 10000 avgt 5 21269.613 ± 4137.763 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord HASH 10000 avgt 5 42.268 #
ORCWriterBenchMark.dictBench:ops HASH 10000 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord HASH 10000 avgt 5 0.649 ± 0.126 us/op
ORCWriterBenchMark.dictBench:records HASH 10000 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench HASH 2500 avgt 5 11586.898 ± 4075.783 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord HASH 2500 avgt 5 17.692 #
ORCWriterBenchMark.dictBench:ops HASH 2500 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord HASH 2500 avgt 5 0.354 ± 0.124 us/op
ORCWriterBenchMark.dictBench:records HASH 2500 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench HASH 500 avgt 5 9646.080 ± 2279.530 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord HASH 500 avgt 5 11.613 #
ORCWriterBenchMark.dictBench:ops HASH 500 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord HASH 500 avgt 5 0.294 ± 0.070 us/op
ORCWriterBenchMark.dictBench:records HASH 500 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench NONE 10000 avgt 5 4077.675 ± 117.606 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord NONE 10000 avgt 5 50.146 #
ORCWriterBenchMark.dictBench:ops NONE 10000 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord NONE 10000 avgt 5 0.124 ± 0.004 us/op
ORCWriterBenchMark.dictBench:records NONE 10000 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench NONE 2500 avgt 5 4607.634 ± 1163.084 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord NONE 2500 avgt 5 50.146 #
ORCWriterBenchMark.dictBench:ops NONE 2500 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord NONE 2500 avgt 5 0.141 ± 0.035 us/op
ORCWriterBenchMark.dictBench:records NONE 2500 avgt 5 163840.000 #
ORCWriterBenchMark.dictBench NONE 500 avgt 5 3783.059 ± 367.511 us/op
ORCWriterBenchMark.dictBench:bytesPerRecord NONE 500 avgt 5 50.146 #
ORCWriterBenchMark.dictBench:ops NONE 500 avgt 5 ≈ 0 #
ORCWriterBenchMark.dictBench:perRecord NONE 500 avgt 5 0.115 ± 0.011 us/op
ORCWriterBenchMark.dictBench:records NONE 500 avgt 5 163840.000 #
```
Unfortunately the previous implementation had a bug that ended up producing great locality (but incorrect results). HASH is still much better than RB-Tree, but obviously we need to iterate further to improve it.
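For intuition on why the hash-based dictionary wins in the numbers above, the asymptotic gap can be sketched with plain JDK collections (TreeMap as a stand-in for the RB-Tree dictionary, HashMap for the hash table — this is a simplified illustration, not the ORC code itself):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class DictLookupSketch {
    // Assign each distinct key the next dictionary id, mirroring Dictionary.add() semantics.
    static int add(Map<String, Integer> dict, String key) {
        Integer existing = dict.get(key);
        if (existing != null) {
            return existing;          // duplicate key: return the existing id
        }
        int id = dict.size();         // new key: next dense id
        dict.put(key, id);
        return id;
    }

    public static void main(String[] args) {
        // TreeMap: O(log n) per add, with a string comparison at every tree level.
        Map<String, Integer> rbTree = new TreeMap<>();
        // HashMap: O(1) expected per add, one hash plus a short chain scan.
        Map<String, Integer> hash = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            // 2500 distinct values, like the upperBound=2500 benchmark run above
            String key = "key-" + (i % 2_500);
            add(rbTree, key);
            add(hash, key);
        }
        System.out.println(rbTree.size() + " " + hash.size());
    }
}
```

Both structures end up with the same 2500-entry dictionary; the difference is purely in per-add cost.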
[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-842585252
Hey @autumnust thanks for pinging me, got distracted by various tasks last week!
Also thanks for updating the benchmark, this helps a lot. Some comments:
- I would expect the NONE dictImpl bench results to be identical across runs -- why is the 4096 x 500 run 3x the others?
- Seems like an init size of 4k could be a good default conf in most of the cases -- would it make sense to remove the extra conf completely if we can avoid it?
- Minor: I would rename upperBound to distinctCount
Let me know what you think
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621639554
##########
File path: java/bench/core/src/java/org/apache/orc/bench/core/IOCounters.java
##########
@@ -48,23 +50,26 @@ public void print() {
if (recordCounters != null) {
recordCounters.print();
}
- System.out.println("Reads: " + reads);
- System.out.println("Bytes: " + bytesRead);
+ System.out.println("io: " + io);
+ System.out.println("Bytes: " + bytesIO);
}
public double bytesPerRecord() {
return recordCounters == null || recordCounters.records == 0 ?
- 0 : ((double) bytesRead) / recordCounters.records;
+ 0 : ((double) bytesIO) / recordCounters.records;
}
public long records() {
return recordCounters == null || recordCounters.invocations == 0 ?
0 : recordCounters.records / recordCounters.invocations;
}
- public long reads() {
+ /**
+ * Capture the number of I/O on average in each invocation.
+ */
+ public long iOs() {
Review comment:
thanks that's a much better name !
##########
File path: java/bench/core/src/java/org/apache/orc/bench/core/IOCounters.java
##########
- public long reads() {
+ /**
+ * Capture the number of I/O on average in each invocation.
+ */
+ public long iOs() {
Review comment:
Second thoughts: there's no "per second" involved, so I may just call it `ops`, wdyt?
[GitHub] [orc] pgaref commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r598636479
##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+ private DictionaryUtils() {
Review comment:
Remove empty constructor and add comment as class doc?
##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+ private DictionaryUtils() {
+ // Utility class does nothing in constructor
+ }
+
+ public static void getTextInternal(Text result, int position, DynamicIntArray keyOffsets, DynamicByteArray byteArray) {
Review comment:
doc method please?
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never shrinks,
+ * as shrinking shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size() ; i ++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ Arrays.fill(hashArray, null);
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
Review comment:
resizeIfNeeded is called twice here -- believe it's safe to remove this call
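The add() path being reviewed here (hash to a bucket, scan the chain for an existing key, otherwise append and record the new id) can be sketched with plain JDK types — a simplified stand-in for illustration, not the ORC class:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChainedDictSketch {
    private final List<byte[]> keys = new ArrayList<>();   // stand-in for byteArray + keyOffsets
    private final List<List<Integer>> buckets;             // stand-in for hashArray

    ChainedDictSketch(int capacity) {
        buckets = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    int add(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        List<Integer> chain = buckets.get((key.hashCode() & 0x7FFFFFFF) % buckets.size());
        for (int id : chain) {                             // scan the chain for an existing entry
            if (Arrays.equals(keys.get(id), bytes)) {
                return id;
            }
        }
        int id = keys.size();                              // no match: assign the next dictionary id
        keys.add(bytes);
        chain.add(id);
        return id;
    }

    public static void main(String[] args) {
        ChainedDictSketch dict = new ChainedDictSketch(8);
        System.out.println(dict.add("Alice"));  // 0
        System.out.println(dict.add("Bob"));    // 1
        System.out.println(dict.add("Alice"));  // 0: duplicate add returns the existing id
    }
}
```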
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+ /**
+ * An extension of {@link StringHashTableDictionary} for testing purposes, overriding the hash function.
+ *
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Obtain the prefix of each string as the hash value.
+ * Every string used in this test suite contains its hash value as a prefix of the string content,
+ * so we know the order in which traverse() visits entries.
+ */
+ @Override
+ int getIndex(Text text) {
Review comment:
Shall we have a test where the actual HashFunction is used as well?
##########
File path: java/core/src/test/org/apache/orc/TestStringDictionary.java
##########
@@ -114,10 +117,8 @@ public void testTooManyDistinct() throws Exception {
}
}
- @Test
- public void testHalfDistinct() throws Exception {
+ public void testHalfDistinctHelper(Configuration conf) throws Exception {
Review comment:
Shall we make this a Parametrized test where @Parameter is going to be the DICT implementations -- other tests with dict encoding could also benefit from this
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
Review comment:
explain the MAX_ARRAY_SIZE limit and how it relates to the hash MASK below that ignores the sign bit
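To the reviewer's point about the mask: `hashCode()` may be negative, and `& 0x7FFFFFFF` clears the sign bit so the modulo always yields a valid, non-negative bucket index. A standalone sketch (not the ORC class):

```java
public class HashMaskSketch {
    // Mirrors getIndex(): clear the sign bit, then reduce modulo capacity.
    static int bucketIndex(String s, int capacity) {
        return (s.hashCode() & 0x7FFFFFFF) % capacity;
    }

    public static void main(String[] args) {
        int capacity = 16;
        // "polygenelubricants" is a well-known string whose hashCode() is Integer.MIN_VALUE.
        String nasty = "polygenelubricants";
        System.out.println(nasty.hashCode() == Integer.MIN_VALUE);
        int idx = bucketIndex(nasty, capacity);
        // The mask keeps the index in [0, capacity) even for the most negative hash.
        System.out.println(idx >= 0 && idx < capacity);
    }
}
```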
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
Review comment:
Not sure each bucket should be initialized to the default 8K ints -- are we expecting that many collisions?
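One way to address this concern (a hypothetical alternative for illustration, not what the PR does) is to allocate a bucket only on first insertion into it, so empty buckets cost one null reference instead of a pre-sized int array:

```java
import java.util.ArrayList;
import java.util.List;

public class LazyBucketsSketch {
    private final List<Integer>[] buckets;
    private int allocated = 0;   // how many buckets were ever materialized

    @SuppressWarnings("unchecked")
    LazyBucketsSketch(int capacity) {
        buckets = new List[capacity];  // all null: no per-bucket allocation up front
    }

    void add(int bucket, int dictId) {
        if (buckets[bucket] == null) {
            // Allocate lazily, only for buckets actually hit; keep the initial
            // size small since chains should be short with a decent hash.
            buckets[bucket] = new ArrayList<>(4);
            allocated++;
        }
        buckets[bucket].add(dictId);
    }

    public static void main(String[] args) {
        LazyBucketsSketch t = new LazyBucketsSketch(1024);
        t.add(7, 0);
        t.add(7, 1);
        t.add(42, 2);
        System.out.println(t.allocated);  // only 2 of 1024 buckets ever allocated
    }
}
```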
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+ /**
+ * Compute the hash value and find the corresponding index.
+ *
+ */
+ int getIndex(Text text) {
+ return (text.hashCode() & 0x7FFFFFFF) % capacity;
+ }
+
+ // Resize the hash table, re-hash all the existing keys.
+ // byteArray and keyOffsetsArray don't have to be re-filled.
+ private void doResize(int newSize) {
+ DynamicIntArray[] resizedHashArray = new DynamicIntArray[newSize];
+ for (int i = 0; i < newSize; i++) {
+ resizedHashArray[i] = new DynamicIntArray();
+ }
+
+ Text tmpText = new Text();
+ for (int i = 0; i < capacity; i++) {
+ DynamicIntArray intArray = hashArray[i];
+ int bucketSize = intArray.size();
+ if (bucketSize > 0) {
+ for (int j = 0; j < bucketSize; j++) {
+ getText(tmpText, intArray.get(j));
+ int newIndex = getIndex(tmpText);
+ resizedHashArray[newIndex].add(intArray.get(j));
+ }
+ }
+ }
+
+ Arrays.fill(hashArray, null);
Review comment:
we reassign hashArray below, is this really needed?
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+ // Resize the hash table, re-hash all the existing keys.
+ // byteArray and keyOffsetsArray don't have to be re-filled.
+ private void doResize(int newSize) {
+ DynamicIntArray[] resizedHashArray = new DynamicIntArray[newSize];
+ for (int i = 0; i < newSize; i++) {
+ resizedHashArray[i] = new DynamicIntArray();
+ }
+
+ Text tmpText = new Text();
+ for (int i = 0; i < capacity; i++) {
+ DynamicIntArray intArray = hashArray[i];
+ int bucketSize = intArray.size();
+ if (bucketSize > 0) {
Review comment:
unnecessary condition -- already handled by the loop
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // starting offset of key-in-byte in the byte array for the i-th key.
+ // Two things combined stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
Review comment:
Lets clarify what each of hashArray, keyOffsets, byteArray represent.
Seems like hashArray is storing a String hash (int) to a KeyOffset list -- let's clarify and make the variable names more representative.
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+public class TestStringHashTableDictionary {
+
+ /**
+ * An extension of {@link StringHashTableDictionary} for testing purposes
+ * that overrides the hash function.
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Use each string's numeric prefix as its hash value.
+ * Every string used in this test suite carries its hash value as a prefix,
+ * so the visiting order of the traverse() method is known in advance.
+ */
+ @Override
+ int getIndex(Text text) {
+ String s = text.toString();
+ int underscore = s.indexOf("_");
+ return Integer.parseInt(text.toString().substring(0, underscore));
+ }
+ }
+
+ @Test
+ public void test1()
+ throws Exception {
+ SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+ // Non-resize trivial cases
+ Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+ Text text = new Text();
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+ // Adding the fourth and fifth elements, which triggers a rehash
+ Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+ hashTableDictionary.getText(text, 3);
+ Assert.assertEquals("0_David", text.toString());
+ Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+ hashTableDictionary.getText(text, 4);
+ Assert.assertEquals("4_Eason", text.toString());
+
+ // Re-check that all previously added strings still map to the correct encoded values
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+
+ // The traversal order of the words follows each string's numeric prefix, since that prefix determines its bucket index in hashArray.
+ TestStringRedBlackTree
+ .checkContents(hashTableDictionary, new int[]{3, 2, 0, 1, 4}, "0_David", "1_Cindy", "2_Alice", "3_Bob",
+ "4_Eason");
+
Review comment:
Should we also assert that the size reported by the hash-table and RB-tree implementations is the same?
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A hash table representation of a dictionary. The strings are stored as UTF-8 bytes
+ * with an offset for each entry. Collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never shrinks,
+ * since that should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // Starting offset in byteArray, in bytes, of the i-th key.
+ // Together with byteArray, this stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static final float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size(); i++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ // Clear each bucket rather than nulling it out, so the table stays usable after clear().
+ for (DynamicIntArray bucket : hashArray) {
+ bucket.clear();
+ }
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
+ newKey.set(bytes, offset, length);
+ return add(newKey);
+ }
+
+ public int add(Text text) {
+ resizeIfNeeded();
+
+ int index = getIndex(text);
+ DynamicIntArray candidateArray = hashArray[index];
+
+ newKey.set(text);
+
+ Text tmpText = new Text();
+ for (int i = 0; i < candidateArray.size(); i++) {
+ getText(tmpText, candidateArray.get(i));
+ if (tmpText.equals(newKey)) {
+ return candidateArray.get(i);
+ }
+ }
+
+ // if making it here, it means no match.
+ int len = newKey.getLength();
+ int currIdx = keyOffsets.size();
+ keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+ candidateArray.add(currIdx);
+ return currIdx;
+ }
+
+ private void resizeIfNeeded() {
+ if (keyOffsets.size() >= threshold) {
+ // Grow relative to the bucket-array capacity, not the current key count.
+ int newCapacity = (capacity << 1) + 1;
+ doResize(newCapacity);
+ this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+ }
+
+ @Override
+ public int size() {
+ return keyOffsets.size();
+ }
+
+ /**
+ * Compute the hash value and find the corresponding index.
+ *
+ */
+ int getIndex(Text text) {
+ return (text.hashCode() & 0x7FFFFFFF) % capacity;
Review comment:
Let's add a comment explaining the masking: it clears the sign bit so the bucket index is always non-negative.
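A hedged, standalone illustration of that masking (plain Java, independent of the ORC classes): Java's `%` keeps the sign of the dividend, so a negative `hashCode()` would yield a negative bucket index, and `Math.abs` is not a safe alternative because `Math.abs(Integer.MIN_VALUE)` overflows back to a negative number.

```java
class BucketIndex {
    // Clearing the sign bit guarantees a value in [0, capacity).
    static int bucketWithMask(int hash, int capacity) {
        return (hash & 0x7FFFFFFF) % capacity;
    }

    // Without the mask, a negative hash yields a negative bucket index,
    // which would throw ArrayIndexOutOfBoundsException on hashArray[index].
    static int bucketNaive(int hash, int capacity) {
        return hash % capacity;
    }
}
```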
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A hash table representation of a dictionary. The strings are stored as UTF-8 bytes
+ * with an offset for each entry. Collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never shrinks,
+ * since that should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // Starting offset in byteArray, in bytes, of the i-th key.
+ // Together with byteArray, this stores the key array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashArray;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static final float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ public StringHashTableDictionary(int initialCapacity) {
+ this(initialCapacity, DEFAULT_LOAD_FACTOR);
+ }
+
+ public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+ this.capacity = initialCapacity;
+ this.loadFactor = loadFactor;
+ this.keyOffsets = new DynamicIntArray(initialCapacity);
+ this.hashArray = initHashArray(initialCapacity);
+ this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+ }
+
+ private DynamicIntArray[] initHashArray(int capacity) {
+ DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+ for (int i = 0; i < capacity; i++) {
+ bucket[i] = new DynamicIntArray();
+ }
+ return bucket;
+ }
+
+ @Override
+ public void visit(Visitor visitor)
+ throws IOException {
+ traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+ }
+
+ private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+ for (DynamicIntArray intArray : hashArray) {
+ for (int i = 0; i < intArray.size(); i++) {
+ context.setPosition(intArray.get(i));
+ visitor.visit(context);
+ }
+ }
+ }
+
+ @Override
+ public void clear() {
+ byteArray.clear();
+ keyOffsets.clear();
+ // Clear each bucket rather than nulling it out, so the table stays usable after clear().
+ for (DynamicIntArray bucket : hashArray) {
+ bucket.clear();
+ }
+ }
+
+ @Override
+ public void getText(Text result, int position) {
+ DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+ }
+
+ @Override
+ public int add(byte[] bytes, int offset, int length) {
+ resizeIfNeeded();
+ newKey.set(bytes, offset, length);
+ return add(newKey);
+ }
+
+ public int add(Text text) {
+ resizeIfNeeded();
+
+ int index = getIndex(text);
+ DynamicIntArray candidateArray = hashArray[index];
+
+ newKey.set(text);
Review comment:
Why are we copying `text` into `newKey` before the lookup, instead of only after we know it is a new key?
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+ /**
+ * An extension of {@link StringHashTableDictionary} for testing purposes
+ * that overrides the hash function.
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Use each string's numeric prefix as its hash value.
+ * Every string used in this test suite carries its hash value as a prefix,
+ * so the visiting order of the traverse() method is known in advance.
+ */
+ @Override
+ int getIndex(Text text) {
+ String s = text.toString();
+ int underscore = s.indexOf("_");
+ return Integer.parseInt(text.toString().substring(0, underscore));
+ }
+ }
+
+ @Test
+ public void test1()
+ throws Exception {
+ SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+ // Non-resize trivial cases
+ Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
Review comment:
Should we also check getSizeInBytes() after these additions?
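One hedged way to make that assertion concrete (assuming, which this thread does not confirm, that `getSizeInBytes()` counts only the stored key bytes): duplicate adds store nothing new, so only the distinct keys contribute their UTF-8 lengths.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashSet;

class ExpectedBytes {
    /** Sums the UTF-8 lengths of the distinct keys, ignoring duplicate adds. */
    static int expectedKeyBytes(String... addedKeys) {
        int total = 0;
        for (String key : new LinkedHashSet<>(Arrays.asList(addedKeys))) {
            total += key.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }
}
```

For the five adds above, that would predict 7 + 5 + 7 = 19 bytes, since the repeated "2_Alice" and "3_Bob" adds are deduplicated.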
##########
File path: java/core/src/test/org/apache/orc/impl/TestStringHashTableDictionary.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import org.apache.hadoop.io.Text;
+import org.junit.Assert;
+import org.junit.Test;
+
+
+public class TestStringHashTableDictionary {
+
+ /**
+ * An extension of {@link StringHashTableDictionary} for testing purposes
+ * that overrides the hash function.
+ */
+ private static class SimpleHashDictionary extends StringHashTableDictionary {
+ public SimpleHashDictionary(int initialCapacity) {
+ super(initialCapacity);
+ }
+
+ /**
+ * Use each string's numeric prefix as its hash value.
+ * Every string used in this test suite carries its hash value as a prefix,
+ * so the visiting order of the traverse() method is known in advance.
+ */
+ @Override
+ int getIndex(Text text) {
+ String s = text.toString();
+ int underscore = s.indexOf("_");
+ return Integer.parseInt(text.toString().substring(0, underscore));
+ }
+ }
+
+ @Test
+ public void test1()
+ throws Exception {
+ SimpleHashDictionary hashTableDictionary = new SimpleHashDictionary(5);
+ // Non-resize trivial cases
+ Assert.assertEquals(0, hashTableDictionary.getSizeInBytes());
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(0, hashTableDictionary.add(new Text("2_Alice")));
+ Assert.assertEquals(1, hashTableDictionary.add(new Text("3_Bob")));
+ Assert.assertEquals(2, hashTableDictionary.add(new Text("1_Cindy")));
+
+ Text text = new Text();
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+ // Adding the fourth and fifth elements, which triggers a rehash
+ Assert.assertEquals(3, hashTableDictionary.add(new Text("0_David")));
+ hashTableDictionary.getText(text, 3);
+ Assert.assertEquals("0_David", text.toString());
+ Assert.assertEquals(4, hashTableDictionary.add(new Text("4_Eason")));
+ hashTableDictionary.getText(text, 4);
+ Assert.assertEquals("4_Eason", text.toString());
+
+ // Re-check that all previously added strings still map to the correct encoded values
+ hashTableDictionary.getText(text, 0);
+ Assert.assertEquals("2_Alice", text.toString());
+ hashTableDictionary.getText(text, 1);
+ Assert.assertEquals("3_Bob", text.toString());
+ hashTableDictionary.getText(text, 2);
+ Assert.assertEquals("1_Cindy", text.toString());
+
+
+ // The traversal order of the words follows each string's numeric prefix, since that prefix determines its bucket index in hashArray.
+ TestStringRedBlackTree
Review comment:
This is a bit confusing -- I would just copy the checkContents method here, or create a utility class if we want to reuse it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-826976373
> Updated the PR. @omalley @pgaref please take another look, thanks.
Hey @autumnust thanks for the changes, latest PR looks pretty good -- JMH extension also helps a lot!
Left some mostly minor comments, let me know what you think!
[GitHub] [orc] dongjoon-hyun edited a comment on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-854805204
Hi, @pgaref . Is this merged? Could you check GitHub Action status?
- https://github.com/apache/orc/actions/workflows/build_and_test.yml?query=branch%3Amain
<img width="490" alt="Screen Shot 2021-06-04 at 8 44 43 AM" src="https://user-images.githubusercontent.com/9700541/120828249-27eea800-c511-11eb-9bd8-eecde81a61a8.png">
[GitHub] [orc] pgaref commented on pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-828353933
> Here is the new jmh result:
>
> ```
> Benchmark (dictImpl) (upperBound) Mode Cnt Score Error Units
> ORCWriterBenchMark.dictBench RBTREE 10000 avgt 5 28939.068 ± 3080.947 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord RBTREE 10000 avgt 5 49.963 #
> ORCWriterBenchMark.dictBench:ops RBTREE 10000 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord RBTREE 10000 avgt 5 0.883 ± 0.094 us/op
> ORCWriterBenchMark.dictBench:records RBTREE 10000 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench RBTREE 2500 avgt 5 21998.781 ± 1448.300 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord RBTREE 2500 avgt 5 23.532 #
> ORCWriterBenchMark.dictBench:ops RBTREE 2500 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord RBTREE 2500 avgt 5 0.671 ± 0.044 us/op
> ORCWriterBenchMark.dictBench:records RBTREE 2500 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench RBTREE 500 avgt 5 17730.281 ± 4574.132 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord RBTREE 500 avgt 5 13.156 #
> ORCWriterBenchMark.dictBench:ops RBTREE 500 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord RBTREE 500 avgt 5 0.541 ± 0.140 us/op
> ORCWriterBenchMark.dictBench:records RBTREE 500 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench HASH 10000 avgt 5 21269.613 ± 4137.763 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord HASH 10000 avgt 5 42.268 #
> ORCWriterBenchMark.dictBench:ops HASH 10000 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord HASH 10000 avgt 5 0.649 ± 0.126 us/op
> ORCWriterBenchMark.dictBench:records HASH 10000 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench HASH 2500 avgt 5 11586.898 ± 4075.783 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord HASH 2500 avgt 5 17.692 #
> ORCWriterBenchMark.dictBench:ops HASH 2500 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord HASH 2500 avgt 5 0.354 ± 0.124 us/op
> ORCWriterBenchMark.dictBench:records HASH 2500 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench HASH 500 avgt 5 9646.080 ± 2279.530 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord HASH 500 avgt 5 11.613 #
> ORCWriterBenchMark.dictBench:ops HASH 500 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord HASH 500 avgt 5 0.294 ± 0.070 us/op
> ORCWriterBenchMark.dictBench:records HASH 500 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench NONE 10000 avgt 5 4077.675 ± 117.606 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord NONE 10000 avgt 5 50.146 #
> ORCWriterBenchMark.dictBench:ops NONE 10000 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord NONE 10000 avgt 5 0.124 ± 0.004 us/op
> ORCWriterBenchMark.dictBench:records NONE 10000 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench NONE 2500 avgt 5 4607.634 ± 1163.084 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord NONE 2500 avgt 5 50.146 #
> ORCWriterBenchMark.dictBench:ops NONE 2500 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord NONE 2500 avgt 5 0.141 ± 0.035 us/op
> ORCWriterBenchMark.dictBench:records NONE 2500 avgt 5 163840.000 #
> ORCWriterBenchMark.dictBench NONE 500 avgt 5 3783.059 ± 367.511 us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord NONE 500 avgt 5 50.146 #
> ORCWriterBenchMark.dictBench:ops NONE 500 avgt 5 ≈ 0 #
> ORCWriterBenchMark.dictBench:perRecord NONE 500 avgt 5 0.115 ± 0.011 us/op
> ORCWriterBenchMark.dictBench:records NONE 500 avgt 5 163840.000 #
> ```
>
> Unfortunately the previous implementation had a bug that produced great locality (but incorrect results). HASH is still much better than RB-Tree, but we obviously need to iterate further to improve it.
Sounds like this can be improved by reducing collisions, right? Since we have an upper bound on the number of entries per batch, would it make sense to experiment a bit more with hash-table sizes to reduce collisions as much as possible?
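A back-of-envelope sketch of that sizing idea (plain Java; the 0.75 target mirrors the PR's DEFAULT_LOAD_FACTOR, everything else is this sketch's assumption): with n entries spread uniformly over b chained buckets, the expected chain length is n/b, so choosing b at least n / targetLoad keeps chains near the target on average.

```java
class BucketSizing {
    /** Smallest bucket count keeping the expected chain length at or below targetLoad. */
    static int bucketsFor(int maxEntries, double targetLoad) {
        return (int) Math.ceil(maxEntries / targetLoad);
    }

    /** Expected chain length under a uniform hash. */
    static double expectedChainLength(int entries, int buckets) {
        return (double) entries / buckets;
    }
}
```

With the benchmark's 10000-entry upper bound and a 0.75 target, that suggests roughly 13334 buckets.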
[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary
Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r621799552
##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * A hash table representation of a dictionary. The strings are stored as UTF-8 bytes
+ * with an offset for each entry. Collisions are resolved by chaining.
+ *
+ * This implementation is not thread-safe. It also assumes the hash table never shrinks,
+ * since that should not happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+ // containing all keys ever seen, in bytes.
+ private final DynamicByteArray byteArray = new DynamicByteArray();
+ // containing the starting offset (in bytes) of each key in the byte array.
+ private final DynamicIntArray keyOffsets;
+
+ private final Text newKey = new Text();
+
+ private DynamicIntArray[] hashBuckets;
+
+ private int capacity;
+
+ private int threshold;
+
+ private float loadFactor;
+
+ private static final float DEFAULT_LOAD_FACTOR = 0.75f;
+
+ private static final int BUCKET_SIZE = 20;
Review comment:
It is just randomly picked ...