You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/01/27 18:49:30 UTC

[GitHub] [orc] autumnust opened a new pull request #637: [ORC-50] Abstract Dictionary interface and refactoring around that

autumnust opened a new pull request #637:
URL: https://github.com/apache/orc/pull/637


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   This PR contains the very first part to make the implementation of dictionary used for encoding pluggable, by introducing an interface `Dictionary`. The subsequent PR will add HashTable-based Dictionary implementation and compare with the existing RB-tree based dictionary. 
   
   For reviewers: 
   - Please comment if reflection-based initialization will be preferred or not since I didn't found much of that within the code base. 
   - The `Dictionary` class isn't added with generic type declaration (e.g. we could use a generic type `T` to replace `Text`). Please let me know if that's necessary to add. 
   
   ### Why are the changes needed?
   
   ### How was this patch tested?
   Since this is mostly a refactoring, it is just passing all existing unit tests. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] autumnust commented on a change in pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r578864919



##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -105,6 +105,11 @@
           "(default 10000 rows) else dictionary check will happen before\n" +
           "writing first stripe. In both cases, the decision to use\n" +
           "dictionary or not will be retained thereafter."),
+  DICTIONARY_IMPL("orc.dictionary.implementation", null,

Review comment:
       the value for this config is being processed `createDict`. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #637:
URL: https://github.com/apache/orc/pull/637#issuecomment-769960713


   Gentle ping, @autumnust .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #637: ORC-50: Abstract Dictionary interface and refactoring around that

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #637:
URL: https://github.com/apache/orc/pull/637#issuecomment-768637326


   Thank you for making a PR, @autumnust . It seems that many failure. Could you test it locally first?
   ```
   Error:  Tests run: 1146, Failures: 0, Errors: 48, Skipped: 1
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] autumnust commented on a change in pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r578865688



##########
File path: java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java
##########
@@ -55,15 +59,32 @@
   // If the number of keys in a dictionary is greater than this fraction of
   //the total number of non-null rows, turn off dictionary encoding
   private final double dictionaryKeySizeThreshold;
+  protected Dictionary dictionary;
   protected boolean useDictionaryEncoding = true;
   private boolean isDirectV2 = true;
   private boolean doneDictionaryCheck;
   private final boolean strideDictionaryCheck;
 
+  static Dictionary createDict(Configuration conf) {

Review comment:
       addressed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on a change in pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r572309978



##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -37,7 +37,7 @@
   ROW_INDEX_STRIDE("orc.row.index.stride",
       "hive.exec.orc.default.row.index.stride", 10000,
       "Define the default ORC index stride in number of rows. (Stride is the\n"+
-          " number of rows n index entry represents.)"),
+          " number of rows an index entry represents.)"),

Review comment:
       I don't think we need a separate jira/pr for this. I'll commit it separately.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley closed pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
omalley closed pull request #637:
URL: https://github.com/apache/orc/pull/637


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r578607043



##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -105,6 +105,11 @@
           "(default 10000 rows) else dictionary check will happen before\n" +
           "writing first stripe. In both cases, the decision to use\n" +
           "dictionary or not will be retained thereafter."),
+  DICTIONARY_IMPL("orc.dictionary.implementation", null,

Review comment:
       hash option missing?

##########
File path: java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java
##########
@@ -55,15 +59,32 @@
   // If the number of keys in a dictionary is greater than this fraction of
   //the total number of non-null rows, turn off dictionary encoding
   private final double dictionaryKeySizeThreshold;
+  protected Dictionary dictionary;
   protected boolean useDictionaryEncoding = true;
   private boolean isDirectV2 = true;
   private boolean doneDictionaryCheck;
   private final boolean strideDictionaryCheck;
 
+  static Dictionary createDict(Configuration conf) {

Review comment:
       make private?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #637: ORC-50: Abstract Dictionary interface and refactoring around that

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r565700331



##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -37,7 +37,7 @@
   ROW_INDEX_STRIDE("orc.row.index.stride",
       "hive.exec.orc.default.row.index.stride", 10000,
       "Define the default ORC index stride in number of rows. (Stride is the\n"+
-          " number of rows n index entry represents.)"),
+          " number of rows an index entry represents.)"),

Review comment:
       Let's avoid touching irrelevant part. You can fix this as a separate PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] autumnust commented on a change in pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
autumnust commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r579729387



##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -105,6 +105,11 @@
           "(default 10000 rows) else dictionary check will happen before\n" +
           "writing first stripe. In both cases, the decision to use\n" +
           "dictionary or not will be retained thereafter."),
+  DICTIONARY_IMPL("orc.dictionary.implementation", null,

Review comment:
       Thanks, updated the description in a way similar to `BLOOM_FILTER_WRITE_VERSION`. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r579090843



##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -105,6 +105,11 @@
           "(default 10000 rows) else dictionary check will happen before\n" +
           "writing first stripe. In both cases, the decision to use\n" +
           "dictionary or not will be retained thereafter."),
+  DICTIONARY_IMPL("orc.dictionary.implementation", null,

Review comment:
       I guess what I meant was that we should explicitly list the choices (similar to BLOOM_FILTER_WRITE_VERSION):
   * rbTree
   * hash (not yet implemented
   
   Moreover, I would set the hiveConfName same as the attribute -- I believe null means that it does not apply to Hive




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] autumnust commented on pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
autumnust commented on pull request #637:
URL: https://github.com/apache/orc/pull/637#issuecomment-775388464


   > Gentle ping, @autumnust .
   
   Sorry I got preempted by some other thing last week. I will get back to fixing the failure in Github Action ASAP. Thanks @dongjoon-hyun  for your review in first round. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on a change in pull request #637: ORC-747: Abstract Dictionary interface and refactoring

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r572311006



##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -105,6 +105,9 @@
           "(default 10000 rows) else dictionary check will happen before\n" +
           "writing first stripe. In both cases, the decision to use\n" +
           "dictionary or not will be retained thereafter."),
+  DICTIONARY_FACTORY_CLASS_KEY("orc.dictionary.class.key", null,

Review comment:
       Let's make this a simple string with values of "hash" or "rbtree".

##########
File path: java/core/src/java/org/apache/orc/impl/Dictionary.java
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Interface to define the dictionary used for encoding value in columns of specific types like string, char, varchar, etc.
+ */
+public interface Dictionary {
+  int INITIAL_DICTIONARY_SIZE = 4096;
+
+  interface DictionaryFactory {
+    Dictionary createDict(Configuration conf);
+  }
+
+  /**
+   * Traverse the whole dictionary and apply the action.
+   */
+  void visit(Visitor visitor) throws IOException;
+
+  void clear();
+
+  /**
+   * Given the position index, return the original string before being encoded.
+   */
+  void getText(Text result, int originalPosition);
+
+  int add(String value);

Review comment:
       We shouldn't need the add(String).

##########
File path: java/core/src/java/org/apache/orc/impl/Dictionary.java
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Interface to define the dictionary used for encoding value in columns of specific types like string, char, varchar, etc.
+ */
+public interface Dictionary {
+  int INITIAL_DICTIONARY_SIZE = 4096;
+
+  interface DictionaryFactory {

Review comment:
       I don't think I'd bother defining a factory interface.

##########
File path: java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java
##########
@@ -64,6 +64,15 @@
                        WriterEncryptionVariant encryption,
                        WriterContext writer) throws IOException {
     super(schema, encryption, writer);
+    Configuration conf = writer.getConfiguration();
+

Review comment:
       I'd make a function that looks at the configuration value and builds the hash or red-black dictionary directly. I can't see a need to make the dictionary mechanism pluggable.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org