You are viewing a plain text version of this content. The canonical link for it is here.

Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/12/15 13:43:52 UTC

[GitHub] [hive] pgaref opened a new pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

pgaref opened a new pull request #1783:
URL: https://github.com/apache/hive/pull/1783


   ### What changes were proposed in this pull request?
   OrcInputFormat  getDesiredRowTypeDescr method should use column delimiter to generate column names.
   
   ### Why are the changes needed?
   Current logic can create wrong names when column names contain commas.
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   q file


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref commented on a change in pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref commented on a change in pull request #1783:
URL: https://github.com/apache/hive/pull/1783#discussion_r544233488



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##########
@@ -2675,12 +2676,13 @@ public static TypeDescription convertTypeInfo(TypeInfo info) {
   public static TypeDescription getDesiredRowTypeDescr(Configuration conf,
                                                        boolean isAcidRead,
                                                        int dataColumns) {
-
     String columnNameProperty = null;
     String columnTypeProperty = null;
 
     ArrayList<String> schemaEvolutionColumnNames = null;
     ArrayList<TypeDescription> schemaEvolutionTypeDescrs = null;
+    // Make sure we split colNames using the right Delimiter
+    final String columnNameDelimiter = conf.get(serdeConstants.COLUMN_NAME_DELIMITER, String.valueOf(SerDeUtils.COMMA));

Review comment:
       Hey @abstractdog  thanks for taking a look!
   > this makes me think that in order to use commas in column names, you need to define another column name delimiter, otherwise those are cannot be distinguished from each other, right?
   
   We already check this corner case when creating the **TableDesc** https://github.com/apache/hive/blob/95f3d6512f35839f2fad3cfd608616534e506a4b/ql/src/java/org/apache/hadoop/hive/ql/plan/PlanUtils.java#L562
   
   https://github.com/pgaref/hive/blob/0d2b39ac180d809788f662fbe3271482cd4d909d/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L1562
   **getColumnNameDelimiter** method actually checks for commas in colNames and uses \0 to split them instead.
   
   This PR is just making use of the custom delimiter that was forgotten for OrcInputFormat but is done for others e.g., OrcOutputFormat
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref commented on a change in pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref commented on a change in pull request #1783:
URL: https://github.com/apache/hive/pull/1783#discussion_r544233488



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##########
@@ -2675,12 +2676,13 @@ public static TypeDescription convertTypeInfo(TypeInfo info) {
   public static TypeDescription getDesiredRowTypeDescr(Configuration conf,
                                                        boolean isAcidRead,
                                                        int dataColumns) {
-
     String columnNameProperty = null;
     String columnTypeProperty = null;
 
     ArrayList<String> schemaEvolutionColumnNames = null;
     ArrayList<TypeDescription> schemaEvolutionTypeDescrs = null;
+    // Make sure we split colNames using the right Delimiter
+    final String columnNameDelimiter = conf.get(serdeConstants.COLUMN_NAME_DELIMITER, String.valueOf(SerDeUtils.COMMA));

Review comment:
       Hey @abstractdog  thanks for taking a look!
   > this makes me think that in order to use commas in column names, you need to define another column name delimiter, otherwise those are cannot be distinguished from each other, right?
   We already check this corner case when creating the https://github.com/apache/hive/blob/95f3d6512f35839f2fad3cfd608616534e506a4b/ql/src/java/org/apache/hadoop/hive/ql/plan/PlanUtils.java#L562
   
   https://github.com/pgaref/hive/blob/0d2b39ac180d809788f662fbe3271482cd4d909d/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L1562
   actually checks for commas in colNames and uses \0 to split them instead.
   
   This PR is just making use of the custom delimiter that was forgotten for OrcInputFormat but is done for others e.g., OrcOutputFormat
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref commented on pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref commented on pull request #1783:
URL: https://github.com/apache/hive/pull/1783#issuecomment-745762841


   Hey @abstractdog @maheshk114  can you please take a look?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref commented on pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref commented on pull request #1783:
URL: https://github.com/apache/hive/pull/1783#issuecomment-757916044


   Shall we push this change?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref edited a comment on pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref edited a comment on pull request #1783:
URL: https://github.com/apache/hive/pull/1783#issuecomment-757916044


   Thanks for taking a look @abstractdog ! Shall we push this change before getting out-of date?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] abstractdog merged pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

abstractdog merged pull request #1783:
URL: https://github.com/apache/hive/pull/1783


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref commented on a change in pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref commented on a change in pull request #1783:
URL: https://github.com/apache/hive/pull/1783#discussion_r544233488



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##########
@@ -2675,12 +2676,13 @@ public static TypeDescription convertTypeInfo(TypeInfo info) {
   public static TypeDescription getDesiredRowTypeDescr(Configuration conf,
                                                        boolean isAcidRead,
                                                        int dataColumns) {
-
     String columnNameProperty = null;
     String columnTypeProperty = null;
 
     ArrayList<String> schemaEvolutionColumnNames = null;
     ArrayList<TypeDescription> schemaEvolutionTypeDescrs = null;
+    // Make sure we split colNames using the right Delimiter
+    final String columnNameDelimiter = conf.get(serdeConstants.COLUMN_NAME_DELIMITER, String.valueOf(SerDeUtils.COMMA));

Review comment:
       Hey @abstractdog  thanks for taking a look!
   > this makes me think that in order to use commas in column names, you need to define another column name delimiter, otherwise those are cannot be distinguished from each other, right?
   
   We already check this corner case when creating the **TableDesc** https://github.com/apache/hive/blob/95f3d6512f35839f2fad3cfd608616534e506a4b/ql/src/java/org/apache/hadoop/hive/ql/plan/PlanUtils.java#L562
   
   https://github.com/pgaref/hive/blob/0d2b39ac180d809788f662fbe3271482cd4d909d/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L1562
   **getColumnNameDelimiter** method actually checks for commas in colNames and uses `\0` to split them instead.
   
   This PR is just making use of the custom delimiter that was forgotten for OrcInputFormat but is done for others e.g., OrcOutputFormat
   
   In your example above the DELIMITER will be `\0` to avoid colName splitting issues
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] abstractdog commented on a change in pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

abstractdog commented on a change in pull request #1783:
URL: https://github.com/apache/hive/pull/1783#discussion_r544085969



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##########
@@ -2675,12 +2676,13 @@ public static TypeDescription convertTypeInfo(TypeInfo info) {
   public static TypeDescription getDesiredRowTypeDescr(Configuration conf,
                                                        boolean isAcidRead,
                                                        int dataColumns) {
-
     String columnNameProperty = null;
     String columnTypeProperty = null;
 
     ArrayList<String> schemaEvolutionColumnNames = null;
     ArrayList<TypeDescription> schemaEvolutionTypeDescrs = null;
+    // Make sure we split colNames using the right Delimiter
+    final String columnNameDelimiter = conf.get(serdeConstants.COLUMN_NAME_DELIMITER, String.valueOf(SerDeUtils.COMMA));

Review comment:
       this makes me think that in order to use commas in column names, you need to define another column name delimiter, otherwise those are cannot be distinguished from each other, right?
   I mean, could you please include an example where the configuration can be used for this purpose? I haven't seen that in q test, what's the valid use-case for multiple columns? what happens if you try to do something like:
   ```
   create table test_n4 (`x,y` int, z int);
   select `x,y`, z from test_n4 where `x,y` >= 2 and z = 0;
   ```
   other than this, the patch is simple and neat, which I like :)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref commented on a change in pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref commented on a change in pull request #1783:
URL: https://github.com/apache/hive/pull/1783#discussion_r544233488



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##########
@@ -2675,12 +2676,13 @@ public static TypeDescription convertTypeInfo(TypeInfo info) {
   public static TypeDescription getDesiredRowTypeDescr(Configuration conf,
                                                        boolean isAcidRead,
                                                        int dataColumns) {
-
     String columnNameProperty = null;
     String columnTypeProperty = null;
 
     ArrayList<String> schemaEvolutionColumnNames = null;
     ArrayList<TypeDescription> schemaEvolutionTypeDescrs = null;
+    // Make sure we split colNames using the right Delimiter
+    final String columnNameDelimiter = conf.get(serdeConstants.COLUMN_NAME_DELIMITER, String.valueOf(SerDeUtils.COMMA));

Review comment:
       Hey @abstractdog  thanks for taking a look!
   > this makes me think that in order to use commas in column names, you need to define another column name delimiter, otherwise those are cannot be distinguished from each other, right?
   
   We already check this corner case when creating the https://github.com/apache/hive/blob/95f3d6512f35839f2fad3cfd608616534e506a4b/ql/src/java/org/apache/hadoop/hive/ql/plan/PlanUtils.java#L562
   
   https://github.com/pgaref/hive/blob/0d2b39ac180d809788f662fbe3271482cd4d909d/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L1562
   actually checks for commas in colNames and uses \0 to split them instead.
   
   This PR is just making use of the custom delimiter that was forgotten for OrcInputFormat but is done for others e.g., OrcOutputFormat
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] pgaref commented on pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

pgaref commented on pull request #1783:
URL: https://github.com/apache/hive/pull/1783#issuecomment-755961849


   Gentle ping @abstractdog  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] abstractdog commented on a change in pull request #1783: HIVE-24539: OrcInputFormat schema generation should respect column delimiter

Posted by GitBox <gi...@apache.org>.

abstractdog commented on a change in pull request #1783:
URL: https://github.com/apache/hive/pull/1783#discussion_r553259234



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
##########
@@ -2675,12 +2676,13 @@ public static TypeDescription convertTypeInfo(TypeInfo info) {
   public static TypeDescription getDesiredRowTypeDescr(Configuration conf,
                                                        boolean isAcidRead,
                                                        int dataColumns) {
-
     String columnNameProperty = null;
     String columnTypeProperty = null;
 
     ArrayList<String> schemaEvolutionColumnNames = null;
     ArrayList<TypeDescription> schemaEvolutionTypeDescrs = null;
+    // Make sure we split colNames using the right Delimiter
+    final String columnNameDelimiter = conf.get(serdeConstants.COLUMN_NAME_DELIMITER, String.valueOf(SerDeUtils.COMMA));

Review comment:
       cool, thanks for clarifying, LGTM
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org