You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@orc.apache.org by "cxzl25 (via GitHub)" <gi...@apache.org> on 2023/04/07 10:24:15 UTC

[GitHub] [orc] cxzl25 opened a new pull request, #1458: ORC-1403: ORC supports reading empty field name

cxzl25 opened a new pull request, #1458:
URL: https://github.com/apache/orc/pull/1458

   ### What changes were proposed in this pull request?
   `ParserUtils` removes empty check
   
   
   ### Why are the changes needed?
   
   https://github.com/apache/spark/pull/35253#issuecomment-1321866542
   
   ```java
   java.lang.IllegalArgumentException: Empty quoted field name at 'struct<``^:string>'
       at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
       at org.apache.orc.impl.ParserUtils.parseStruct(ParserUtils.java:170)
       at org.apache.orc.impl.ParserUtils.parseType(ParserUtils.java:228)
       at org.apache.orc.TypeDescription.fromString(TypeDescription.java:202)
       at org.apache.orc.mapred.OrcInputFormat.buildOptions(OrcInputFormat.java:122)
       at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
       at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:207) 
   ```
   
   ### How was this patch tested?
   add UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] cxzl25 commented on pull request #1458: ORC-1403: ORC supports reading empty field name

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.

cxzl25 commented on PR #1458:
URL: https://github.com/apache/orc/pull/1458#issuecomment-1501858518

   > What is your end goal as a user? For example, did you hit the following?
   
   We have encountered some problems. Some lower versions of Spark use a lower version of ORC to write data with empty field name, but using Spark3.2 ORC 1.6 to read data fails.
   
   We have injected custom rules into Spark through `SparkSessionExtensions` to realize the operation of adding alias automatically, which can avoid this problem.
   
   So I was thinking, it would be better if this problem could be solved at Spark or ORC level.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] deshanxiao commented on pull request #1458: ORC-1403: ORC supports reading empty field name

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.

deshanxiao commented on PR #1458:
URL: https://github.com/apache/orc/pull/1458#issuecomment-1502624145

   Based on the above scenario, in order to avoid some additional side effects, maybe we could skip the limitation by adding a new configuration?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on pull request #1458: ORC-1403: ORC supports reading empty field name

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #1458:
URL: https://github.com/apache/orc/pull/1458#issuecomment-1501422081

   Ya, we can do some. However, instead of switching behaviors back and forth in this layer, I believe we need to focus on your end goals in the higher layers. What is your end goal as a user? For example, did you hit the following? If then, do we have a test coverage in Apache Spark codebase across multiple data sources (Hive,ORC,Parquet)? Can we start from there?
   
   > automatically generate _c1 if no alias is specified


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] cxzl25 commented on pull request #1458: ORC-1403: ORC supports reading empty field name

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.

cxzl25 commented on PR #1458:
URL: https://github.com/apache/orc/pull/1458#issuecomment-1501342079

   > Sorry but the existing way is the correct direction. When Apache Hive (HiveFileFormat) doesn't support it, we should not remove a test coverage, `testQuotedField2` .
   
   Now the Spark orc datasource does not check the field name, but the Spark hive orc format checks the field name, which causes the Spark orc datasource to be able to write the field name. 
   Should we check it at the spark level, and it is not allowed to write the field name?
   ```
   org.apache.spark.sql.execution.datasources.orc.OrcFileFormat 
   org.apache.spark.sql.hive.execution.HiveFileFormat#supportFieldName
   ```
   In addition, can we check the schema in orc writer, otherwise the written data may not be read?
   ```
   org.apache.orc.OrcFile.WriterOptions#setSchema
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1458: ORC-1403: ORC supports reading empty field name

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #1458:
URL: https://github.com/apache/orc/pull/1458#discussion_r1160874542


##########
java/core/src/test/org/apache/orc/TestTypeDescription.java:
##########
@@ -174,14 +174,6 @@ public void testQuotedField1() {
     assertTrue(e.getMessage().contains("Unmatched quote at 'struct<^`abc'"));
   }
 
-  @Test
-  public void testQuotedField2() {
-    IllegalArgumentException e = assertThrows(IllegalArgumentException.class, () -> {
-      TypeDescription.fromString("struct<``:int>");
-    });
-    assertTrue(e.getMessage().contains("Empty quoted field name at 'struct<``^:int>'"));

Review Comment:
   This test coverage means it's a feature before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] deshanxiao commented on a diff in pull request #1458: ORC-1403: ORC supports reading empty field name

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.

deshanxiao commented on code in PR #1458:
URL: https://github.com/apache/orc/pull/1458#discussion_r1162250628


##########
java/core/src/test/org/apache/orc/TestTypeDescription.java:
##########
@@ -174,14 +174,6 @@ public void testQuotedField1() {
     assertTrue(e.getMessage().contains("Unmatched quote at 'struct<^`abc'"));
   }
 
-  @Test
-  public void testQuotedField2() {
-    IllegalArgumentException e = assertThrows(IllegalArgumentException.class, () -> {
-      TypeDescription.fromString("struct<``:int>");
-    });
-    assertTrue(e.getMessage().contains("Empty quoted field name at 'struct<``^:int>'"));

Review Comment:
   I'd probably prefer to just ignore it instead of removing it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] cxzl25 commented on pull request #1458: ORC-1403: ORC supports reading empty field name

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.

cxzl25 commented on PR #1458:
URL: https://github.com/apache/orc/pull/1458#issuecomment-1501124249

   > Does Apache Hive support empty field names?
   
   See https://github.com/apache/spark/pull/35253#issuecomment-1321866542.
   
   When `spark.sql.orc.impl=native`, writing to empty field name is supported, but reading is failed, when `spark.sql.orc.impl=hive`, neither writing nor reading empty field name is supported.
   
   Before ORC-529 1.6.0, ORC should support reading empty filed name.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org