You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/01/25 22:45:53 UTC

[GitHub] [orc] pavibhai opened a new pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

pavibhai opened a new pull request #634:
URL: https://github.com/apache/orc/pull/634


   ### What changes were proposed in this pull request?
   The following changes are proposed during the resolution of columns:
   * If a column is not found in the read schema, then fail with an `IllegalArgumentException`
   * If a column is not found in the file, then do not fail but instead treat as `NullReader`
   
   ### Why are the changes needed?
   In the absence of the patch, when the filter columns referred to missing columns in the file we were incorrectly getting an `IllegalArgumentException` instead of handling it as a missing column in the file.
   
   ### How was this patch tested?
   Unit tests were added to verify this behavior
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564666137



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -312,6 +314,26 @@ public void testPositional2() {
         false, addSarg, true);
   }
 
+  private void checkEvolutionPosn(String writerType, String readerType,

Review comment:
       This was primarily to fix the errors in some of the tests that were using positional names




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on pull request #634:
URL: https://github.com/apache/orc/pull/634#issuecomment-767690736


   @pgaref @dongjoon-hyun 
   Thanks for your review comments. I have addressed them, please check and let me know if something is amiss.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564662732



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
     if (options.getPreFilterColumnNames() != null) {
       for (String colName : options.getPreFilterColumnNames()) {
         int expandColId = findColumns(evolution, colName);
+        // If the column is not present in the file then this can be ignored from read.
         if (expandColId != -1) {
           filterColIds.add(expandColId);
-        } else {
-          throw new IllegalArgumentException("Filter could not find column with name: " +
-              colName + " on " + evolution.getReaderBaseSchema());

Review comment:
       Yeah, will add this wrapping into the find method as suggested by @pgaref 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564175246



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -235,35 +235,37 @@ public void testListEvolution() {
   @Test
   public void testPreHive4243CheckEqual() {
     // Expect success on equal schemas
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), false, addSarg, false);
+                   struct(1, "foo", null),
+                   false, addSarg, false);
   }
 
   @Test
   public void testPreHive4243Check() {
     // Expect exception on strict compatibility check
     thrown.expectMessage("HIVE-4243");
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string,_col2:double>",
                    struct(1, "foo"),
                    struct(1, "foo", null), false, addSarg, false);
   }
 
   @Test
   public void testPreHive4243AddColumn() {
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string,_col2:double>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), true, addSarg, false);
+                   struct(1, "foo", null),
+                   true, addSarg, false);

Review comment:
       Let's keep the original form to reduce the patch size.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564480514



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
     if (options.getPreFilterColumnNames() != null) {
       for (String colName : options.getPreFilterColumnNames()) {
         int expandColId = findColumns(evolution, colName);
+        // If the column is not present in the file then this can be ignored from read.
         if (expandColId != -1) {
           filterColIds.add(expandColId);
-        } else {

Review comment:
       The intention of the original code here was to notify the consumer that the filter can not be applied  and could lead to wrong results (keep in mind that row-filtering is not best-effort as row-group skipping and the consumer expects only the rows that actually pass the filter).
   This was also a side-effect of the existing code that had no distinction between a column missing from the schema, and schema-fileReader missmatch as you very well mentioned above.
   
   However, (in the former case) we can now end up with pretty generic exceptions from findSubtype that do not reveal the root cause:
   **IllegalArgumentException("Field " + first + " not found in " + current.toString())**
   
   I believe the best thing to do here (even if its a bit ugly) is capture the Exception and rethrow with an appropriate message.
   

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,

Review comment:
       I believe it is safe to assume that the column should always be part of the readerSchema -- if thats not the case and there is a column/schema missmatch the caller (Spark/Hive) will be responsible for handling the RuntimeException.
   
   I like this approach because its cleaner (and distinguishes between the two cases) so +1 for that -- but lets first make sure we add the behavior change as part of the method doc.

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,
                          String columnName) {
-    try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
-          columnName, evolution.isSchemaEvolutionCaseAware);
-      TypeDescription fileColumn = evolution.getFileType(readerColumn);
-      return fileColumn == null ? -1 : fileColumn.getId();
-    } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
-    }
+    TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+        columnName, evolution.isSchemaEvolutionCaseAware);
+    TypeDescription fileColumn = evolution.getFileType(readerColumn);
+    return fileColumn == null ? -1 : fileColumn.getId();

Review comment:
       +1 on the above -- lets add a new test or extend testSchemaEvolutionMissingColumn -- where the "missing" column is not part of schema evolution and assert for expected exception 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
omalley commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564726617



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -110,10 +111,9 @@ static int findColumns(SchemaEvolution evolution,
       TypeDescription fileColumn = evolution.getFileType(readerColumn);
       return fileColumn == null ? -1 : fileColumn.getId();
     } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
+      throw new IllegalArgumentException("Filter could not find column with name: " +

Review comment:
       @pgaref what are you asking for?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564175310



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -235,35 +235,37 @@ public void testListEvolution() {
   @Test
   public void testPreHive4243CheckEqual() {
     // Expect success on equal schemas
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), false, addSarg, false);
+                   struct(1, "foo", null),
+                   false, addSarg, false);

Review comment:
       Let's keep the original form to reduce the patch size.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564214073



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,

Review comment:
       For this one, we hit many issues at Apache ORC 1.6.x due to the breaking API and behavior change.
   Although this is an implementation internal class, Hive and Spark's hive module assumes old behavior.
   So, if we need this change, I'd like to recommend to have a configuration to support both behavior.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564662732



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
     if (options.getPreFilterColumnNames() != null) {
       for (String colName : options.getPreFilterColumnNames()) {
         int expandColId = findColumns(evolution, colName);
+        // If the column is not present in the file then this can be ignored from read.
         if (expandColId != -1) {
           filterColIds.add(expandColId);
-        } else {
-          throw new IllegalArgumentException("Filter could not find column with name: " +
-              colName + " on " + evolution.getReaderBaseSchema());

Review comment:
       Added this wrapping into the find method as suggested by @pgaref 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] omalley closed pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
omalley closed pull request #634:
URL: https://github.com/apache/orc/pull/634


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564733567



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -110,10 +111,9 @@ static int findColumns(SchemaEvolution evolution,
       TypeDescription fileColumn = evolution.getFileType(readerColumn);
       return fileColumn == null ? -1 : fileColumn.getId();
     } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
+      throw new IllegalArgumentException("Filter could not find column with name: " +

Review comment:
       Looked like the indentation was off here (on my IDE at least) but just realised it was just split in 2 lines instead of 3 -- ignore the comment please 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564690985



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -110,10 +111,9 @@ static int findColumns(SchemaEvolution evolution,
       TypeDescription fileColumn = evolution.getFileType(readerColumn);
       return fileColumn == null ? -1 : fileColumn.getId();
     } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
+      throw new IllegalArgumentException("Filter could not find column with name: " +

Review comment:
       Style is a bit off here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564692139



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -232,10 +233,23 @@ public void testListEvolution() {
                    addSarg);
   }
 
+  @Test
+  public void testMissingColumnFromReaderSchema() {
+    // Expect failure if the column is missing from the reader schema, as column a that is added
+    // by the SArg is missing from the reader schema
+    Assert.assertThrows("Field a not found in", IllegalArgumentException.class,
+                        () -> checkEvolution("struct<b:int,c:string>",

Review comment:
       style?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564675626



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,

Review comment:
       @dongjoon-hyun My guess is that there are some unit tests that need fixing. I had to do the same here in ORC that is the reason for that new `checkEvolutionPosn` method in the test class.
   
   @pgaref Updated the javadoc for the method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #634:
URL: https://github.com/apache/orc/pull/634#issuecomment-767699690


   > @pgaref @dongjoon-hyun
   > Thanks for your review comments. I have addressed them, please check and let me know if something is amiss.
   
   Thanks for the prompt reply @pavibhai  -- Latest PR is almost there, please check the two style comments above


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564176012



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -312,6 +314,26 @@ public void testPositional2() {
         false, addSarg, true);
   }
 
+  private void checkEvolutionPosn(String writerType, String readerType,

Review comment:
       Ya. I also don't have a better option. +1 for adding this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564677190



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
     if (options.getPreFilterColumnNames() != null) {
       for (String colName : options.getPreFilterColumnNames()) {
         int expandColId = findColumns(evolution, colName);
+        // If the column is not present in the file then this can be ignored from read.
         if (expandColId != -1) {
           filterColIds.add(expandColId);
-        } else {

Review comment:
       Agreed. Wrapped the exception in the method as suggested.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564171846



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
     if (options.getPreFilterColumnNames() != null) {
       for (String colName : options.getPreFilterColumnNames()) {
         int expandColId = findColumns(evolution, colName);
+        // If the column is not present in the file then this can be ignored from read.
         if (expandColId != -1) {
           filterColIds.add(expandColId);
-        } else {
-          throw new IllegalArgumentException("Filter could not find column with name: " +
-              colName + " on " + evolution.getReaderBaseSchema());

Review comment:
       Could you confirm that what is the new error message? I understand that the thrown exception is the same, `IllegalArgumentException`, but I'm not sure about the information on the error message. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564295072



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,

Review comment:
       The behavior change we did to this API is as follows:
   * If the column is present is missing in readSchema then throw an exception of missing column
   * If the column is missing in file but present in readSchema then we see this as missing column in file handled via NullReader
   
   Wanted to confirm that the ask is to not throw an exception when Hive/Spark use a column name that is missing in the readSchema. I might be missing the scenario where this is the case, I was under the impression that column references should always be valid within the readSchema.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564172470



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -312,6 +314,26 @@ public void testPositional2() {
         false, addSarg, true);
   }
 
+  private void checkEvolutionPosn(String writerType, String readerType,

Review comment:
       This function seems to be duplicated mostly with `checkEvolution`. Is the only difference is the column name, `_col0` and `a`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564170608



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,
                          String columnName) {
-    try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
-          columnName, evolution.isSchemaEvolutionCaseAware);
-      TypeDescription fileColumn = evolution.getFileType(readerColumn);
-      return fileColumn == null ? -1 : fileColumn.getId();
-    } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
-    }
+    TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+        columnName, evolution.isSchemaEvolutionCaseAware);
+    TypeDescription fileColumn = evolution.getFileType(readerColumn);
+    return fileColumn == null ? -1 : fileColumn.getId();

Review comment:
       This PR passed without this change. Shall we have a test coverage for this change?
   And, could you update the function document by adding 
   ```java
   @throws IllegalArgumentException ....(description)
   ```

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,
                          String columnName) {
-    try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
-          columnName, evolution.isSchemaEvolutionCaseAware);
-      TypeDescription fileColumn = evolution.getFileType(readerColumn);
-      return fileColumn == null ? -1 : fileColumn.getId();
-    } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
-    }
+    TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+        columnName, evolution.isSchemaEvolutionCaseAware);
+    TypeDescription fileColumn = evolution.getFileType(readerColumn);
+    return fileColumn == null ? -1 : fileColumn.getId();

Review comment:
       This PR passed without this change. Shall we have a test coverage for this change?
   And, could you update the function document by adding the exception?
   ```java
   @throws IllegalArgumentException ....(description)
   ```

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
     if (options.getPreFilterColumnNames() != null) {
       for (String colName : options.getPreFilterColumnNames()) {
         int expandColId = findColumns(evolution, colName);
+        // If the column is not present in the file then this can be ignored from read.
         if (expandColId != -1) {
           filterColIds.add(expandColId);
-        } else {
-          throw new IllegalArgumentException("Filter could not find column with name: " +
-              colName + " on " + evolution.getReaderBaseSchema());

Review comment:
       Could you confirm that what is the new error message? I understand that the thrown exception is the same, `IllegalArgumentException`, but I'm not sure about the information on the error message. 

##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -312,6 +314,26 @@ public void testPositional2() {
         false, addSarg, true);
   }
 
+  private void checkEvolutionPosn(String writerType, String readerType,

Review comment:
       This function seems to be duplicated mostly with `checkEvolution`. Is the only difference is the column name, `_col0` and `a`?

##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -235,35 +235,37 @@ public void testListEvolution() {
   @Test
   public void testPreHive4243CheckEqual() {
     // Expect success on equal schemas
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), false, addSarg, false);
+                   struct(1, "foo", null),
+                   false, addSarg, false);
   }
 
   @Test
   public void testPreHive4243Check() {
     // Expect exception on strict compatibility check
     thrown.expectMessage("HIVE-4243");
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string,_col2:double>",
                    struct(1, "foo"),
                    struct(1, "foo", null), false, addSarg, false);
   }
 
   @Test
   public void testPreHive4243AddColumn() {
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string,_col2:double>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), true, addSarg, false);
+                   struct(1, "foo", null),
+                   true, addSarg, false);

Review comment:
       Let's keep the original form to reduce the patch size.

##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -235,35 +235,37 @@ public void testListEvolution() {
   @Test
   public void testPreHive4243CheckEqual() {
     // Expect success on equal schemas
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), false, addSarg, false);
+                   struct(1, "foo", null),
+                   false, addSarg, false);

Review comment:
       Let's keep the original form to reduce the patch size.

##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -312,6 +314,26 @@ public void testPositional2() {
         false, addSarg, true);
   }
 
+  private void checkEvolutionPosn(String writerType, String readerType,

Review comment:
       Ya. I also don't have a better option. +1 for adding this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564660301



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,
                          String columnName) {
-    try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
-          columnName, evolution.isSchemaEvolutionCaseAware);
-      TypeDescription fileColumn = evolution.getFileType(readerColumn);
-      return fileColumn == null ? -1 : fileColumn.getId();
-    } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
-    }
+    TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+        columnName, evolution.isSchemaEvolutionCaseAware);
+    TypeDescription fileColumn = evolution.getFileType(readerColumn);
+    return fileColumn == null ? -1 : fileColumn.getId();

Review comment:
       Added the test




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564170608



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,
                          String columnName) {
-    try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
-          columnName, evolution.isSchemaEvolutionCaseAware);
-      TypeDescription fileColumn = evolution.getFileType(readerColumn);
-      return fileColumn == null ? -1 : fileColumn.getId();
-    } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
-    }
+    TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+        columnName, evolution.isSchemaEvolutionCaseAware);
+    TypeDescription fileColumn = evolution.getFileType(readerColumn);
+    return fileColumn == null ? -1 : fileColumn.getId();

Review comment:
       This PR passed without this change. Shall we have a test coverage for this change?
   And, could you update the function document by adding 
   ```java
   @throws IllegalArgumentException ....(description)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on pull request #634:
URL: https://github.com/apache/orc/pull/634#issuecomment-767353536


   > Thank you for making a PR, @pavibhai . I left a few minor comments. Could you check them?
   
   Thanks for the feedback @dongjoon-hyun will go through the same and update.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pavibhai commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pavibhai commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564668912



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -235,35 +235,37 @@ public void testListEvolution() {
   @Test
   public void testPreHive4243CheckEqual() {
     // Expect success on equal schemas
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), false, addSarg, false);
+                   struct(1, "foo", null),
+                   false, addSarg, false);
   }
 
   @Test
   public void testPreHive4243Check() {
     // Expect exception on strict compatibility check
     thrown.expectMessage("HIVE-4243");
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string,_col2:double>",
                    struct(1, "foo"),
                    struct(1, "foo", null), false, addSarg, false);
   }
 
   @Test
   public void testPreHive4243AddColumn() {
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string,_col2:double>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), true, addSarg, false);
+                   struct(1, "foo", null),
+                   true, addSarg, false);

Review comment:
       Done

##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -235,35 +235,37 @@ public void testListEvolution() {
   @Test
   public void testPreHive4243CheckEqual() {
     // Expect success on equal schemas
-    checkEvolution("struct<_col0:int,_col1:string>",
+    checkEvolutionPosn("struct<_col0:int,_col1:string>",
                    "struct<_col0:int,_col1:string>",
                    struct(1, "foo"),
-                   struct(1, "foo", null), false, addSarg, false);
+                   struct(1, "foo", null),
+                   false, addSarg, false);

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564170608



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,
                          String columnName) {
-    try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
-          columnName, evolution.isSchemaEvolutionCaseAware);
-      TypeDescription fileColumn = evolution.getFileType(readerColumn);
-      return fileColumn == null ? -1 : fileColumn.getId();
-    } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
-    }
+    TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+        columnName, evolution.isSchemaEvolutionCaseAware);
+    TypeDescription fileColumn = evolution.getFileType(readerColumn);
+    return fileColumn == null ? -1 : fileColumn.getId();

Review comment:
       This PR passed without this change. Shall we have a test coverage for this change?
   And, could you update the function document by adding the exception?
   ```java
   @throws IllegalArgumentException ....(description)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564692139



##########
File path: java/mapreduce/src/test/org/apache/orc/mapred/TestOrcFileEvolution.java
##########
@@ -232,10 +233,23 @@ public void testListEvolution() {
                    addSarg);
   }
 
+  @Test
+  public void testMissingColumnFromReaderSchema() {
+    // Expect failure if the column is missing from the reader schema, as column a that is added
+    // by the SArg is missing from the reader schema
+    Assert.assertThrows("Field a not found in", IllegalArgumentException.class,
+                        () -> checkEvolution("struct<b:int,c:string>",

Review comment:
       style?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org