Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/01/26 12:44:08 UTC

[GitHub] [orc] pgaref commented on a change in pull request #634: ORC-741: Schema Evolution missing column not handled in the presence of filters

pgaref commented on a change in pull request #634:
URL: https://github.com/apache/orc/pull/634#discussion_r564480514



##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -231,11 +224,9 @@ protected RecordReaderImpl(ReaderImpl fileReader,
     if (options.getPreFilterColumnNames() != null) {
       for (String colName : options.getPreFilterColumnNames()) {
         int expandColId = findColumns(evolution, colName);
+        // If the column is not present in the file then this can be ignored from read.
         if (expandColId != -1) {
           filterColIds.add(expandColId);
-        } else {

Review comment:
       The intention of the original code here was to notify the consumer that the filter cannot be applied, which could lead to wrong results (keep in mind that row filtering is not best-effort the way row-group skipping is: the consumer expects to receive only the rows that actually pass the filter).
   This was also a side effect of the existing code, which made no distinction between a column missing from the schema and a schema-fileReader mismatch, as you very well mentioned above.
   
   However, in the former case we can now end up with pretty generic exceptions from findSubtype that do not reveal the root cause:
   **IllegalArgumentException("Field " + first + " not found in " + current.toString())**
   
   I believe the best thing to do here (even if it's a bit ugly) is to catch the exception and rethrow it with an appropriate message -- a sketch follows.
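   Something along these lines, a minimal sketch against the constructor loop shown in the hunk above -- the exact exception message wording is just a suggestion:

```java
if (options.getPreFilterColumnNames() != null) {
  for (String colName : options.getPreFilterColumnNames()) {
    try {
      int expandColId = findColumns(evolution, colName);
      // If the column is not present in the file then this can be ignored from read.
      if (expandColId != -1) {
        filterColIds.add(expandColId);
      }
    } catch (IllegalArgumentException e) {
      // Surface the real root cause instead of the generic findSubtype
      // message: the filter column is not part of the reader schema, so the
      // filter cannot be applied at all.
      throw new IllegalArgumentException(
          "Filter column " + colName + " not found in the reader schema; "
              + "the filter cannot be applied", e);
    }
  }
}
```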
   

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,

Review comment:
       I believe it is safe to assume that the column should always be part of the readerSchema -- if that's not the case and there is a column/schema mismatch, the caller (Spark/Hive) will be responsible for handling the RuntimeException.
   
   I like this approach because it's cleaner (and distinguishes between the two cases), so +1 for that -- but let's first make sure we document the behavior change in the method doc; a sketch of what that could look like is below.
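   The method body is taken from this PR's hunk; the doc wording itself is only a suggestion:

```java
/**
 * Locate the columnName in the reader schema and map it to the
 * corresponding column id in the file schema.
 *
 * Behavior change: the column must resolve within the reader schema. If it
 * does not, the IllegalArgumentException thrown by findSubtype is
 * propagated to the caller (Spark/Hive) instead of being swallowed and
 * mapped to -1. A return value of -1 now only means the column exists in
 * the reader schema but has no counterpart in the file.
 */
static int findColumns(SchemaEvolution evolution,
                       String columnName) {
  TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
      columnName, evolution.isSchemaEvolutionCaseAware);
  TypeDescription fileColumn = evolution.getFileType(readerColumn);
  return fileColumn == null ? -1 : fileColumn.getId();
}
```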

##########
File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
##########
@@ -104,17 +104,10 @@
    */
   static int findColumns(SchemaEvolution evolution,
                          String columnName) {
-    try {
-      TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
-          columnName, evolution.isSchemaEvolutionCaseAware);
-      TypeDescription fileColumn = evolution.getFileType(readerColumn);
-      return fileColumn == null ? -1 : fileColumn.getId();
-    } catch (IllegalArgumentException e) {
-      if (LOG.isDebugEnabled()){
-        LOG.debug("{}", e.getMessage());
-      }
-      return -1;
-    }
+    TypeDescription readerColumn = evolution.getReaderBaseSchema().findSubtype(
+        columnName, evolution.isSchemaEvolutionCaseAware);
+    TypeDescription fileColumn = evolution.getFileType(readerColumn);
+    return fileColumn == null ? -1 : fileColumn.getId();

Review comment:
       +1 on the above -- let's add a new test or extend testSchemaEvolutionMissingColumn, where the "missing" column is not part of the schema evolution, and assert that the expected exception is thrown. A rough sketch:
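   (The test name, example schemas, SchemaEvolution constructor usage, and assertThrows are illustrative assumptions here, not taken from the existing suite.)

```java
@Test
public void testFindColumnsMissingFromReaderSchema() {
  TypeDescription fileSchema = TypeDescription.fromString("struct<a:int>");
  TypeDescription readerSchema = TypeDescription.fromString("struct<a:int>");
  SchemaEvolution evolution =
      new SchemaEvolution(fileSchema, readerSchema, new Reader.Options());
  // "missing" is not part of the reader schema, so with the try/catch
  // removed findColumns should propagate findSubtype's exception.
  assertThrows(IllegalArgumentException.class,
      () -> RecordReaderImpl.findColumns(evolution, "missing"));
}
```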




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org