You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/10/28 13:39:49 UTC

[GitHub] [drill] cgivre opened a new pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

cgivre opened a new pull request #2354:
URL: https://github.com/apache/drill/pull/2354


   # [DRILL-8022](https://issues.apache.org/jira/browse/DRILL-8022): Add Provided Schema Support for Excel Reader
   
   ## Description
   This PR adds support for provided schema for the Excel format plugin. 
   
   ## Documentation
   The provided schema functionality is already documented in the Drill docs, but this PR applies it to the Excel reader.
   
   An example query:
   ```sql
   SELECT col1, col2, col3 
   FROM table(dfs.`excel/schema_provisioning.xlsx`  
   (schema => 'inline=(`col1` INTEGER, `col2` FLOAT, `col3` VARCHAR)'))
   ```
   
   Previously, the Excel reader only supported dates, and doubles as data types.  This PR adds support for INT data types but only via provided schema.  
   
   One limitation is that while Drill supports nested data, this PR does not allow a user to provision data types that Excel does not support.  So if a user attempts to create a schema with maps, or lists, this will fail. 
   
   ## Testing
   Added additional unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2354:
URL: https://github.com/apache/drill/pull/2354#issuecomment-955808543


   @luocooong Thanks for the review.  I addressed your review comments and squashed commits.  Just as an FYI, we changed the settings so that whoever merges the PR can actually do the squash rather than the contributor. 
   
   I should mention that this PR actually has a little bit of an Easter egg... It has a new function in the test utils to make testing functions that produce dates a little easier. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2354:
URL: https://github.com/apache/drill/pull/2354#issuecomment-955857715


   > @cgivre I have an idea : Is it possible to add a label for the `learning reference`? It can be quick to filter the pull request, and provide a good chance for learning and improvement of the pull request(or called contribution) for all the contributors. Break down the complexity of Drill internal.
   
   Great idea!  Done!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on a change in pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
luocooong commented on a change in pull request #2354:
URL: https://github.com/apache/drill/pull/2354#discussion_r739914927



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -363,6 +407,22 @@ private void setFirstRow() {
     }
   }
 
+  /**
+   * This function is used to set the iterator to the first row of actual data.  When a schema is provided,
+   * we can safely skip the header row, and start reading the first row of data.
+   */
+  private void setFirstDataRow() {
+    // Initialize
+    currentRow = rowIterator.next();

Review comment:
       The logic has not changed and is easy to understand. Because at least two iterations are required.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2354:
URL: https://github.com/apache/drill/pull/2354#discussion_r739887830



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -710,6 +785,40 @@ public void load(Cell cell) {
     }
   }
 
+
+  public static class IntStringWriter extends ExcelBatchReader.CellWriter {
+    IntStringWriter(ScalarWriter columnWriter) {
+      super(columnWriter);
+    }
+
+    @Override
+    public void load(Cell cell) {
+      if (cell == null) {
+        columnWriter.setNull();
+      } else {
+        String fieldValue = String.valueOf(cell.getNumericCellValue());
+        columnWriter.setString(fieldValue);
+      }
+    }
+  }
+
+  public static class IntCellWriter extends ExcelBatchReader.CellWriter {
+    IntCellWriter(ScalarWriter columnWriter) {
+      super(columnWriter);
+    }
+
+    @Override
+    public void load(Cell cell) {
+      if (cell == null) {
+        columnWriter.setNull();
+      } else {
+        int fieldNumValue = (int)cell.getNumericCellValue();

Review comment:
       Fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2354:
URL: https://github.com/apache/drill/pull/2354#issuecomment-955170775


   @cgivre Thanks for the contribution. Is it ready for review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on a change in pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
luocooong commented on a change in pull request #2354:
URL: https://github.com/apache/drill/pull/2354#discussion_r739776980



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -710,6 +785,40 @@ public void load(Cell cell) {
     }
   }
 
+
+  public static class IntStringWriter extends ExcelBatchReader.CellWriter {
+    IntStringWriter(ScalarWriter columnWriter) {
+      super(columnWriter);
+    }
+
+    @Override
+    public void load(Cell cell) {
+      if (cell == null) {
+        columnWriter.setNull();
+      } else {
+        String fieldValue = String.valueOf(cell.getNumericCellValue());
+        columnWriter.setString(fieldValue);
+      }
+    }
+  }
+
+  public static class IntCellWriter extends ExcelBatchReader.CellWriter {
+    IntCellWriter(ScalarWriter columnWriter) {
+      super(columnWriter);
+    }
+
+    @Override
+    public void load(Cell cell) {
+      if (cell == null) {
+        columnWriter.setNull();
+      } else {
+        int fieldNumValue = (int)cell.getNumericCellValue();

Review comment:
       Add a space before `cell`.

##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -363,6 +407,22 @@ private void setFirstRow() {
     }
   }
 
+  /**
+   * This function is used to set the iterator to the first row of actual data.  When a schema is provided,
+   * we can safely skip the header row, and start reading the first row of data.
+   */
+  private void setFirstDataRow() {
+    // Initialize
+    currentRow = rowIterator.next();

Review comment:
       Is it possible to do this (?) : 
   ```java
   // 1. Remove this line
   // 2. Add a loop once
   for (int i = 0; i <= rowNumber + 1; i++) {
     currentRow = rowIterator.next();
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2354:
URL: https://github.com/apache/drill/pull/2354#discussion_r739889161



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -363,6 +407,22 @@ private void setFirstRow() {
     }
   }
 
+  /**
+   * This function is used to set the iterator to the first row of actual data.  When a schema is provided,
+   * we can safely skip the header row, and start reading the first row of data.
+   */
+  private void setFirstDataRow() {
+    // Initialize
+    currentRow = rowIterator.next();

Review comment:
       I suppose so.  Not sure what the advantage is though.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2354:
URL: https://github.com/apache/drill/pull/2354#issuecomment-955857050


   @cgivre I have an idea : Is it possible to add a label for the `learning reference`? It can be quick to filter the pull request, and provide a good chance for learning and improvement of the pull request(or called contribution) for all the contributors. Break down the complexity of Drill internal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong merged pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
luocooong merged pull request #2354:
URL: https://github.com/apache/drill/pull/2354


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on pull request #2354: DRILL-8022: Add Provided Schema Support for Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2354:
URL: https://github.com/apache/drill/pull/2354#issuecomment-955606365


   > @cgivre Thanks for the contribution. Is it ready for review?
   
   Hi @luocooong This PR is ready for review.  I restarted the CI and it will likely pass this time. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org