You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/05/02 04:13:05 UTC

[GitHub] [drill] cgivre opened a new pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

cgivre opened a new pull request #2211:
URL: https://github.com/apache/drill/pull/2211


   # [DRILL-7912](https://issues.apache.org/jira/browse/DRILL-7912): Add Sheet Names to Excel Reader
   
   ## Description
   Currently, there was no way to determine what sheets are available in a given Excel file.  This minor PR adds a new implicit metadata field called `_sheets` which a user can use to determine what sheets are available.  By definition, Excel files must have at least one sheet.
   
   Additionally, I updated the streaming reader to the latest version.
   
   ## Documentation
   Updated `README` to reflect the additional field, but to determine the available sheets, a user could execute the query below:
   
   ```sql
   SELECT FLATTEN(_sheets) AS sheetNames 
   FROM <Excel File>
   ```
   
   ## Testing
   Added additional unit test. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on a change in pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
luocooong commented on a change in pull request #2211:
URL: https://github.com/apache/drill/pull/2211#discussion_r624658770



##########
File path: contrib/format-excel/pom.xml
##########
@@ -67,7 +67,7 @@
     <dependency>
       <groupId>com.github.pjfanning</groupId>
       <artifactId>excel-streaming-reader</artifactId>
-      <version>3.0.3</version>
+      <version>3.0.4</version>

Review comment:
       It‘s always good to use the latest version. Is possible to record it on the `Description` of PR  and `Message` of Commits?

##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -566,6 +598,34 @@ private void writeMetadata() {
         metadataColumnWriters.get(index).setTimestamp(Instant.ofEpochMilli(timeValue.getTime()));
       }
     }
+
+    // Write the sheet names.  Since this is the only list field
+    int listIndex = IMPLICIT_STRING_COLUMN.values().length + IMPLICIT_TIMESTAMP_COLUMN.values().length;
+    String sheetColumnName = IMPLICIT_LIST_COLUMN.SHEETS.fieldName;
+    List<String> sheetNames = listMetadata.get(sheetColumnName);
+
+    if (sheetNameWriter == null) {
+      int sheetColumnIndex = rowWriter.tupleSchema().index(IMPLICIT_LIST_COLUMN.SHEETS.getFieldName());
+      if (sheetColumnIndex == -1) {
+        ColumnMetadata colSchema = MetadataUtils.newScalar(sheetColumnName, MinorType.VARCHAR, DataMode.REPEATED);
+        colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+        listIndex = rowWriter.addColumn(colSchema);
+      }
+      sheetNameWriter = rowWriter.column(listIndex).array().scalar();
+    }
+
+    for (String sheetName : sheetNames) {
+      sheetNameWriter.setString(sheetName);
+    }
+  }
+
+  private List<String> getSheetNames() {
+    List<String> sheets = new ArrayList<>();
+    int sheetCount = streamingWorkbook.getNumberOfSheets();
+    for (int i = 0; i < sheetCount; i++) {
+      sheets.add(streamingWorkbook.getSheetName(i));

Review comment:
       Is it worked fine if the `getSheetName()`is a blank?

##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -566,6 +598,34 @@ private void writeMetadata() {
         metadataColumnWriters.get(index).setTimestamp(Instant.ofEpochMilli(timeValue.getTime()));
       }
     }
+
+    // Write the sheet names.  Since this is the only list field
+    int listIndex = IMPLICIT_STRING_COLUMN.values().length + IMPLICIT_TIMESTAMP_COLUMN.values().length;
+    String sheetColumnName = IMPLICIT_LIST_COLUMN.SHEETS.fieldName;
+    List<String> sheetNames = listMetadata.get(sheetColumnName);
+
+    if (sheetNameWriter == null) {
+      int sheetColumnIndex = rowWriter.tupleSchema().index(IMPLICIT_LIST_COLUMN.SHEETS.getFieldName());
+      if (sheetColumnIndex == -1) {
+        ColumnMetadata colSchema = MetadataUtils.newScalar(sheetColumnName, MinorType.VARCHAR, DataMode.REPEATED);
+        colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);

Review comment:
       Great feature. Does this function also apply to other format plugins (`EXCLUDE_FROM_WILDCARD ` set to true)?

##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -116,6 +117,23 @@ public String getFieldName() {
     }
   }
 
+  private enum IMPLICIT_LIST_COLUMN {
+    /**
+     * A list of the available sheets in the file.
+     */
+    SHEETS("_sheets");
+
+    private final String fieldName;
+
+    IMPLICIT_LIST_COLUMN(String fieldName) {
+      this.fieldName = fieldName;
+    }
+
+    public String getFieldName() {

Review comment:
       It can simply declare  to `String getFieldName()`. the code of above is same to this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2211:
URL: https://github.com/apache/drill/pull/2211#discussion_r624711391



##########
File path: contrib/format-excel/pom.xml
##########
@@ -67,7 +67,7 @@
     <dependency>
       <groupId>com.github.pjfanning</groupId>
       <artifactId>excel-streaming-reader</artifactId>
-      <version>3.0.3</version>
+      <version>3.0.4</version>

Review comment:
       It is noted in the PR description.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre merged pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre merged pull request #2211:
URL: https://github.com/apache/drill/pull/2211


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2211:
URL: https://github.com/apache/drill/pull/2211#discussion_r624710583



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -116,6 +117,23 @@ public String getFieldName() {
     }
   }
 
+  private enum IMPLICIT_LIST_COLUMN {
+    /**
+     * A list of the available sheets in the file.
+     */
+    SHEETS("_sheets");
+
+    private final String fieldName;
+
+    IMPLICIT_LIST_COLUMN(String fieldName) {
+      this.fieldName = fieldName;
+    }
+
+    public String getFieldName() {

Review comment:
       I don't believe that will work in this case.  The line `this.fieldName = fieldName` assigns the variable.  If you call `getFieldName()` when that is not assigned, you will get some error. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2211:
URL: https://github.com/apache/drill/pull/2211#issuecomment-830825978


   Hi @luocooong 
   I addressed your comments.  Could you please take another look?
   Thx!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2211:
URL: https://github.com/apache/drill/pull/2211#discussion_r624710325



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -566,6 +598,34 @@ private void writeMetadata() {
         metadataColumnWriters.get(index).setTimestamp(Instant.ofEpochMilli(timeValue.getTime()));
       }
     }
+
+    // Write the sheet names.  Since this is the only list field
+    int listIndex = IMPLICIT_STRING_COLUMN.values().length + IMPLICIT_TIMESTAMP_COLUMN.values().length;
+    String sheetColumnName = IMPLICIT_LIST_COLUMN.SHEETS.fieldName;
+    List<String> sheetNames = listMetadata.get(sheetColumnName);
+
+    if (sheetNameWriter == null) {
+      int sheetColumnIndex = rowWriter.tupleSchema().index(IMPLICIT_LIST_COLUMN.SHEETS.getFieldName());
+      if (sheetColumnIndex == -1) {
+        ColumnMetadata colSchema = MetadataUtils.newScalar(sheetColumnName, MinorType.VARCHAR, DataMode.REPEATED);
+        colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+        listIndex = rowWriter.addColumn(colSchema);
+      }
+      sheetNameWriter = rowWriter.column(listIndex).array().scalar();
+    }
+
+    for (String sheetName : sheetNames) {
+      sheetNameWriter.setString(sheetName);
+    }
+  }
+
+  private List<String> getSheetNames() {
+    List<String> sheets = new ArrayList<>();
+    int sheetCount = streamingWorkbook.getNumberOfSheets();
+    for (int i = 0; i < sheetCount; i++) {
+      sheets.add(streamingWorkbook.getSheetName(i));

Review comment:
       Excel populates the names by default as `Sheet 1` etc.  I don't think Excel will let you have a `null` name, and if it is blank you'd just get `""` which will work. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2211:
URL: https://github.com/apache/drill/pull/2211#discussion_r624710376



##########
File path: contrib/format-excel/pom.xml
##########
@@ -67,7 +67,7 @@
     <dependency>
       <groupId>com.github.pjfanning</groupId>
       <artifactId>excel-streaming-reader</artifactId>
-      <version>3.0.3</version>
+      <version>3.0.4</version>

Review comment:
       Sure.  I should have broken that up into multiple commits. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2211:
URL: https://github.com/apache/drill/pull/2211#discussion_r624711472



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -116,6 +117,23 @@ public String getFieldName() {
     }
   }
 
+  private enum IMPLICIT_LIST_COLUMN {
+    /**
+     * A list of the available sheets in the file.
+     */
+    SHEETS("_sheets");
+
+    private final String fieldName;
+
+    IMPLICIT_LIST_COLUMN(String fieldName) {
+      this.fieldName = fieldName;
+    }
+
+    public String getFieldName() {

Review comment:
       I fixed this.  Really that function should be called something else, so I made it `void` and renamed the function. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2211: DRILL-7912: Add Sheet Names to Excel Reader

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2211:
URL: https://github.com/apache/drill/pull/2211#discussion_r624710173



##########
File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
##########
@@ -566,6 +598,34 @@ private void writeMetadata() {
         metadataColumnWriters.get(index).setTimestamp(Instant.ofEpochMilli(timeValue.getTime()));
       }
     }
+
+    // Write the sheet names.  Since this is the only list field
+    int listIndex = IMPLICIT_STRING_COLUMN.values().length + IMPLICIT_TIMESTAMP_COLUMN.values().length;
+    String sheetColumnName = IMPLICIT_LIST_COLUMN.SHEETS.fieldName;
+    List<String> sheetNames = listMetadata.get(sheetColumnName);
+
+    if (sheetNameWriter == null) {
+      int sheetColumnIndex = rowWriter.tupleSchema().index(IMPLICIT_LIST_COLUMN.SHEETS.getFieldName());
+      if (sheetColumnIndex == -1) {
+        ColumnMetadata colSchema = MetadataUtils.newScalar(sheetColumnName, MinorType.VARCHAR, DataMode.REPEATED);
+        colSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);

Review comment:
       > Great feature. Does this function also apply to other format plugins (`EXCLUDE_FROM_WILDCARD ` set to true)?
   
   The `EXCLUDE_FROM_WILDCARD` feature is meant for metadata fields or other info that you'd want to access to, but you'd also want the user to explicitly ask for.  In this case, the sheet names... 
   
   Some other format plugins, such as the log reader, have some metadata fields like this.  I think there may be a few in various storage plugins as well. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org