Posted to issues@nifi.apache.org by "dan-s1 (via GitHub)" <gi...@apache.org> on 2023/03/06 19:52:05 UTC

[GitHub] [nifi] dan-s1 opened a new pull request, #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

dan-s1 opened a new pull request, #7016:
URL: https://github.com/apache/nifi/pull/7016

   …. Also refactored to reduce the cognitive complexity in the onTrigger method.
   
   # Summary
   
   [NIFI-10792](https://issues.apache.org/jira/browse/NIFI-10792)
   
   # Tracking
   
   Please complete the following tracking steps prior to pull request creation.
   
   ### Issue Tracking
   
   - [ ] [Apache NiFi Jira](https://issues.apache.org/jira/browse/NIFI) issue created
   
   ### Pull Request Tracking
   
   - [ ] Pull Request title starts with Apache NiFi Jira issue number, such as `NIFI-00000`
   - [ ] Pull Request commit message starts with Apache NiFi Jira issue number, such as `NIFI-00000`
   
   ### Pull Request Formatting
   
   - [ ] Pull Request based on current revision of the `main` branch
   - [ ] Pull Request refers to a feature branch with one commit containing changes
   
   # Verification
   
   Please indicate the verification steps performed prior to pull request creation.
   
   ### Build
   
   - [ ] Build completed using `mvn clean install -P contrib-check`
     - [ ] JDK 11
     - [ ] JDK 17
   
   ### Licensing
   
   - [ ] New dependencies are compatible with the [Apache License 2.0](https://apache.org/licenses/LICENSE-2.0) according to the [License Policy](https://www.apache.org/legal/resolved.html)
   - [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` files
   
   ### Documentation
   
   - [ ] Documentation formatting appears as expected in rendered files
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "exceptionfactory (via GitHub)" <gi...@apache.org>.
exceptionfactory commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1137567815


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -313,7 +334,7 @@ private String getSheetsNotFound(Map<String, Boolean> desiredSheets) {
      * do most of the work of parsing the contents of the Excel sheet
      * and outputs the contents as a (basic) CSV.
      */
-    private static class SheetToCSV {
+    private class SheetToCSV {

Review Comment:
   Thanks @dan-s1, I appreciate the attention to detail!





[GitHub] [nifi] mh013370 commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "mh013370 (via GitHub)" <gi...@apache.org>.
mh013370 commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1460537814

   > @exceptionfactory I did not end up including the unit test I had, as it tested with a 20MB file. I would have thought there should be a unit test to exercise the change I made. Please advise.
   
   Anecdotally, the NiFi mock framework will read the entire FlowFile contents into memory (I discovered this in #6369), so I think you're correct in not including unit tests for larger FlowFiles.




[GitHub] [nifi] dan-s1 commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1133949458


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -199,93 +166,52 @@ public final List<PropertyDescriptor> getSupportedPropertyDescriptors() {
     @Override
     public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
         final FlowFile flowFile = session.get();
-        if ( flowFile == null ) {
+        if (flowFile == null) {
             return;
         }
 
-        final String desiredSheetsDelimited = context.getProperty(DESIRED_SHEETS).evaluateAttributeExpressions(flowFile).getValue();
+        final Map<String, Boolean> desiredSheets = getDesiredSheets(context, flowFile);

Review Comment:
   No, a primitive cannot be used here. The `desiredSheets` Map keeps track of which specified sheets were found (lines 189-196) in order to produce the log message on lines 199-201.





[GitHub] [nifi] exceptionfactory commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "exceptionfactory (via GitHub)" <gi...@apache.org>.
exceptionfactory commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1463978789

   Good point @dan-s1, I'm not sure if the formatting information could be carried through to an Excel Writer, but it would be useful if it can be supported.




[GitHub] [nifi] dan-s1 commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1137540823


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -313,7 +334,7 @@ private String getSheetsNotFound(Map<String, Boolean> desiredSheets) {
      * do most of the work of parsing the contents of the Excel sheet
      * and outputs the contents as a (basic) CSV.
      */
-    private static class SheetToCSV {
+    private class SheetToCSV {

Review Comment:
   @exceptionfactory Shouldn't this class be static?





[GitHub] [nifi] dan-s1 commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1463973568

   @exceptionfactory Thanks! That is a good point regarding the formatting, though I would think you would want an option to preserve formatting for use with an Excel record writer.




[GitHub] [nifi] exceptionfactory closed pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "exceptionfactory (via GitHub)" <gi...@apache.org>.
exceptionfactory closed pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…
URL: https://github.com/apache/nifi/pull/7016




[GitHub] [nifi] dan-s1 commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1133945660


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -299,55 +225,62 @@ public void process(InputStream inputStream) throws IOException {
         }
     }
 
+    private List<Integer> getColumnsToSkip(final ProcessContext context, FlowFile flowFile) {
+        final String[] columnsToSkip = StringUtils.split(context.getProperty(COLUMNS_TO_SKIP)
+                .evaluateAttributeExpressions(flowFile).getValue(), ",");
+
+        if (columnsToSkip != null) {
+            try {
+                return Arrays.stream(columnsToSkip)
+                        .map(columnToSkip -> Integer.parseInt(columnToSkip) - 1)
+                        .collect(Collectors.toList());
+            } catch (NumberFormatException e) {
+                throw new ProcessException("Invalid column in Columns to Skip list.", e);
+            }
+        }
+
+        return new ArrayList<>();
+    }
+
+    private Map<String, Boolean> getDesiredSheets(final ProcessContext context, FlowFile flowFile) {
+        final String desiredSheetsDelimited = context.getProperty(DESIRED_SHEETS).evaluateAttributeExpressions(flowFile).getValue();
+        if (desiredSheetsDelimited != null) {
+            String[] desiredSheets = StringUtils.split(desiredSheetsDelimited, DESIRED_SHEETS_DELIMITER);
+            if(desiredSheets != null) {
+                return Arrays.stream(desiredSheets)
+                        .collect(Collectors.toMap(key -> key, value -> Boolean.FALSE));
+            } else {
+                getLogger().debug("Excel document was parsed but no sheets with the specified desired names were found.");
+            }
+        }
+
+        return new HashMap<>();
+    }
 
     /**
      * Handles an individual Excel sheet from the entire Excel document. Each sheet will result in an individual flowfile.
      *
-     * @param session
-     *  The NiFi ProcessSession instance for the current invocation.
+     * @param session The NiFi ProcessSession instance for the current invocation.
      */
-    private void handleExcelSheet(ProcessSession session, FlowFile originalParentFF, final InputStream sheetInputStream, ExcelSheetReadConfig readConfig,
-                                  CSVFormat csvFormat) throws IOException {
+    private void handleExcelSheet(ProcessSession session, FlowFile originalParentFF, final Sheet sheet, ExcelSheetReadConfig readConfig,
+                                  CSVFormat csvFormat) {
 
         FlowFile ff = session.create(originalParentFF);
+        final SheetToCSV sheetHandler = new SheetToCSV(readConfig, csvFormat);
         try {
-            final DataFormatter formatter = new DataFormatter();
-            final InputSource sheetSource = new InputSource(sheetInputStream);
-
-            final SheetToCSV sheetHandler = new SheetToCSV(readConfig, csvFormat);
-
-            final XMLReader parser = SAXHelper.newXMLReader();
-
-            //If Value Formatting is set to false then don't pass in the styles table.
-            // This will cause the XSSF Handler to return the raw value instead of the formatted one.
-            final StylesTable sst = readConfig.getFormatValues()?readConfig.getStyles():null;
-
-            final XSSFSheetXMLHandler handler = new XSSFSheetXMLHandler(
-                    sst, null, readConfig.getSharedStringsTable(), sheetHandler, formatter, false);
-
-            parser.setContentHandler(handler);
-
-            ff = session.write(ff, new OutputStreamCallback() {
-                @Override
-                public void process(OutputStream out) throws IOException {
-                    PrintStream outPrint = new PrintStream(out, false, StandardCharsets.UTF_8.name());
-                    sheetHandler.setOutput(outPrint);
-
-                    try {
-                        parser.parse(sheetSource);
-
-                        sheetInputStream.close();
-
-                        sheetHandler.close();
-                        outPrint.close();
-                    } catch (SAXException se) {
-                        getLogger().error("Error occurred while processing Excel sheet {}", new Object[]{readConfig.getSheetName()}, se);
-                    }
-                }
+            ff = session.write(ff, out -> {
+                PrintStream outPrint = new PrintStream(out, false, StandardCharsets.UTF_8);
+                sheetHandler.setOutput(outPrint);
+                sheet.forEach(row -> {
+                    sheetHandler.startRow(row.getRowNum());
+                    row.forEach(sheetHandler::cell);
+                    sheetHandler.endRow();
+                });
+                sheetHandler.close();

Review Comment:
   The `outPrint.close()` call is no longer needed: the `PrintStream` held in `outPrint` is now closed in the sheetHandler's `close` method.
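
A minimal sketch of the ownership pattern described in this comment, assuming the handler's `close()` simply closes the `PrintStream` it was given. `Handler`, `TrackingStream`, and `writeRows` are hypothetical names used for illustration, not the actual `SheetToCSV` code. Closing a `PrintStream` flushes it and closes the underlying `OutputStream`, so a second `outPrint.close()` would be redundant:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

class HandlerCloseSketch {

    // Test double that records whether close() reached the underlying stream.
    static class TrackingStream extends ByteArrayOutputStream {
        boolean closed;

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Hypothetical stand-in for SheetToCSV: it receives the PrintStream and closes it itself.
    static class Handler {
        private PrintStream output;

        void setOutput(PrintStream output) {
            this.output = output;
        }

        void row(String line) {
            output.print(line + "\n");
        }

        void close() {
            output.close(); // flushes, then closes the wrapped OutputStream
        }
    }

    // Mirrors the shape of the session.write callback: wrap, hand over, let the handler close.
    static void writeRows(TrackingStream sink, String... rows) {
        PrintStream outPrint = new PrintStream(sink, false, StandardCharsets.UTF_8);
        Handler handler = new Handler();
        handler.setOutput(outPrint);
        for (String row : rows) {
            handler.row(row);
        }
        handler.close(); // no separate outPrint.close() needed
    }

    public static void main(String[] args) {
        TrackingStream sink = new TrackingStream();
        writeRows(sink, "a,b", "c,d");
        System.out.println("underlying stream closed: " + sink.closed);
    }
}
```

The trade-off is that the handler now owns the stream's lifecycle, so callers should not reuse `outPrint` after `handler.close()`.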





[GitHub] [nifi] dan-s1 commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1133949458


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -199,93 +166,52 @@ public final List<PropertyDescriptor> getSupportedPropertyDescriptors() {
     @Override
     public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
         final FlowFile flowFile = session.get();
-        if ( flowFile == null ) {
+        if (flowFile == null) {
             return;
         }
 
-        final String desiredSheetsDelimited = context.getProperty(DESIRED_SHEETS).evaluateAttributeExpressions(flowFile).getValue();
+        final Map<String, Boolean> desiredSheets = getDesiredSheets(context, flowFile);

Review Comment:
   No, a primitive cannot be used here. The `desiredSheets` Map keeps track of which specified sheets were found (lines 189-196).





[GitHub] [nifi] exceptionfactory commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "exceptionfactory (via GitHub)" <gi...@apache.org>.
exceptionfactory commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1460568600

   Thanks for the comment @mh013370, this issue is similar to the size limits for Tar files resolved in #6369.
   
   @dan-s1, Although unit tests are the preferred way to confirm expected behavior, unit tests are not optimal for testing these kinds of scenarios. For this particular situation, confirming existing functionality is good, and runtime verification is better than introducing large files or long-running tests into the repository.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi] dan-s1 commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1137565949


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -313,7 +334,7 @@ private String getSheetsNotFound(Map<String, Boolean> desiredSheets) {
      * do most of the work of parsing the contents of the Excel sheet
      * and outputs the contents as a (basic) CSV.
      */
-    private static class SheetToCSV {
+    private class SheetToCSV {

Review Comment:
   @exceptionfactory I did not realize that. The way you have it then is fine. My motto: keep it simple :)





[GitHub] [nifi] dan-s1 commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1460523830

   @exceptionfactory I did not end up including the unit test I had, as it tested with a 20MB file. I would have thought there should be a unit test to exercise the change I made. Please advise.




[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "exceptionfactory (via GitHub)" <gi...@apache.org>.
exceptionfactory commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1137551740


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -313,7 +334,7 @@ private String getSheetsNotFound(Map<String, Boolean> desiredSheets) {
      * do most of the work of parsing the contents of the Excel sheet
      * and outputs the contents as a (basic) CSV.
      */
-    private static class SheetToCSV {
+    private class SheetToCSV {

Review Comment:
   That was a good change, but there was an `e.printStackTrace()` in a catch block that I changed to call `getLogger().warn()`, requiring the class to be non-static. The other option is to pass the ComponentLog reference to the SheetToCSV class, but since the class was previously non-static, this seemed like a less impactful change. What do you think?
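
To illustrate the two options mentioned above, here is a minimal sketch, assuming a simplified processor whose `warn` method stands in for `getLogger().warn()`. All class and method names below are hypothetical and are not taken from `ConvertExcelToCSVProcessor`. An inner (non-static) class can call instance methods of its enclosing object directly, while a static nested class has no enclosing instance and must be handed the logger explicitly:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class NestedClassLoggerSketch {
    private final List<String> warnings = new ArrayList<>();

    // Instance method standing in for a processor's getLogger().warn(...).
    void warn(String message) {
        warnings.add(message);
    }

    List<String> warnings() {
        return warnings;
    }

    // Option 1: inner class. It carries an implicit reference to the enclosing
    // instance, so it can call warn(...) without any extra wiring.
    class InnerHandler {
        void fail() {
            warn("inner: could not write row");
        }
    }

    // Option 2: static nested class. There is no enclosing instance, so the
    // logger (here a Consumer<String>) has to be passed in.
    static class StaticHandler {
        private final Consumer<String> log;

        StaticHandler(Consumer<String> log) {
            this.log = log;
        }

        void fail() {
            log.accept("static: could not write row");
        }
    }

    public static void main(String[] args) {
        NestedClassLoggerSketch processor = new NestedClassLoggerSketch();
        processor.new InnerHandler().fail();
        new StaticHandler(processor::warn).fail();
        processor.warnings().forEach(System.out::println);
    }
}
```

The inner-class form needs fewer changes, at the cost of each handler holding a reference to its enclosing instance; the static form makes the dependency explicit in the constructor.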





[GitHub] [nifi] dan-s1 commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1133949458


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -199,93 +166,52 @@ public final List<PropertyDescriptor> getSupportedPropertyDescriptors() {
     @Override
     public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
         final FlowFile flowFile = session.get();
-        if ( flowFile == null ) {
+        if (flowFile == null) {
             return;
         }
 
-        final String desiredSheetsDelimited = context.getProperty(DESIRED_SHEETS).evaluateAttributeExpressions(flowFile).getValue();
+        final Map<String, Boolean> desiredSheets = getDesiredSheets(context, flowFile);

Review Comment:
   No, a primitive cannot be used here. The `desiredSheets` Map keeps track of which specified sheets were found (lines 189-196) in order to produce the log message on lines 199-201, as part of the improvement done for [NIFI-8005](https://issues.apache.org/jira/browse/NIFI-8005).
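
A rough sketch of the bookkeeping described here, assuming a comma delimiter and hypothetical helper names (`toDesiredSheets`, `markFound`, `sheetsNotFound`); it is not the processor's actual code. Each requested sheet name starts out mapped to `Boolean.FALSE` and is flipped when a matching sheet is seen, so the names still mapped to `FALSE` at the end can be reported. Note also that Java generics only accept reference types, so `Map<String, boolean>` would not compile:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class DesiredSheetsSketch {

    // Every requested sheet name starts out as "not found".
    // A TreeMap is used here only to keep the output order deterministic.
    static Map<String, Boolean> toDesiredSheets(String delimited) {
        Map<String, Boolean> desired = new TreeMap<>();
        if (delimited != null) {
            for (String name : delimited.split(",")) {
                if (!name.isEmpty()) {
                    desired.put(name, Boolean.FALSE);
                }
            }
        }
        return desired;
    }

    // Flip the flag when a sheet in the workbook matches a requested name;
    // names that were never requested are ignored.
    static void markFound(Map<String, Boolean> desired, String sheetName) {
        desired.computeIfPresent(sheetName, (key, value) -> Boolean.TRUE);
    }

    // Names still mapped to FALSE were requested but never seen.
    static String sheetsNotFound(Map<String, Boolean> desired) {
        return desired.entrySet().stream()
                .filter(entry -> !entry.getValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Map<String, Boolean> desired = toDesiredSheets("Sheet1,Sheet2,Sheet3");
        for (String sheetName : List.of("Sheet1", "Sheet3", "Other")) {
            markFound(desired, sheetName);
        }
        System.out.println("Sheets not found: " + sheetsNotFound(desired));
    }
}
```

A single `boolean` could only say whether anything was missing; the per-name flags are what allow the log message to list which sheets were not found.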





[GitHub] [nifi] dan-s1 commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1463889666

   @exceptionfactory When you get a chance, can you look over this PR? I was hoping I could reuse the Excel part of this code for [NIFI-11167](https://issues.apache.org/jira/browse/NIFI-11167). I noticed a drawback with [fastexcel-reader](https://github.com/dhatim/fastexcel#fastexcel-reader), namely that it
   >  discards styles, graphs, and many other stuff
   
   Therefore I would like to use [excel-streaming-reader](https://github.com/pjfanning/excel-streaming-reader) as I did in this PR.




[GitHub] [nifi] exceptionfactory commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "exceptionfactory (via GitHub)" <gi...@apache.org>.
exceptionfactory commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1463958748

   Thanks @dan-s1, I plan to take a closer look at this pull request soon.
   
   The `excel-streaming-reader` library seems acceptable, although discarding styles, graphs, and other elements does not sound like a problem in itself, because those elements would not necessarily translate to record-oriented data for other services.




[GitHub] [nifi] emiliosetiadarma commented on a diff in pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "emiliosetiadarma (via GitHub)" <gi...@apache.org>.
emiliosetiadarma commented on code in PR #7016:
URL: https://github.com/apache/nifi/pull/7016#discussion_r1132959852


##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -299,55 +225,62 @@ public void process(InputStream inputStream) throws IOException {
         }
     }
 
+    private List<Integer> getColumnsToSkip(final ProcessContext context, FlowFile flowFile) {
+        final String[] columnsToSkip = StringUtils.split(context.getProperty(COLUMNS_TO_SKIP)
+                .evaluateAttributeExpressions(flowFile).getValue(), ",");
+
+        if (columnsToSkip != null) {
+            try {
+                return Arrays.stream(columnsToSkip)
+                        .map(columnToSkip -> Integer.parseInt(columnToSkip) - 1)
+                        .collect(Collectors.toList());
+            } catch (NumberFormatException e) {
+                throw new ProcessException("Invalid column in Columns to Skip list.", e);
+            }
+        }
+
+        return new ArrayList<>();
+    }
+
+    private Map<String, Boolean> getDesiredSheets(final ProcessContext context, FlowFile flowFile) {
+        final String desiredSheetsDelimited = context.getProperty(DESIRED_SHEETS).evaluateAttributeExpressions(flowFile).getValue();
+        if (desiredSheetsDelimited != null) {
+            String[] desiredSheets = StringUtils.split(desiredSheetsDelimited, DESIRED_SHEETS_DELIMITER);
+            if(desiredSheets != null) {
+                return Arrays.stream(desiredSheets)
+                        .collect(Collectors.toMap(key -> key, value -> Boolean.FALSE));
+            } else {
+                getLogger().debug("Excel document was parsed but no sheets with the specified desired names were found.");
+            }
+        }
+
+        return new HashMap<>();
+    }
 
     /**
      * Handles an individual Excel sheet from the entire Excel document. Each sheet will result in an individual flowfile.
      *
-     * @param session
-     *  The NiFi ProcessSession instance for the current invocation.
+     * @param session The NiFi ProcessSession instance for the current invocation.
      */
-    private void handleExcelSheet(ProcessSession session, FlowFile originalParentFF, final InputStream sheetInputStream, ExcelSheetReadConfig readConfig,
-                                  CSVFormat csvFormat) throws IOException {
+    private void handleExcelSheet(ProcessSession session, FlowFile originalParentFF, final Sheet sheet, ExcelSheetReadConfig readConfig,
+                                  CSVFormat csvFormat) {
 
         FlowFile ff = session.create(originalParentFF);
+        final SheetToCSV sheetHandler = new SheetToCSV(readConfig, csvFormat);
         try {
-            final DataFormatter formatter = new DataFormatter();
-            final InputSource sheetSource = new InputSource(sheetInputStream);
-
-            final SheetToCSV sheetHandler = new SheetToCSV(readConfig, csvFormat);
-
-            final XMLReader parser = SAXHelper.newXMLReader();
-
-            //If Value Formatting is set to false then don't pass in the styles table.
-            // This will cause the XSSF Handler to return the raw value instead of the formatted one.
-            final StylesTable sst = readConfig.getFormatValues()?readConfig.getStyles():null;
-
-            final XSSFSheetXMLHandler handler = new XSSFSheetXMLHandler(
-                    sst, null, readConfig.getSharedStringsTable(), sheetHandler, formatter, false);
-
-            parser.setContentHandler(handler);
-
-            ff = session.write(ff, new OutputStreamCallback() {
-                @Override
-                public void process(OutputStream out) throws IOException {
-                    PrintStream outPrint = new PrintStream(out, false, StandardCharsets.UTF_8.name());
-                    sheetHandler.setOutput(outPrint);
-
-                    try {
-                        parser.parse(sheetSource);
-
-                        sheetInputStream.close();
-
-                        sheetHandler.close();
-                        outPrint.close();
-                    } catch (SAXException se) {
-                        getLogger().error("Error occurred while processing Excel sheet {}", new Object[]{readConfig.getSheetName()}, se);
-                    }
-                }
+            ff = session.write(ff, out -> {
+                PrintStream outPrint = new PrintStream(out, false, StandardCharsets.UTF_8);
+                sheetHandler.setOutput(outPrint);
+                sheet.forEach(row -> {
+                    sheetHandler.startRow(row.getRowNum());
+                    row.forEach(sheetHandler::cell);
+                    sheetHandler.endRow();
+                });
+                sheetHandler.close();

Review Comment:
   Is the `outPrint.close()` call no longer needed after `sheetHandler.close()`?



##########
nifi-nar-bundles/nifi-poi-bundle/nifi-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java:
##########
@@ -199,93 +166,52 @@ public final List<PropertyDescriptor> getSupportedPropertyDescriptors() {
     @Override
     public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
         final FlowFile flowFile = session.get();
-        if ( flowFile == null ) {
+        if (flowFile == null) {
             return;
         }
 
-        final String desiredSheetsDelimited = context.getProperty(DESIRED_SHEETS).evaluateAttributeExpressions(flowFile).getValue();
+        final Map<String, Boolean> desiredSheets = getDesiredSheets(context, flowFile);

Review Comment:
   Wondering if we could use the primitive `boolean` type here. Based on a cursory read-through, it doesn't appear we need the Object version.





[GitHub] [nifi] dan-s1 commented on pull request #7016: [NIFI-10792] Fixed bug to allow for processing files larger than 10MB…

Posted by "dan-s1 (via GitHub)" <gi...@apache.org>.
dan-s1 commented on PR #7016:
URL: https://github.com/apache/nifi/pull/7016#issuecomment-1468229057

   @emiliosetiadarma @exceptionfactory I believe I addressed the issues brought up. Is there anything else needed?

