You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@nifi.apache.org by GitBox <gi...@apache.org> on 2022/10/20 11:47:33 UTC

[GitHub] [nifi-minifi-cpp] lordgamez opened a new pull request, #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

lordgamez opened a new pull request, #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439

   https://issues.apache.org/jira/browse/MINIFICPP-1967
   
   ------------------------------------
   Thank you for submitting a contribution to Apache NiFi - MiNiFi C++.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced
        in the commit message?
   
   - [ ] Does your PR title start with MINIFICPP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically main)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the LICENSE file?
   - [ ] If applicable, have you updated the NOTICE file?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions CI results for build issues and submit an update to your PR as soon as possible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] lordgamez commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
lordgamez commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1011717083


##########
PROCESSORS.md:
##########
@@ -2424,6 +2424,7 @@ In the list below, the names of required properties appear in bold. Any other pr
 | State File                 | TailFileState     |                                                        | Specifies the file that should be used for storing state about what data has been ingested so that upon restart NiFi can resume from where it left off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
 | tail-base-directory        |                   |                                                        | Base directory used to look for files to tail. This property is required when using Multiple file mode. Can contain expression language placeholders if Attribute Provider Service is set.<br/>**Supports Expression Language: true**                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
 | **tail-mode**              | Single file       | Single file<br>Multiple file<br>                       | Specifies the tail file mode. In 'Single file' mode only a single file will be watched. In 'Multiple file' mode a regex may be used. Note that in multiple file mode we will still continue to watch for rollover on the initial set of watched files. The Regex used to locate multiple files will be run during the schedule phrase. Note that if rotated files are matched by the regex, those files will be tailed.                                                                                                                                                                                                                                                                                            |
+| **Batch Size**             | 0                 |                                                        | Maximum number of lines to process in a single trigger. If set to 0 all new lines will be processed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

Review Comment:
   After reading the comments I don't think we can really talk about "lines" specifically in this processor, as the flow file contents are fully controlled by the delimiter property. So I'm not sure it would be a good idea to have a separate property that controls lines specifically. The `Batch Size` property name is consistently used in all processors to mark the number of flow files emitted at once by the processor so it should represent the same here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] adamdebreceni commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
adamdebreceni commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1008109715


##########
extensions/standard-processors/processors/TailFile.cpp:
##########
@@ -135,6 +135,13 @@ const core::Property TailFile::AttributeProviderService(
         ->asType<minifi::controllers::AttributeProviderService>()
         ->build());
 
+const core::Property TailFile::BatchSize(
+    core::PropertyBuilder::createProperty("Batch Size")
+        ->withDescription("Maximum number of lines to process in a single trigger. If set to 0 all new lines will be processed.")

Review Comment:
   should we change this to "number of flowfiles emitted" or similar? with the input delimiter the user could break the "flowfile == line" correspondence 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] arpadboda closed pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
arpadboda closed pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] szaszm commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
szaszm commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1011894988


##########
PROCESSORS.md:
##########
@@ -2424,6 +2424,7 @@ In the list below, the names of required properties appear in bold. Any other pr
 | State File                 | TailFileState     |                                                        | Specifies the file that should be used for storing state about what data has been ingested so that upon restart NiFi can resume from where it left off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
 | tail-base-directory        |                   |                                                        | Base directory used to look for files to tail. This property is required when using Multiple file mode. Can contain expression language placeholders if Attribute Provider Service is set.<br/>**Supports Expression Language: true**                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
 | **tail-mode**              | Single file       | Single file<br>Multiple file<br>                       | Specifies the tail file mode. In 'Single file' mode only a single file will be watched. In 'Multiple file' mode a regex may be used. Note that in multiple file mode we will still continue to watch for rollover on the initial set of watched files. The Regex used to locate multiple files will be run during the schedule phrase. Note that if rotated files are matched by the regex, those files will be tailed.                                                                                                                                                                                                                                                                                            |
+| **Batch Size**             | 0                 |                                                        | Maximum number of lines to process in a single trigger. If set to 0 all new lines will be processed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

Review Comment:
   My bad, the Batch Size property indeed means the same in all other processors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] szaszm commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
szaszm commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1011706756


##########
PROCESSORS.md:
##########
@@ -2424,6 +2424,7 @@ In the list below, the names of required properties appear in bold. Any other pr
 | State File                 | TailFileState     |                                                        | Specifies the file that should be used for storing state about what data has been ingested so that upon restart NiFi can resume from where it left off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
 | tail-base-directory        |                   |                                                        | Base directory used to look for files to tail. This property is required when using Multiple file mode. Can contain expression language placeholders if Attribute Provider Service is set.<br/>**Supports Expression Language: true**                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
 | **tail-mode**              | Single file       | Single file<br>Multiple file<br>                       | Specifies the tail file mode. In 'Single file' mode only a single file will be watched. In 'Multiple file' mode a regex may be used. Note that in multiple file mode we will still continue to watch for rollover on the initial set of watched files. The Regex used to locate multiple files will be run during the schedule phrase. Note that if rotated files are matched by the regex, those files will be tailed.                                                                                                                                                                                                                                                                                            |
+| **Batch Size**             | 0                 |                                                        | Maximum number of lines to process in a single trigger. If set to 0 all new lines will be processed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

Review Comment:
   I would update the property name as well. This leaves the door open for a future property that controls how many lines are emitted in a single flow file, with a similar same name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] szaszm commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
szaszm commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1001724862


##########
PROCESSORS.md:
##########
@@ -2424,6 +2424,7 @@ In the list below, the names of required properties appear in bold. Any other pr
 | State File                 | TailFileState     |                                                        | Specifies the file that should be used for storing state about what data has been ingested so that upon restart NiFi can resume from where it left off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
 | tail-base-directory        |                   |                                                        | Base directory used to look for files to tail. This property is required when using Multiple file mode. Can contain expression language placeholders if Attribute Provider Service is set.<br/>**Supports Expression Language: true**                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
 | **tail-mode**              | Single file       | Single file<br>Multiple file<br>                       | Specifies the tail file mode. In 'Single file' mode only a single file will be watched. In 'Multiple file' mode a regex may be used. Note that in multiple file mode we will still continue to watch for rollover on the initial set of watched files. The Regex used to locate multiple files will be run during the schedule phrase. Note that if rotated files are matched by the regex, those files will be tailed.                                                                                                                                                                                                                                                                                            |
+| **Batch Size**             | 0                 |                                                        | Maximum number of lines to process in a single trigger. If set to 0 all new lines will be processed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

Review Comment:
   I think it would make sense to combine this with another property that controls whether a batch is emitted as a single flow file or one flow file per line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] martinzink commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
martinzink commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1004113991


##########
extensions/standard-processors/tests/unit/TailFileTests.cpp:
##########
@@ -1847,3 +1847,38 @@ TEST_CASE("TailFile can use an AttributeProviderService", "[AttributeProviderSer
 
   LogTestController::getInstance().reset();
 }
+
+TEST_CASE("TailFile honors batch size for maximum lines processed", "[batchSize]") {
+  TestController testController;
+  LogTestController::getInstance().setTrace<minifi::processors::TailFile>();
+  LogTestController::getInstance().setDebug<minifi::processors::LogAttribute>();
+
+  std::shared_ptr<TestPlan> plan = testController.createPlan();
+  std::shared_ptr<core::Processor> tailfile = plan->addProcessor("TailFile", "TailFile");
+  auto logattribute = plan->addProcessor("LogAttribute", "LogAttribute", core::Relationship("success", "description"), true);
+  plan->setProperty(logattribute, org::apache::nifi::minifi::processors::LogAttribute::FlowFilesToLog.getName(), "0");
+
+  auto dir = testController.createTempDirectory();
+  std::stringstream temp_file;
+  temp_file << dir << utils::file::get_separator() << TMP_FILE;
+
+  std::ofstream tmpfile;
+  tmpfile.open(temp_file.str(), std::ios::out | std::ios::binary);
+  for (auto i = 0; i < 20; ++i) {
+    tmpfile << NEW_TAIL_DATA;
+  }
+  tmpfile.close();
+
+  std::stringstream state_file;
+  state_file << dir << utils::file::get_separator() << STATE_FILE;
+
+  plan->setProperty(tailfile, org::apache::nifi::minifi::processors::TailFile::FileName.getName(), temp_file.str());
+  plan->setProperty(tailfile, org::apache::nifi::minifi::processors::TailFile::Delimiter.getName(), "\n");
+  plan->setProperty(tailfile, org::apache::nifi::minifi::processors::TailFile::BatchSize.getName(), "10");
+
+  testController.runSession(plan);
+
+  REQUIRE(LogTestController::getInstance().contains("Logged 10 flow files"));

Review Comment:
   I think it would be better to utilize the https://github.com/apache/nifi-minifi-cpp/blob/main/libminifi/test/SingleProcessorTestController.h then you could directly check the number of flowfiles tailfile generates per trigger, instead of the indirect LogAttribute approach, and this test could be much simpler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] lordgamez commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
lordgamez commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1011357368


##########
PROCESSORS.md:
##########
@@ -2424,6 +2424,7 @@ In the list below, the names of required properties appear in bold. Any other pr
 | State File                 | TailFileState     |                                                        | Specifies the file that should be used for storing state about what data has been ingested so that upon restart NiFi can resume from where it left off                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
 | tail-base-directory        |                   |                                                        | Base directory used to look for files to tail. This property is required when using Multiple file mode. Can contain expression language placeholders if Attribute Provider Service is set.<br/>**Supports Expression Language: true**                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
 | **tail-mode**              | Single file       | Single file<br>Multiple file<br>                       | Specifies the tail file mode. In 'Single file' mode only a single file will be watched. In 'Multiple file' mode a regex may be used. Note that in multiple file mode we will still continue to watch for rollover on the initial set of watched files. The Regex used to locate multiple files will be run during the schedule phrase. Note that if rotated files are matched by the regex, those files will be tailed.                                                                                                                                                                                                                                                                                            |
+| **Batch Size**             | 0                 |                                                        | Maximum number of lines to process in a single trigger. If set to 0 all new lines will be processed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

Review Comment:
   If the delimiter already control how the lines are processed, is it okay if we only change the description as @adamdebreceni suggested?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [nifi-minifi-cpp] lordgamez commented on a diff in pull request #1439: MINIFICPP-1967 Add batch processing of lines in TailFile

Posted by GitBox <gi...@apache.org>.
lordgamez commented on code in PR #1439:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1439#discussion_r1011482503


##########
extensions/standard-processors/processors/TailFile.cpp:
##########
@@ -135,6 +135,13 @@ const core::Property TailFile::AttributeProviderService(
         ->asType<minifi::controllers::AttributeProviderService>()
         ->build());
 
+const core::Property TailFile::BatchSize(
+    core::PropertyBuilder::createProperty("Batch Size")
+        ->withDescription("Maximum number of lines to process in a single trigger. If set to 0 all new lines will be processed.")

Review Comment:
   Good point, updated in 24f9ab5c32083fea259b4b1e0889bd93f20e8cda



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@nifi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org