Posted to users@nifi.apache.org by Juan Pablo Gardella <ga...@gmail.com> on 2018/03/08 20:20:36 UTC

Add Max Rows Per Flow File into ExecuteSQL

Hello team,

I would like to add "Max Rows Per Flow File" to the ExecuteSQL processor. I
can create a JIRA and spend some time on that. But before doing this, I would
like to know if anyone on the team sees a problem with that, or if the
omission is intentional.

I found that option useful in some use cases.

Thanks,
Juan

Re: Add Max Rows Per Flow File into ExecuteSQL

Posted by Matt Burgess <ma...@apache.org>.
Juan,

Glad to hear of your interest in this! Strangely, it seems to be a
popular feature (see the existing Jira [1]) but so far there hasn't
been a PR to address it. This has been done for QueryDatabaseTable,
and one workaround is to use QueryDatabaseTable without specifying a
Maximum Value Column, but that has its own drawbacks (it should only
run on the primary node, doesn't allow incoming connections, and
doesn't yet support arbitrary queries [2]).

One thing I would keep in mind is whether to just break up the result
set into multiple flow files (once Max Rows Per Flow File is reached,
transfer the flow file and open a new flow file), or whether to also
support committing the session after X flow files have been written.
These can certainly be separate features/Jiras and feel free to only
tackle the multiple flow files aspect. I only mention it because we
did the same thing for QueryDatabaseTable. With large result sets,
whether you break them up into multiple flow files or not, they will
not be transferred downstream until the session is committed. If that
is done after all rows are processed, the downstream processors will
be waiting until all flow files are ready. The upside of committing
only at the end is that each flow file can carry the total row count.
However, if the point of Max Rows Per Flow File is to let downstream
processing start sooner, you could give up the total count on each
flow file in exchange for committing every X flow files, so they can
be processed downstream while ExecuteSQL is still processing more
rows.
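
To make the two ideas concrete, here is a rough sketch in plain Python
(not NiFi code). It models only the grouping logic: rows are split
into flow files of at most N rows, and flow files are grouped into
commits of at most X. The names chunked, simulate,
max_rows_per_flow_file and flow_files_per_commit are made up for
illustration, and treating 0 as "no limit" is an assumption, not a
statement about how the processor behaves.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items; size 0 means one list with everything."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if size > 0 and len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def simulate(rows: Iterable[T],
             max_rows_per_flow_file: int,
             flow_files_per_commit: int) -> List[List[List[T]]]:
    """Return a list of commits; each commit is a list of flow files (lists of rows).

    Hypothetical model only: a "commit" here stands for the point at which
    the flow files in that group would become visible downstream.
    """
    flow_files = chunked(rows, max_rows_per_flow_file)
    return list(chunked(flow_files, flow_files_per_commit))


if __name__ == "__main__":
    # 10 rows, 3 rows per flow file, commit every 2 flow files
    for i, commit in enumerate(simulate(range(10), 3, 2), start=1):
        print("commit", i, [len(ff) for ff in commit])
```

With flow_files_per_commit set to 0 everything lands in a single
commit at the end, which is the case where the total row count is
known before anything goes downstream. With a positive value, earlier
flow files are released before the total is known.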

Sorry for the long response, just wanted to get my thoughts down
before I forget :)  Let's use the dev list for any further discussion
about this, since it's more geared towards development. Thanks, and
looking forward to your contributions!


Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-1251
[2] https://issues.apache.org/jira/browse/NIFI-1706
