Posted to users@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2022/06/22 13:59:36 UTC

Most memory-efficient way in NiFi to fetch an entire RDBMS table?

We have a table with 68M records that will blow up to over 250M soon,
and need to do a full table fetch on it. What's the best practice for
efficiently doing a partial or full select on it?

Thanks,

Mike

Re: Most memory-efficient way in NiFi to fetch an entire RDBMS table?

Posted by Mike Thomsen <mi...@gmail.com>.
Thanks, Matt.


Re: Most memory-efficient way in NiFi to fetch an entire RDBMS table?

Posted by Matt Burgess <ma...@apache.org>.
Mike,

I recommend QueryDatabaseTableRecord with judicious choices for the
following properties:

Fetch Size: This should be tuned to return the largest number of rows
possible without causing network issues such as timeouts. It can be
set to the same value as Max Rows Per Flow File, ensuring one fetch
per outgoing FlowFile (see the sketch after this list).

Max Rows Per Flow File: This should be set to a reasonable number of
rows per FlowFile, maybe 100K or even 1M if that doesn't cause issues
(see above).

Output Batch Size: This is the key to doing full selects on huge
tables, as it allows FlowFiles to be committed to the session and
passed downstream while the rest of the fetch is being processed. In
your case, if you set Max Rows Per Flow File to 100K then this could
be 10, or if you set it to 1M it could be 1. Note that with this
property set, the maxvalue.* and fragment.count attributes will not
be set on these FlowFiles, so you can't merge them. I believe the
maximum-value state will still be updated even when this property is
used, so it should turn into an incremental fetch after the first
full fetch is complete.
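
For concreteness, here is a minimal sketch of those settings on
QueryDatabaseTableRecord. The table and column names are
hypothetical, the numbers are only illustrative, and you'd still
configure a Database Connection Pooling Service and a Record Writer:

    Table Name: my_big_table           (hypothetical)
    Maximum-value Columns: updated_at  (hypothetical; enables the
                                        incremental fetch noted above)
    Fetch Size: 100000
    Max Rows Per Flow File: 100000
    Output Batch Size: 10

Fetch Size is handed to the JDBC driver as a hint for how many rows
to pull per round trip, so rows stream through rather than being
buffered all at once. A rough plain-JDBC equivalent of what that one
setting controls, assuming a PostgreSQL driver (drivers differ;
MySQL's, for example, only honors it with cursor-based fetching
enabled):

    import java.sql.*;

    public class StreamFetchSketch {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL and credentials
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://host/db", "user", "pass")) {
                // PostgreSQL streams only when autocommit is off
                conn.setAutoCommit(false);
                try (Statement st = conn.createStatement()) {
                    // Hint: fetch 100K rows per round trip
                    st.setFetchSize(100_000);
                    try (ResultSet rs = st.executeQuery(
                            "SELECT * FROM my_big_table")) {
                        while (rs.next()) {
                            // Process one row at a time; the driver
                            // refills its buffer every 100K rows
                        }
                    }
                }
            }
        }
    }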

Regards,
Matt
