Posted to issues@nifi.apache.org by "Tim Späth (JIRA)" <ji...@apache.org> on 2017/09/15 11:42:00 UTC
[jira] [Updated] (NIFI-4385) Adjust the QueryDatabaseTable processor for handling big tables.
[ https://issues.apache.org/jira/browse/NIFI-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Späth updated NIFI-4385:
----------------------------
Description:
When querying large database tables, the *QueryDatabaseTable* processor does not perform well.
The processor always executes the full query and then transfers all flowfiles as a single list,
instead of transferring them incrementally as the *ResultSet* fetches the next rows (if a fetch size is given).
If you query a billion rows from a table,
the processor accumulates all flowfiles in an ArrayList<FlowFile> in memory
before transferring the whole list once the last row has been fetched from the ResultSet.
I've checked the code in *org.apache.nifi.processors.standard.QueryDatabaseTable.java*,
and in my opinion it would be straightforward to move the session.transfer call to the proper position
(into the while loop where each flowfile is added to the list) to achieve true _streaming_ support.
There was also an earlier bug report for this problem,
which resulted in the new property *Maximum Number of Fragments*,
but that property merely limits the result size:
you now have to multiply *Max Rows Per Flow File* by *Maximum Number of Fragments* to arrive at your limit,
which does not really solve the original problem, in my opinion.
The workaround of using the GenerateTableFetch and/or ExecuteSQL processors is also much slower
than using a database cursor or a ResultSet and streaming the rows into flowfiles directly on the queue.
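To illustrate the proposed change, here is a minimal sketch that contrasts the two behaviours. The FlowFile and Session types below are simplified stand-ins invented for this example, not the actual NiFi API; the point is only where the transfer call sits relative to the fetch loop.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: simplified stand-ins for NiFi's FlowFile/ProcessSession,
// showing "accumulate then transfer" vs "transfer inside the loop".
public class StreamingTransferSketch {

    static class FlowFile {
        final int batch;
        FlowFile(int batch) { this.batch = batch; }
    }

    static class Session {
        final List<FlowFile> transferred = new ArrayList<>();
        void transfer(FlowFile ff) { transferred.add(ff); }
        void transferAll(List<FlowFile> ffs) { transferred.addAll(ffs); }
    }

    // Current behaviour: every flowfile is collected in an in-memory list
    // and transferred only after the last row has been fetched.
    // Returns the peak number of flowfiles held in memory at once.
    static int runAccumulating(Session session, int batches) {
        List<FlowFile> resultSetFlowFiles = new ArrayList<>();
        for (int i = 0; i < batches; i++) {
            resultSetFlowFiles.add(new FlowFile(i)); // list grows with the table
        }
        int peak = resultSetFlowFiles.size();
        session.transferAll(resultSetFlowFiles);     // transfer only at the end
        return peak;
    }

    // Proposed behaviour: move session.transfer into the fetch loop so each
    // flowfile is handed to the downstream queue as soon as it is complete.
    static int runStreaming(Session session, int batches) {
        for (int i = 0; i < batches; i++) {
            session.transfer(new FlowFile(i));       // at most one in flight
        }
        return 1;                                    // peak flowfiles held
    }
}
```

Both variants hand the same flowfiles to the session; the difference is that the streaming variant's in-memory footprint stays constant instead of growing with the table size.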
> Adjust the QueryDatabaseTable processor for handling big tables.
> ----------------------------------------------------------------
>
> Key: NIFI-4385
> URL: https://issues.apache.org/jira/browse/NIFI-4385
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Affects Versions: 1.3.0
> Reporter: Tim Späth
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)