Posted to users@nifi.apache.org by Vibhath Ileperuma <vi...@gmail.com> on 2021/05/09 17:11:02 UTC

Issue when extracting a large dataset into CSV files

Hi All,

I'm using NiFi to extract a PostgreSQL table into a CSV file. I'm using an
ExecuteSQLRecord processor with a CSVRecordSetWriter to extract the data in
CSV format and a PutFile processor to write the CSV file to the machine.
A single SQL query can fetch around 60 million rows.

After I started the ExecuteSQLRecord processor, within several minutes the
NiFi web UI became unresponsive. When I checked the memory, only about
250-300MB of RAM was free. I have given 50GB of heap space in the bootstrap
configs. It seems like NiFi loads all the records into RAM. Is there a way
to write row by row without loading all the records into memory?

Best Regards,
Vibhath.

Re: Issue when extracting a large dataset into CSV files

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi Jim,

Thank you for your kind suggestion.

However, according to the PostgreSQL JDBC driver documentation I shared in
my previous email, the fetchSize option works as we expect only if
autocommit is set to false.
I tried to set it via 'options' as you suggested, but it seems autocommit
cannot be set to false that way. The memory growth is still there with this
setting.

Thank You.
Vibhath




RE: Issue when extracting a large dataset into CSV files

Posted by Jim Williams <jw...@alertlogic.com>.
Vibhath,

I will preface this by saying that I do not have PostgreSQL in my environment to test with.  A couple of things you may try to help get past the large memory usage:

The DBCPConnectionPool controller service gives an opportunity to set JDBC connection parameters [1].

An available connection parameter for PostgreSQL is “defaultRowFetchSize” [2].  The default for this parameter is zero, which fetches all rows.  Perhaps setting this parameter to limit the number of rows fetched will help with the issue of fetching all rows at once?

Also documented in [2] is the “options” connection initialization parameter.  It may be worth trying a JDBC property with the name “options” and the value “autoCommit=false” in the DBCPConnectionPool.  That may allow the number of rows fetched to be set to a reasonable number in the ExecuteSQLRecord processor.

Either of those options would not require any code changes, which will help to speed your data flow implementation to completion.
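
For illustration, here is a minimal plain-JDBC sketch of what passing the
“defaultRowFetchSize” connection parameter amounts to. This is not NiFi's
own code; the host, database, credentials, and the 10000 value are
hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    public class ConnectionParamSketch {
        public static void main(String[] args) throws SQLException {
            Properties props = new Properties();
            props.setProperty("user", "nifi");       // hypothetical credentials
            props.setProperty("password", "secret");
            // Same effect as adding defaultRowFetchSize=10000 as a JDBC
            // connection parameter on the DBCPConnectionPool service; it
            // could equally be appended to the JDBC URL as
            // jdbc:postgresql://host:5432/db?defaultRowFetchSize=10000
            props.setProperty("defaultRowFetchSize", "10000");

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", props)) {
                // Note: per the driver docs [2], block-wise fetching only
                // takes effect when autocommit is off for the connection.
                conn.setAutoCommit(false);
                // ... run queries here ...
                conn.commit();
            }
        }
    }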


[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-dbcp-service-nar/1.13.2/org.apache.nifi.dbcp.DBCPConnectionPool/index.html

[2] https://jdbc.postgresql.org/documentation/head/connect.html



Warm regards,

Jim Williams | Manager, Site Reliability Engineering
jwilliams@alertlogic.com | alertlogic.com


Re: Issue when extracting a large dataset into CSV files

Posted by Chris Sampson <ch...@naimuri.com>.
Vibhath,

Probably worth raising an issue on NiFi's Jira board [1] for someone to
look at. Hopefully someone will take a look at implementing a change;
maybe the configuration you identified should be exposed via a processor
property (but I don't currently use that processor, so I don't know what's
best).

Of course, you could submit a PR with such a change for review if you feel
confident making the change in the Java code.


[1] https://issues.apache.org/jira/projects/NIFI/


---
*Chris Sampson*
IT Consultant
chris.sampson@naimuri.com



Re: Issue when extracting a large dataset into CSV files

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi All,

According to the PostgreSQL JDBC driver documentation
(https://jdbc.postgresql.org/documentation/head/query.html), the driver
collects all the results for a query at once by default. Caching a small
number of records and fetching the next block of rows once the cached rows
are processed is possible only if the connection's autocommit is turned
off.

The NiFi ExecuteSQLRecord processor does not turn autocommit off
(conn.setAutoCommit(false)). Hence, if a query returns a large data set,
the driver loads all the data into memory, causing the large heap growth
described above.
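
As a minimal sketch of the driver behaviour described above (plain JDBC;
the connection details and table name are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class CursorStreamingSketch {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "nifi", "secret")) {
                // Without this line the PostgreSQL driver ignores the fetch
                // size and buffers the entire result set in memory.
                conn.setAutoCommit(false);
                try (Statement st = conn.createStatement()) {
                    st.setFetchSize(10_000); // fetch 10k rows per round trip
                    try (ResultSet rs = st.executeQuery(
                            "SELECT * FROM big_table")) {
                        while (rs.next()) {
                            // rows arrive block by block through a
                            // server-side cursor, so heap usage stays bounded
                        }
                    }
                }
                conn.commit(); // the cursor lives inside this transaction
            }
        }
    }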

I would be grateful if you could check and fix this issue in the next
releases.


Thank You.
Best regards,
Vibhath.



Re: Issue when extracting a large dataset into CSV files

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi All,

To further debug this issue, I set the fetch size to 50 and Max Rows Per
Flow File to 100, as shown in the image below.

[image: image.png]

I have set the initial heap size to 8GB and the max heap size to 15GB in
the bootstrap.conf file.
I started monitoring CPU and memory with the 'top' command before starting
the ExecuteSQLRecord processor. *Initially there was about 48GB of free
memory* and NiFi's CPU usage was low.

When ExecuteSQLRecord is started, the free memory starts to shrink little
by little. Until it dropped to the 35-36GB range, NiFi's CPU usage was
around 25%-30% and the NiFi web UI worked perfectly fine. But when free
memory reaches that range, NiFi's CPU usage suddenly jumps to 1300% and the
web UI stops responding.

It seems the heap fills up within a short time, and the GC blocks all other
threads while trying to clean it; moreover, as far as I could see, a
considerable amount of heap never gets reclaimed. I increased the heap size
up to 50GB and the same issue occurs after some time. Note that even though
I set the batch size property to 1, I couldn't see a single FlowFile come
out before the web UI stopped responding.

I have checked the GC log and it also shows the large heap growth; I have
attached it to this email.

Could you please assist me in resolving this issue?

Thanks & Regards,

Vibhath





Re: Issue when extracting a large dataset into CSV files

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi Matt,

I have set the fetch size to 20000, and I also tried small values like 100;
I got the same issue with the small sizes.
Further, I limited the row count fetched by the SQL query by adding a LIMIT
clause. Then I didn't get the issue, even with large fetch sizes.

Thank you.



Re: Issue when extracting a large dataset into CSV files

Posted by Matt Burgess <ma...@apache.org>.
Vibhath,

What is the "Fetch Size" property set to? It looks like PostgreSQL
will load all results if Fetch Size is set to zero [1]. Try setting it
to 10000 or something like that, whatever doesn't use too much memory
but doesn't slow down the performance too much.

Regards,
Matt

[1] https://jdbc.postgresql.org/documentation/head/query.html


Re: Issue when extracting a large dataset into CSV files

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi Vijay,

Even though that property is set to a positive value, the same issue
happens. It seems NiFi first loads all the rows into RAM and then writes
them to multiple files when that property is set to a non-zero value.
On the other hand, if this property is set to a very small value, a large
number of FlowFiles can be generated at once.

Thank you.
Best regards,
Vibhath


Re: Issue when extracting a large dataset into CSV files

Posted by Vijay Chhipa <vc...@apple.com>.
Hi Vibhath, 

There is this property on the processor:

Max Rows Per Flow File

Per the docs: "If the value specified is zero, then all rows are returned
in a single FlowFile."
That seems to be what is happening in your case.


