Posted to users@nifi.apache.org by Vibhath Ileperuma <vi...@gmail.com> on 2021/04/14 06:36:29 UTC

Nifi throws an error when reading a large csv file

Hi All,

I'm using a SplitRecord processor with a CSV Reader and a CSV
RecordSetWriter to split a large csv file (5.5GB-6GB) into multiple small
csv files. When I start the processor, the below exception is thrown.

"failed to process session due to Requested array size exceeds VM limit;
Processor Administratively Yielded for 1 sec: java.lang.OutOfMemoryError:
Requested array size exceeds VM limit"


I would be grateful if someone can suggest a way to overcome this error.

Thanks & Regards

*Vibhath Ileperuma*

Re: Nifi throws an error when reading a large csv file

Posted by Chris Sampson <ch...@naimuri.com>.
For splitting large files, it's often recommended to use a multi-stage
approach.

For example, if your file contains 1_000_000 records and you want to split
it into 1 record per FlowFile, it would be better to split into batches of,
say, 1_000 records using SplitRecord and then split each of those FlowFiles
again into files of 1 record each. But bear in mind that this is going to
result in 1_000_000 FlowFiles in the Flow, which is unlikely to be very
performant.
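
A rough sketch of that two-stage setup (property names as in the SplitRecord
documentation; the batch sizes are only examples):

  Stage 1 - SplitRecord
    Record Reader     = CSVReader
    Record Writer     = CSVRecordSetWriter
    Records Per Split = 1000
  Stage 2 - SplitRecord (fed from the 'splits' relationship of stage 1)
    Record Reader     = CSVReader
    Record Writer     = CSVRecordSetWriter
    Records Per Split = 1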

While you may not be trying to split into individual records, the error
suggests you're trying to create too many FlowFiles from an incoming file
in a single operation. All FlowFiles created by a processor in a single
session (i.e. a single run of the processor) are held in memory until the
session is committed - each FlowFile uses an open file descriptor, so it's
common to see OS/VM level errors like these in such scenarios.

The general recommendation is to try and use Record-based processors
throughout your Flow in order to avoid the need to Split/Merge file content
(but this isn't always possible, depending upon your use case and the
processors available in your version).

---
*Chris Sampson*
IT Consultant
chris.sampson@naimuri.com
<https://www.naimuri.com/>


On Wed, 14 Apr 2021 at 07:36, Vibhath Ileperuma <vi...@gmail.com>
wrote:

> Hi All,
>
> I'm using a SplitRecord processor with a CSV Reader and a CSV
> RecordSetWriter to split a large csv file (5.5GB-6GB) into multiple small
> csv files. When I start the processor, the below exception is thrown.
>
> "failed to process session due to Requested array size exceeds VM limit;
> Processor Administratively Yielded for 1 sec: java.lang.OutOfMemoryError:
> Requested array size exceeds VM limit"
>
>
> I would be grateful if someone can suggest a way to overcome this error.
>
> Thanks & Regards
>
> *Vibhath Ileperuma*
>

Re: Nifi throws an error when reading a large csv file

Posted by Joe Witt <jo...@gmail.com>.
How large is each line expected to be?  You could have a massive line,
or one much larger than expected.  Or you could be creating far more
FlowFiles than intended.  If you cut the file down in size, does it work
better?  We'll need more data to help narrow it down, but obviously we're
all very interested to know what is happening.  These processors and the
readers/writers are meant to be quite bulletproof and to handle very,
very large data easily in most cases.
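
A quick way to check the longest line outside NiFi (the path below is just a
placeholder) is something like:

  awk '{ if (length($0) > max) max = length($0) } END { print max }' /path/to/large.csv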

On Wed, Apr 14, 2021 at 10:07 AM Vibhath Ileperuma
<vi...@gmail.com> wrote:
>
> Hi Chris,
>
> As you have mentioned, I am trying to split the large csv file in multiple stages. But this error is thrown at the first stage even without creating a single flow file.
> It seems like the issue is not with the processor, but with the CSV record reader. This error is thrown while reading the csv file. I tried to write the data in the large csv file into a kudu table using a putKudu processor with the same CSV reader. Then also I got the same error message.
>
> Hi Otto,
>
> Only following information is available in log file related to the exception
>
> 2021-04-14 17:48:28,628 ERROR [Timer-Driven Process Thread-1] o.a.nifi.processors.standard.SplitRecord SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] failed to process session due to java.lang.OutOfMemoryError: Requested array size exceeds VM limit; Processor Administratively Yielded for 1 sec: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> 2021-04-14 17:48:28,628 WARN [Timer-Driven Process Thread-1] o.a.n.controller.tasks.ConnectableTask Administratively Yielding SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] due to uncaught Exception: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
>
>
>
> On Wed, Apr 14, 2021 at 7:47 PM Otto Fowler <ot...@gmail.com> wrote:
>>
>> What is the complete stack trace of that exception?
>>
>> On Apr 14, 2021, at 02:36, Vibhath Ileperuma <vi...@gmail.com> wrote:
>>
>> Requested array size exceeds VM limit
>>
>>

Re: Nifi throws an error when reading a large csv file

Posted by Otto Fowler <ot...@gmail.com>.
It would be good to get the stack trace, or rather a more complete one, to see where the array is being created.

How many columns does the file have?
How are you providing the schema?
Which csv parser have you configured?
Can you reproduce with a smaller file size (see the split command on linux)?
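
For example (file names are placeholders):

  split -l 1000000 large.csv sample_      # cut the file into 1,000,000-line pieces
  head -n 100000 large.csv > sample.csv   # or just keep the first 100,000 lines as a sample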


> On Apr 14, 2021, at 13:07, Vibhath Ileperuma <vi...@gmail.com> wrote:
> 
> Hi Chris,
> 
> As you have mentioned, I am trying to split the large csv file in multiple stages. But this error is thrown at the first stage even without creating a single flow file. 
> It seems like the issue is not with the processor, but with the CSV record reader. This error is thrown while reading the csv file. I tried to write the data in the large csv file into a kudu table using a putKudu processor with the same CSV reader. Then also I got the same error message.
> 
> Hi Otto,
> 
> Only following information is available in log file related to the exception
> 
> 2021-04-14 17:48:28,628 ERROR [Timer-Driven Process Thread-1] o.a.nifi.processors.standard.SplitRecord SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] failed to process session due to java.lang.OutOfMemoryError: Requested array size exceeds VM limit; Processor Administratively Yielded for 1 sec: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> 
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> 
> 2021-04-14 17:48:28,628 WARN [Timer-Driven Process Thread-1] o.a.n.controller.tasks.ConnectableTask Administratively Yielding SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] due to uncaught Exception: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> 
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> 
> Thanks & Regards
> Vibhath Ileperuma
> 
> 
> 
> On Wed, Apr 14, 2021 at 7:47 PM Otto Fowler <ottobackwards@gmail.com <ma...@gmail.com>> wrote:
> What is the complete stack trace of that exception?
> 
>> On Apr 14, 2021, at 02:36, Vibhath Ileperuma <vibhatharunapriya@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Requested array size exceeds VM limit
> 


Re: Nifi throws an error when reading a large csv file

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi all,

Sorry for the late reply. It seems like the issue is with the csv files
themselves. I generated the csv files using the PostgreSQL 'COPY' command.
According to the PostgreSQL documentation, the double quote is the default
value for both the quote character and the escape character for this command.
It seems that when the quote and escape characters are the same, the NiFi CSV
reader gets confused and splits the record into many columns, causing this
issue.
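
For illustration (a made-up row, not from my data), the two conventions look
like this:

  "a","text with ""embedded"" quotes","c"    <- quote doubling (quote char == escape char)
  "a","text with \"embedded\" quotes","c"    <- backslash used as the escape character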

To overcome the issue, I changed the escape character to a backslash when
generating the CSVs with the COPY command. But PostgreSQL inserts the escape
character only when the quote character appears in the data; it doesn't
escape any other special character (the delimiter, the escape character
itself, etc.). Hence, if a string value ends with a backslash, the following
delimiter gets escaped when NiFi reads the file.
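
For reference, the COPY form I mean is something like the following (table
name and path are placeholders; options as described in the linked
documentation):

  COPY my_table TO '/tmp/my_table.csv' WITH (FORMAT csv, HEADER, ESCAPE E'\\');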

Has anyone tried to read PostgreSQL-generated CSVs from NiFi? Is there any
workaround for this issue?

PostgreSQL COPY command documentation:
https://www.postgresql.org/docs/13/sql-copy.html

Thanks & Regards

*Vibhath Ileperuma*





On Thu, Apr 15, 2021 at 4:59 AM Mike Thomsen <mi...@gmail.com> wrote:

> I could be totally barking up the wrong tree, but I think this is our
> clue: Requested array size exceeds VM limit
>
> That means that something is causing the reader to try to allocate an
> array with a number of entries greater than the VM allows.
>
> Without seeing the schema, a sample of the CSV and a stacktrace it's
> pretty hard to guess what's going on. For what it's worth, I've split
> 55GB JSON sets using a custom streaming JSON reader without a hiccup
> on a NiFi instance with only 4-8GB of RAM allocated, so I'm fairly
> confident we've got some quirky edge case here.
>
> If you want to sanitize some inputs and share along with a schema that
> might help.
>
> On Wed, Apr 14, 2021 at 1:07 PM Vibhath Ileperuma
> <vi...@gmail.com> wrote:
> >
> > Hi Chris,
> >
> > As you have mentioned, I am trying to split the large csv file in
> multiple stages. But this error is thrown at the first stage even without
> creating a single flow file.
> > It seems like the issue is not with the processor, but with the CSV
> record reader. This error is thrown while reading the csv file. I tried to
> write the data in the large csv file into a kudu table using a putKudu
> processor with the same CSV reader. Then also I got the same error message.
> >
> > Hi Otto,
> >
> > Only following information is available in log file related to the
> exception
> >
> > 2021-04-14 17:48:28,628 ERROR [Timer-Driven Process Thread-1]
> o.a.nifi.processors.standard.SplitRecord
> SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34]
> SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] failed to process
> session due to java.lang.OutOfMemoryError: Requested array size exceeds VM
> limit; Processor Administratively Yielded for 1 sec:
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> >
> > java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> >
> > 2021-04-14 17:48:28,628 WARN [Timer-Driven Process Thread-1]
> o.a.n.controller.tasks.ConnectableTask Administratively Yielding
> SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] due to uncaught
> Exception: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> >
> > java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> >
> > Thanks & Regards
> >
> > Vibhath Ileperuma
> >
> >
> >
> >
> > On Wed, Apr 14, 2021 at 7:47 PM Otto Fowler <ot...@gmail.com>
> wrote:
> >>
> >> What is the complete stack trace of that exception?
> >>
> >> On Apr 14, 2021, at 02:36, Vibhath Ileperuma <
> vibhatharunapriya@gmail.com> wrote:
> >>
> >> Requested array size exceeds VM limit
> >>
> >>
>

Re: Nifi throws an error when reading a large csv file

Posted by Mike Thomsen <mi...@gmail.com>.
I could be totally barking up the wrong tree, but I think this is our
clue: Requested array size exceeds VM limit

That means that something is causing the reader to try to allocate an
array with a number of entries greater than the VM allows.

Without seeing the schema, a sample of the CSV and a stacktrace, it's
pretty hard to guess what's going on. For what it's worth, I've split
55GB JSON sets using a custom streaming JSON reader without a hiccup
on a NiFi instance with only 4-8GB of RAM allocated, so I'm fairly
confident we've got some quirky edge case here.
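
(For reference, NiFi's heap is configured in conf/bootstrap.conf, roughly as
below - the argument numbering can vary between versions. Note that more heap
usually won't cure this particular error, since it is the size of a single
requested array that exceeds the JVM's limit.)

  java.arg.2=-Xms4g
  java.arg.3=-Xmx8g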

If you want to sanitize some inputs and share them along with a schema,
that might help.

On Wed, Apr 14, 2021 at 1:07 PM Vibhath Ileperuma
<vi...@gmail.com> wrote:
>
> Hi Chris,
>
> As you have mentioned, I am trying to split the large csv file in multiple stages. But this error is thrown at the first stage even without creating a single flow file.
> It seems like the issue is not with the processor, but with the CSV record reader. This error is thrown while reading the csv file. I tried to write the data in the large csv file into a kudu table using a putKudu processor with the same CSV reader. Then also I got the same error message.
>
> Hi Otto,
>
> Only following information is available in log file related to the exception
>
> 2021-04-14 17:48:28,628 ERROR [Timer-Driven Process Thread-1] o.a.nifi.processors.standard.SplitRecord SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] failed to process session due to java.lang.OutOfMemoryError: Requested array size exceeds VM limit; Processor Administratively Yielded for 1 sec: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> 2021-04-14 17:48:28,628 WARN [Timer-Driven Process Thread-1] o.a.n.controller.tasks.ConnectableTask Administratively Yielding SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] due to uncaught Exception: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
>
>
>
> On Wed, Apr 14, 2021 at 7:47 PM Otto Fowler <ot...@gmail.com> wrote:
>>
>> What is the complete stack trace of that exception?
>>
>> On Apr 14, 2021, at 02:36, Vibhath Ileperuma <vi...@gmail.com> wrote:
>>
>> Requested array size exceeds VM limit
>>
>>

Re: Nifi throws an error when reading a large csv file

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi Chris,

As you have mentioned, I am trying to split the large csv file in multiple
stages. But this error is thrown at the first stage, even without creating a
single flow file.
It seems like the issue is not with the processor, but with the CSV record
reader; the error is thrown while reading the csv file. I tried to write the
data in the large csv file into a Kudu table using a PutKudu processor with
the same CSV reader, and I got the same error message.

Hi Otto,

Only the following information related to the exception is available in the log file:

*2021-04-14 17:48:28,628 ERROR [Timer-Driven Process Thread-1]
o.a.nifi.processors.standard.SplitRecord
SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34]
SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] failed to process
session due to java.lang.OutOfMemoryError: Requested array size exceeds VM
limit; Processor Administratively Yielded for 1 sec:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit*

*java.lang.OutOfMemoryError: Requested array size exceeds VM limit*

*2021-04-14 17:48:28,628 WARN [Timer-Driven Process Thread-1]
o.a.n.controller.tasks.ConnectableTask Administratively Yielding
SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] due to uncaught
Exception: java.lang.OutOfMemoryError: Requested array size exceeds VM
limit*

*java.lang.OutOfMemoryError: Requested array size exceeds VM limit*

Thanks & Regards

*Vibhath Ileperuma*




On Wed, Apr 14, 2021 at 7:47 PM Otto Fowler <ot...@gmail.com> wrote:

> What is the complete stack trace of that exception?
>
> On Apr 14, 2021, at 02:36, Vibhath Ileperuma <vi...@gmail.com>
> wrote:
>
> Requested array size exceeds VM limit
>
>
>

Re: Nifi throws an error when reading a large csv file

Posted by Otto Fowler <ot...@gmail.com>.
What is the complete stack trace of that exception?

> On Apr 14, 2021, at 02:36, Vibhath Ileperuma <vi...@gmail.com> wrote:
> 
> Requested array size exceeds VM limit