Posted to issues@nifi.apache.org by "Joseph Witt (JIRA)" <ji...@apache.org> on 2017/07/02 03:46:00 UTC

[jira] [Commented] (NIFI-4146) SplitRecord does not gracefully convert medium sized CSV into individual FlowFiles

    [ https://issues.apache.org/jira/browse/NIFI-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071481#comment-16071481 ] 

Joseph Witt commented on NIFI-4146:
-----------------------------------

[~randerzander] Just as with SplitText, you cannot safely go from 150K lines straight to single-line results.  You need to do a two-phase split: the first SplitRecord splits into chunks of, say, 1000 or 500 lines, and the second phase splits each chunk down to one.

Why?  Because going from a single bundle of 150K records to 150K individual records in one step means you have roughly 150K flowfiles (metadata/references - not content) in memory at once, and that can eat up a lot of heap.  By doing the two-phase split you would never have more than about 1001 in memory at a time, for example.
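
To make that arithmetic concrete, here is a rough sketch in Python. It is not NiFi code and does not call any NiFi API; the function names and the simple counting model (one in-memory reference per flowfile in a session, counting the incoming flowfile plus its outputs) are assumptions made purely for illustration.

```python
def single_phase_peak(total_records):
    """Flowfile references held by one session when splitting
    straight to one record per flowfile (hypothetical model)."""
    # The incoming flowfile plus one output flowfile per record.
    return total_records + 1


def two_phase_peak(total_records, chunk_size):
    """Largest number of flowfile references held by any single
    session when splitting in two phases (hypothetical model)."""
    # Phase 1: the incoming flowfile plus one flowfile per chunk.
    chunks = -(-total_records // chunk_size)  # ceiling division
    phase1 = chunks + 1
    # Phase 2: one chunk flowfile plus up to chunk_size single-record outputs.
    phase2 = min(chunk_size, total_records) + 1
    return max(phase1, phase2)


print(single_phase_peak(150_000))     # 150001
print(two_phase_peak(150_000, 1000))  # 1001
```

Under this model the first phase only ever holds 151 references (the original plus 150 chunks), and each second-phase session holds at most 1001, which is where the figure above comes from.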

We do need to improve this by flushing in-flight sessions that hold lots of flowfile references to disk, but we're not there yet.  The suggested approach works well, benefits from backpressure and parallel processing, and will get you on track.

> SplitRecord does not gracefully convert medium sized CSV into individual FlowFiles
> ----------------------------------------------------------------------------------
>
>                 Key: NIFI-4146
>                 URL: https://issues.apache.org/jira/browse/NIFI-4146
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Randy Gelhausen
>         Attachments: flow.xml.gz, nifi-app.log, ubuntu.nifi-app.log
>
>
> SplitRecord fails to split a roughly 150K-line (57 MB) CSV file into individual FlowFiles.
> This could be a configuration issue, but with a build from master today, I run into problems out of the box on macOS and Linux:
> On macOS Sierra, I get a "too many open files" error (see attached nifi-app.log). On Ubuntu 17.04, I get OOMs (see attached ubuntu.nifi-app.log) and the Web UI fails.
> The CSV file I'm using is available [here|https://opendata.arcgis.com/datasets/229220ee14c147659e1049bd517c0b78_16.csv] and I've attached the flow: [^flow.xml.gz].


