Posted to commits@nifi.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2016/04/01 07:14:25 UTC

[jira] [Updated] (NIFI-1716) Implement a SplitCsv processor, possibly also a GetCSV

     [ https://issues.apache.org/jira/browse/NIFI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Goldenberg updated NIFI-1716:
------------------------------------
    Summary: Implement a SplitCsv processor, possibly also a GetCSV  (was: Implement a SplitCsv processor)

> Implement a SplitCsv processor, possibly also a GetCSV
> ------------------------------------------------------
>
>                 Key: NIFI-1716
>                 URL: https://issues.apache.org/jira/browse/NIFI-1716
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>
> I'm proposing a SplitCSV processor dedicated specifically to splitting CSV content which is assumed to be in the flowfile-content of its incoming flowfiles.
> Currently, the way to split a CSV file appears to be the SplitText processor. However, it'd be great to have a CSV splitter that reads CSV records one by one and uses the header row's column names to convert each record into a FlowFile whose attributes correspond to the headers.
> Whether or not the first row is a header should be a boolean configuration option.  In the absence of a header row, sensible default column names should be used; one possible convention is column1, column2, column3, etc.
> Another option on the splitter needs to be the delimiter character (defaulted to comma).
> Empty lines shall be skipped from processing.
> Extracted cell values shall be (optionally) whitespace-trimmed.
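The options listed so far (header flag, delimiter, empty-line skipping, trimming) can be sketched as follows. This is only an illustration in Python, not NiFi code; the function name split_csv and its parameters are hypothetical, and each returned dict stands in for the attributes of one outgoing FlowFile.

```python
import csv
import io


def split_csv(content, has_header=True, delimiter=",", trim=True):
    """Return one dict per CSV record (hypothetical helper, not a NiFi API)."""
    reader = csv.reader(io.StringIO(content, newline=""), delimiter=delimiter)
    # csv.reader yields an empty list for a completely blank line; skip those.
    rows = [row for row in reader if row]
    if trim:
        rows = [[cell.strip() for cell in row] for row in rows]
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # Fallback convention suggested in the description: column1, column2, ...
        names = [f"column{i + 1}" for i in range(len(rows[0]))]
        data = rows
    # zip() stops at the shorter sequence, so jagged rows are NOT handled here.
    return [dict(zip(names, row)) for row in data]
```

For example, split_csv("a;b\n", has_header=False, delimiter=";") would produce a single record keyed by column1 and column2.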
> Jagged rows must have some sensible handling:
> 1) For a given row, if there are fewer cells than in the header row, cells shall be assigned to columns left to right, and any missing cells are considered empty.
> 2) For a given row, if there are more cells than in the header row, a (non-fatal) error shall be generated for the row and the row shall be dropped from processing.
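The two jagged-row rules above might look like this in a minimal Python sketch. The helper name align_row and its (record, error) return convention are assumptions made for illustration only.

```python
def align_row(header, cells):
    """Apply the jagged-row rules; return (record, error), one of which is None.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    cells = list(cells)
    if len(cells) > len(header):
        # Rule 2: too many cells is a non-fatal error; the caller drops the row.
        return None, f"row has {len(cells)} cells but header has {len(header)}"
    # Rule 1: assign left to right and treat missing trailing cells as empty.
    padded = cells + [""] * (len(header) - len(cells))
    return dict(zip(header, padded)), None
```

A caller would log the error string and continue with the next row whenever the record is None.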
> As is typical with CSV, delimiter characters within quotes are treated as literal text rather than as separators.
> Elements may span multiple lines by having embedded carriage returns; such elements must be quoted.
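To illustrate the quoting behavior just described, here is a small Python sketch using the standard csv module. The wrapper name parse_rows is hypothetical; passing newline="" to the text stream is what lets the parser see line breaks inside quoted elements unchanged.

```python
import csv
import io


def parse_rows(content, delimiter=",", quotechar='"'):
    """Parse CSV text into a list of rows (hypothetical wrapper around csv.reader)."""
    # newline="" disables newline translation so embedded line breaks survive.
    stream = io.StringIO(content, newline="")
    return list(csv.reader(stream, delimiter=delimiter, quotechar=quotechar))


sample = 'id,comment\n1,"hello, world"\n2,"line one\nline two"\n'
rows = parse_rows(sample)
# rows[1][1] keeps the quoted comma; rows[2][1] keeps the embedded line break.
```

Note that the three physical data lines in the sample yield only two records, since the quoted element spans two lines.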
> NIFI-1280 asks for a way to specify which columns are to be kept or skipped. I'm proposing that instead of a separate processor, this would be implemented as a configuration option on SplitCSV (a list of 0-based indices of columns that are to be kept).
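The column-selection option could be sketched like this in Python. The name keep_columns is hypothetical, and two behaviors are assumptions not stated in the proposal: None means "keep every column", and out-of-range indices are silently ignored.

```python
def keep_columns(cells, keep_indices=None):
    """Keep only the cells at the given 0-based indices (hypothetical helper)."""
    if keep_indices is None:
        # Assumption: no configured list means all columns are kept.
        return list(cells)
    # Assumption: indices beyond the row length are ignored rather than an error.
    return [cells[i] for i in keep_indices if 0 <= i < len(cells)]
```

The same index list would be applied to the header row so that attribute names stay aligned with the retained cells.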
> It may also make sense to expose a GetCSV ingress component which would share most of its functionality with SplitCSV.  Perhaps it's easiest if users just follow a GetFile with SplitCSV; however, in some cases it makes sense to avoid reading the whole file into flowfile content and instead process all CSV data in place, within a GetCSV.
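The in-place idea behind GetCSV amounts to streaming records from the file instead of loading it whole. A rough Python sketch, assuming a UTF-8 file with a header row; the generator name iter_csv_records is hypothetical.

```python
import csv


def iter_csv_records(path, delimiter=","):
    """Yield one dict per record, reading the file lazily (hypothetical helper)."""
    # newline="" is the csv module's documented way to open CSV files.
    with open(path, newline="", encoding="utf-8") as handle:
        # DictReader takes field names from the first row and skips blank rows.
        for record in csv.DictReader(handle, delimiter=delimiter):
            yield record
```

Only one record is held in memory at a time, which is the saving the GetCSV variant is after.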



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)