Posted to commits@nifi.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2016/04/01 07:14:25 UTC

[jira] [Created] (NIFI-1716) Implement a SplitCsv processor

Dmitry Goldenberg created NIFI-1716:
---------------------------------------

Summary: Implement a SplitCsv processor
Key: NIFI-1716
URL: https://issues.apache.org/jira/browse/NIFI-1716
Project: Apache NiFi
Issue Type: New Feature
Components: Core Framework
Reporter: Dmitry Goldenberg

I'm proposing a SplitCSV processor dedicated to splitting CSV content, which is assumed to be in the content of its incoming FlowFiles.

It appears that the current way to split a CSV file is to use the SplitText processor. However, it would be great to have a CSV splitter that reads CSV records one by one and converts each record into a FlowFile, with attributes named after the header row's columns.

Whether or not the first row is a header should be a boolean configuration option. In the absence of a header row, some sensible default column names should be used; one convention could be column1, column2, column3, etc.
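
To make the header option concrete, here is a minimal sketch in Python rather than in NiFi's processor API. The names column_names and records_as_attributes are hypothetical, and each yielded dict stands in for the attributes of one outgoing FlowFile.

```python
import csv
import io

def column_names(first_row, has_header):
    """Return header names, or column1..columnN when there is no header row."""
    if has_header:
        return list(first_row)
    return [f"column{i}" for i in range(1, len(first_row) + 1)]

def records_as_attributes(text, has_header=True):
    """Yield one dict of attribute name -> cell value per CSV record."""
    rows = csv.reader(io.StringIO(text, newline=""))
    first = next(rows, None)
    if first is None:
        return  # empty input: nothing to emit
    names = column_names(first, has_header)
    if not has_header:
        # Without a header, the first row is data and must be emitted too.
        yield dict(zip(names, first))
    for row in rows:
        yield dict(zip(names, row))
```

With has_header=False, the width of the first row decides how many default names are generated; that is an assumption of this sketch, not something the proposal specifies.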

Another option on the splitter should be the delimiter character, defaulting to a comma.

Empty lines shall be skipped from processing.

Extracted cell values shall be (optionally) whitespace-trimmed.
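
The delimiter, empty-line, and trimming options could be combined roughly as follows. This is an illustrative Python sketch, not NiFi code; read_rows and its parameters are hypothetical names, and treating whitespace-only lines as empty is an assumption.

```python
import csv
import io

def read_rows(text, delimiter=",", trim=True):
    """Yield non-empty rows, optionally stripping whitespace from each cell."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    for row in reader:
        # csv.reader returns [] for a blank line and a single
        # whitespace-only cell for a line made up of spaces or tabs.
        if not row or (len(row) == 1 and row[0].strip() == ""):
            continue  # assumption: whitespace-only lines count as empty
        yield [cell.strip() for cell in row] if trim else row
```

A row such as ",," is kept, because it is a record with empty cells rather than an empty line.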

Jagged rows must have some sensible handling:
1) For a given row, if there are fewer cells than in the header row, cells shall be assigned to columns left to right, and any missing cells are considered empty.
2) For a given row, if there are more cells than in the header row, a (non-fatal) error shall be generated for the row and the row shall be dropped from processing.
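
The two jagged-row rules above can be sketched as a single helper. The name align_row is hypothetical, and returning None to signal a dropped row is just one possible convention.

```python
def align_row(row, header):
    """Pad short rows with empty strings; return None for rows that are too long."""
    if len(row) > len(header):
        # Rule 2: more cells than headers. The caller would report a
        # non-fatal error and drop the row.
        return None
    # Rule 1: assign cells left to right and treat missing cells as empty.
    padded = list(row) + [""] * (len(header) - len(row))
    return dict(zip(header, padded))
```

For example, align_row(["1"], ["a", "b"]) gives {"a": "1", "b": ""}, while a three-cell row against the same two-column header gives None.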

As is typical with CSV, delimiter characters within quotes are not treated as delimiters.

Elements may span multiple lines by having embedded carriage returns; such elements must be quoted.
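
As an illustration of the quoting rules, Python's standard csv module already behaves this way; a NiFi processor would need an equivalent parser. The sample data below is made up.

```python
import csv
import io

# One header row and one record. The record has a quoted cell containing
# the delimiter and a quoted cell spanning two physical lines.
sample = 'name,note\r\n"Smith, J.","line one\r\nline two"\r\n'

# newline="" stops the stream from translating the embedded line break.
rows = list(csv.reader(io.StringIO(sample, newline="")))

header, record = rows
print(len(rows))    # number of logical records, not physical lines
print(record[0])    # the comma inside quotes stays part of the cell
```

The point is that three physical lines yield two logical records, which is why splitting CSV on line boundaries alone can break such content.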

NIFI-1280 asks for a way to specify which columns are to be kept or skipped. I'm proposing that instead of a separate processor, this would be implemented as a configuration option on SplitCSV (a list of 0-based indices of columns that are to be kept).
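
Column selection by 0-based index could look like the following sketch. keep_columns is a hypothetical name, and silently ignoring indices beyond the end of a row is an assumption; the proposal does not say how out-of-range indices should be handled.

```python
def keep_columns(row, keep_indices):
    """Return only the cells at the given 0-based indices, in the order listed."""
    # Assumption: indices outside the row are skipped rather than reported.
    return [row[i] for i in keep_indices if 0 <= i < len(row)]
```

For example, keep_columns(["a", "b", "c"], [0, 2]) returns ["a", "c"].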

It may also make sense to expose a GetCSV ingress component that would share most of its functionality with SplitCSV. It is perhaps easiest if users just follow a GetFile with a SplitCSV; however, in some cases it makes sense to avoid reading the file into FlowFile content and instead process all CSV data in place, within a GetCSV.
