Posted to commits@nifi.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2016/04/01 07:14:25 UTC
[jira] [Updated] (NIFI-1716) Implement a SplitCsv processor, possibly also a GetCSV
[ https://issues.apache.org/jira/browse/NIFI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Goldenberg updated NIFI-1716:
------------------------------------
Summary: Implement a SplitCsv processor, possibly also a GetCSV (was: Implement a SplitCsv processor)
> Implement a SplitCsv processor, possibly also a GetCSV
> ------------------------------------------------------
>
> Key: NIFI-1716
> URL: https://issues.apache.org/jira/browse/NIFI-1716
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Dmitry Goldenberg
>
> I'm proposing a SplitCSV processor dedicated specifically to splitting CSV content which is assumed to be in the flowfile-content of its incoming flowfiles.
> It appears that the current way to split a CSV file is to use the SplitText processor. However, it'd be great to have a CSV splitter that reads CSV records one by one and converts each record into a FlowFile, with attributes named after the columns in the header row.
> Whether or not the first row is a header should be a boolean configuration option. In the absence of a header row, some sensible default column names should be used; one possible convention is column1, column2, column3, etc.
> Another option on the splitter needs to be the delimiter character (defaulting to a comma).
> Empty lines shall be skipped from processing.
> Extracted cell values shall be (optionally) whitespace-trimmed.
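
The options described so far (header flag, default column names, delimiter, blank-line skipping, optional trimming) can be sketched in a few lines. This is only an illustration in Python using the standard csv module, not NiFi code; the function name split_csv and its parameters are hypothetical, and rows are assumed to have the same number of cells as the header.

```python
import csv
import io


def split_csv(content, has_header=True, delimiter=",", trim=True):
    """Yield one dict (column name -> cell value) per CSV record.

    Hypothetical sketch: names and defaults are assumptions, not a NiFi API.
    """
    reader = csv.reader(io.StringIO(content, newline=""), delimiter=delimiter)
    header = None
    for row in reader:
        if not row:
            # csv.reader returns an empty list for a blank line; skip it.
            continue
        if trim:
            row = [cell.strip() for cell in row]
        if has_header and header is None:
            # The first non-empty row supplies the attribute names.
            header = row
            continue
        if header is not None:
            names = header
        else:
            # No header row: fall back to column1, column2, ...
            names = [f"column{i}" for i in range(1, len(row) + 1)]
        yield dict(zip(names, row))


records = list(split_csv("name, age\n\nAlice, 30\nBob,41\n"))
```

Each yielded dict stands in for the attributes of one outgoing FlowFile.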
> Jagged rows must have some sensible handling:
> 1) For a given row, if there are fewer cells than in the header row, cells shall be assigned to columns left to right, and any missing cells are considered empty.
> 2) For a given row, if there are more cells than in the header row, a (non-fatal) error shall be generated for the row and the row shall be dropped from processing.
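
The two jagged-row rules above might look roughly like this. It is a Python sketch with a hypothetical helper name (align_rows); the wording of the error message is an assumption, and a real processor would presumably log the error rather than collect it in a list.

```python
def align_rows(header, rows):
    """Apply the two jagged-row rules to already-parsed rows.

    Returns (records, errors). Hypothetical helper, for illustration only.
    """
    records, errors = [], []
    for row_no, row in enumerate(rows, start=1):
        if len(row) > len(header):
            # Rule 2: too many cells is a non-fatal error; drop the row.
            errors.append(
                f"row {row_no}: expected at most {len(header)} cells, got {len(row)}"
            )
            continue
        # Rule 1: assign cells left to right; missing cells become empty.
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records, errors


records, errors = align_rows(
    ["a", "b", "c"],
    [["1", "2", "3"], ["4"], ["5", "6", "7", "8"]],
)
```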
> As is typical for CSV, delimiter characters that appear within quotes are not treated as delimiters.
> Elements may span multiple lines by having embedded carriage returns; such elements must be quoted.
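
To illustrate the quoting rules, here is a small Python example using the standard csv module (again just a sketch, not the proposed processor; parse_records and the sample data are made up for illustration). The embedded line break is written as "\n" here.

```python
import csv
import io


def parse_records(text, delimiter=","):
    """Parse CSV text into a list of rows, honoring quoted fields."""
    # newline="" keeps line breaks inside quoted fields intact.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


sample = 'id,comment\n1,"Hello, world"\n2,"line one\nline two"\n'
rows = parse_records(sample)
```

The comma inside "Hello, world" stays part of the cell, and the last record keeps its two-line value as a single element even though it spans two physical lines.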
> NIFI-1280 asks for a way to specify which columns are to be kept or skipped. I'm proposing that instead of a separate processor, this would be implemented as a configuration option on SplitCSV (a list of 0-based indices of columns that are to be kept).
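
The column-selection option could be as simple as the following Python sketch. The helper name keep_columns is hypothetical, and two behaviors are assumptions not stated in this issue: leaving the option unset keeps every column, and indices beyond the end of a row are silently ignored.

```python
def keep_columns(row, keep_indices=None):
    """Return only the cells at the given 0-based indices.

    Assumptions: None keeps every column; out-of-range indices are ignored.
    """
    if keep_indices is None:
        return list(row)
    return [row[i] for i in keep_indices if 0 <= i < len(row)]


selected = keep_columns(["a", "b", "c", "d"], [0, 2])
```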
> It may also make sense to expose a GetCSV ingress component which would share most of its functionality with SplitCSV. Perhaps it's easiest if users just follow a GetFile with SplitCSV; however, in some cases it may make sense to avoid reading the whole file into flowfile content and instead process all CSV data in place, within a GetCSV.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)