Posted to dev@crunch.apache.org by "Micah Whitacre (JIRA)" <ji...@apache.org> on 2014/04/04 04:29:14 UTC

[jira] [Commented] (CRUNCH-362) Add a CSV File Source

    [ https://issues.apache.org/jira/browse/CRUNCH-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959566#comment-13959566 ] 

Micah Whitacre commented on CRUNCH-362:
---------------------------------------

Thanks for the patch, Mac.

I'm still reviewing it, but here are a few things to fix up:

* For consistency with other sources, CSVFileSource should support List<Path> as well.
* CSVLineReader checks whether the escape character matches the quote character.  We should do that check when the source is constructed rather than in the reader, so the consumer gets the feedback before the job is submitted to the cluster.
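
To illustrate the second point, here is a minimal sketch of validating in the constructor so that a bad configuration fails on the client. CsvSourceOptions and its accessors are hypothetical names rather than classes from the patch, and the sketch assumes the rule is that the quote and escape characters must differ.

```java
// Hypothetical stand-in for the CSV settings a source would hold;
// this is not the class from the attached patch.
final class CsvSourceOptions {
    private final char quote;
    private final char escape;

    CsvSourceOptions(char quote, char escape) {
        // Fail fast on the client, before any job is planned or submitted.
        // Assumption: the rule being enforced is that the two characters differ.
        if (quote == escape) {
            throw new IllegalArgumentException(
                "The quote and escape characters must differ, but both were '" + quote + "'");
        }
        this.quote = quote;
        this.escape = escape;
    }

    char quote() { return quote; }

    char escape() { return escape; }

    public static void main(String[] args) {
        // A valid combination is accepted.
        CsvSourceOptions ok = new CsvSourceOptions('"', '\\');
        System.out.println("accepted quote=" + ok.quote() + " escape=" + ok.escape());

        // An invalid combination is rejected as soon as the object is built.
        try {
            new CsvSourceOptions('"', '"');
        } catch (IllegalArgumentException e) {
            System.out.println("rejected at construction: " + e.getMessage());
        }
    }
}
```

With the check in the constructor, the reader could presumably assume its configuration is already valid, and the consumer would see the exception while building the pipeline rather than in a failed task on the cluster.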

> Add a CSV File Source
> ---------------------
>
>                 Key: CRUNCH-362
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-362
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9.0
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Trivial
>              Labels: csv, csvparser, inputformat
>             Fix For: 0.10.0
>
>         Attachments: 0001-CRUNCH-362-Add-CSVFileSource.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> CSV files can be unpredictable. Among other quirks, it is possible for a single CSV record to span multiple lines in a file. In cases like these, TextFileSource is not effective and NLineFileSource is not flexible enough. 
> The result of this JIRA should be a CSVFileSource which, at minimum, should be able to deal with multiple-line CSV records. 


