Posted to issues@commons.apache.org by "Matt Sun (JIRA)" <ji...@apache.org> on 2017/02/13 18:17:41 UTC

[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

    [ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864138#comment-15864138 ] 

Matt Sun commented on CSV-196:
------------------------------

I just changed the title because I realized that the problem is more complicated than I first thought. Delimiter information might not be the only information missing for downstream users. The scenario is the same as described: the Commons CSV library is used with the Hadoop library as an input format (CSV). To support splitting the input of Hadoop jobs, the program needs to know how much of the input file has been read, so that it can stay within the split boundary. To leverage the capabilities of CSVParser, we usually want to enable "ignoreSurroundingSpaces" and "trim", and also to handle the "encapsulator". Thus, for a CSV line like A,     "    B    "   , C the parser gives us back A, B and C. To the Java program it looks as if only three characters were read from the input file, which is not true.

So what I'm now proposing:
use another StringBuilder in the token to store the raw data read for that token, including spaces and encapsulators.
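
To make the idea concrete, here is a minimal sketch. It is not the actual Commons CSV Lexer or Token: RawTokenSketch, its Token class and tokenize() are hypothetical names used only for illustration, and the sketch handles a single line with ',' as delimiter and '"' as encapsulator, with no escaped quotes and no multi-line fields. It only shows how a second StringBuilder can record every consumed character while the parsed value is trimmed and unquoted as before.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; NOT the Commons CSV Lexer/Token. */
final class RawTokenSketch {

    /** A parsed field together with the exact characters consumed to produce it. */
    static final class Token {
        final String content;  // value the caller sees (unquoted, trimmed)
        final String raw;      // every character consumed, including spaces, quotes and the trailing delimiter
        final boolean quoted;  // true if the field was enclosed by the encapsulator

        Token(String content, String raw, boolean quoted) {
            this.content = content;
            this.raw = raw;
            this.quoted = quoted;
        }
    }

    /** Tokenizes one line. Assumes ',' delimiter, '"' encapsulator, no escaped quotes. */
    static List<Token> tokenize(String line) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder content = new StringBuilder();
        StringBuilder raw = new StringBuilder(); // the extra buffer proposed above
        boolean inQuotes = false;
        boolean quoted = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            raw.append(c); // record every consumed character
            if (c == '"') {
                if (!quoted) {
                    content.setLength(0); // discard spaces seen before the opening quote
                    quoted = true;
                }
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // The delimiter is counted in the raw text of the token it terminates.
                tokens.add(new Token(content.toString().trim(), raw.toString(), quoted));
                content.setLength(0);
                raw.setLength(0);
                quoted = false;
            } else if (inQuotes || !quoted) {
                // Characters after a closing quote are surrounding space and are skipped.
                content.append(c);
            }
        }
        tokens.add(new Token(content.toString().trim(), raw.toString(), quoted));
        return tokens;
    }

    public static void main(String[] args) {
        String line = "A,     \"    B    \"   , C";
        for (Token t : tokenize(line)) {
            System.out.println("[" + t.content + "] raw=[" + t.raw + "] quoted=" + t.quoted);
        }
    }
}
```

With this shape, the raw lengths of all tokens on a line add up to the length of the line, so a caller could track its position in the input. For a byte offset (which is what a Hadoop split boundary needs), the raw text would still have to be encoded with the file's charset; that step is left out here.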

Maintainers, what do you think?

> Store the information of raw data read by lexer
> -----------------------------------------------
>
>                 Key: CSV-196
>                 URL: https://issues.apache.org/jira/browse/CSV-196
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Matt Sun
>              Labels: easyfix, features, patch
>             Fix For: Patch Needed
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good for the CSVParser class to store whether a field was enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1, which is helpful because it parsed the double quotes, but we also lose the information about the original data at the same time. We can't tell from the returned CSVRecord whether the original data was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one kind of input for Hadoop jobs, which should support splitting input data. To accurately split a CSV file into pieces, we need to count the bytes of data CSVParser actually read. CSVParser doesn't have accurate information about whether a field was enclosed by quotes, nor does it store the raw data of the original source. Downstream users of the Commons CSV parser are not able to get that information.
> Suggested fix: extend the token/CSVRecord with a boolean field indicating whether the column was enclosed by quotes. While the Lexer is doing getNextToken, set the flag if a field is encapsulated and successfully parsed.
> I found another issue reporting a similar request, but it was marked as resolved: [CSV-91] https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22


