Posted to mapreduce-issues@hadoop.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2010/12/03 05:00:12 UTC

[jira] Created: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Flexible CSV text parser InputFormat
------------------------------------

                 Key: MAPREDUCE-2208
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
            Reporter: Lance Norskog
            Priority: Trivial


CSVTextInputFormat is a configurable CSV parser tuned to most of the csv-style datasets I've found.

Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.

This is compiled against hadoop-0.0.20.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068997#comment-13068997 ] 

XiaoboGu commented on MAPREDUCE-2208:
-------------------------------------

How do you handle a CSV file header, or is that not supported?

> Flexible CSV text parser InputFormat
> ------------------------------------
>
>                 Key: MAPREDUCE-2208
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Trivial
>         Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java
>
>
> CSVTextInputFormat is a configurable CSV parser tuned to most of the csv-style datasets I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>. They drop the LongWritable key and parse the Text value as a CSV line. But they are all custom-coded for the format.
> CSVTextInputFormat takes any csv-encoded file and rearranges the fields into the format required by a Mapper. You can drop fields and rearrange them. There is also a random sampling option to make training/test runs easier.
> Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
> This is compiled against hadoop-0.0.20.


[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Maksym Kovalenko (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136680#comment-13136680 ] 

Maksym Kovalenko commented on MAPREDUCE-2208:
---------------------------------------------

So what regex would one need to specify to parse "normal" CSV that uses a comma as the delimiter and happens to have a comma in one of the values, for example:

value1,value2,"more,complex,with,commas,value3"

Just providing "," as pattern1 will no longer work, as it will produce 7 columns for the above case instead of 3.

Also consider the use case where a value contains a double quote. According to CSV escaping rules, it has to be escaped by another double quote, for example:

column1,"thank you, ""User"" for the report, again, thank you",column3

Considering the above two cases, what value should I provide for pattern1?

I think configuring CSVTextInputFormat would be more natural if, instead of patterns, one provided a delimiter character (comma by default) and a quote character (double quote by default). Then other users and I won't have to struggle with possible regex patterns (see my questions above; I'm still curious whether you can come up with one).
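
To make the delimiter/quote idea concrete, here is a minimal sketch of a splitter driven by those two characters. The class and method names (QuotedCsvSplitter, split) are hypothetical and are not taken from the attached CSVTextInputFormat.java; the sketch handles a single line only and assumes quoted values do not span lines.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class QuotedCsvSplitter {
    private QuotedCsvSplitter() {}

    /** Splits one line on the delimiter, honoring the quote character and doubled-quote escapes. */
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        field.append(quote); // doubled quote inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    field.append(c);         // delimiters inside quotes are plain text
                }
            } else if (c == quote) {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());        // last field has no trailing delimiter
        return fields;
    }
}
```

Called with ',' and '"', this yields 3 fields for both example lines above, with the surrounding quotes removed and the doubled quotes collapsed.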

Another benefit is that from the delimiter and quote characters you can create any regexes you need (if you want to stick to the current implementation). By the way, right now there is some fragility in the implementation where you prepend the user-provided regex with a "\\". This will break when the user-supplied pattern itself starts with "\\".
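
One way to avoid hand-escaping, sketched under the assumption that the user supplies a literal delimiter string rather than a regex: let java.util.regex.Pattern.quote do the escaping. The class and method names (DelimiterPatterns, forDelimiter) are hypothetical; how the attached code actually builds its patterns is not shown here.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class DelimiterPatterns {
    private DelimiterPatterns() {}

    /** Builds a pattern that matches the delimiter literally, whatever characters it contains. */
    static Pattern forDelimiter(String delimiter) {
        // Pattern.quote wraps the text so that regex metacharacters,
        // including a leading backslash, lose their special meaning.
        return Pattern.compile(Pattern.quote(delimiter));
    }
}
```

With this, delimiters such as "|", "." or a single backslash can be passed as-is, and no assumption is made about what the string starts with.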
                

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069451#comment-13069451 ] 

Lance Norskog commented on MAPREDUCE-2208:
------------------------------------------

Hadoop assumes that it will process several files of the same format. Will every CSV file have the same header? If you split a giant CSV file into many pieces, will you reproduce the header line in the 2nd through Nth files?

Hadoop jobs are generally configured with total knowledge of the data. The mappers are hard-coded for the input formats.

The code could include a rule for how to decide that the first line is a header and skip over it. That would be worth adding.
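
A minimal sketch of one such rule, assuming the expected header text is known from the job configuration and that the byte offset where the current split starts is available (in Hadoop that would presumably come from FileSplit.getStart()). The names HeaderRule and dataLines are hypothetical and are not in the attached code.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class HeaderRule {
    private HeaderRule() {}

    /**
     * Returns the lines to parse. The first line is dropped only when the
     * split starts at the beginning of a file and that line equals the
     * configured header, so later splits of the same file are untouched.
     */
    static List<String> dataLines(List<String> lines, long splitStart, String header) {
        boolean atFileStart = splitStart == 0;
        if (atFileStart && !lines.isEmpty() && lines.get(0).equals(header)) {
            return lines.subList(1, lines.size());
        }
        return lines;
    }
}
```

Because the check is per split, the same rule would cover both one large file with a single header and many files that each repeat the header.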


[jira] Commented: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970370#action_12970370 ] 

Lance Norskog commented on MAPREDUCE-2208:
------------------------------------------

Another use case: one Wikipedia format is:
{code}
1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
{code}
which would read in as:
{code}
1: 1664968
2: 3 
2: 747213 
2: 1664968
etc.
{code}
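
A rough sketch of that expansion, assuming each input line has the form "key: value value ..." with whitespace-separated values. The names AdjacencyExpander and expand are made up for this illustration and do not come from the attached code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class AdjacencyExpander {
    private AdjacencyExpander() {}

    /** Turns "2: 3 747213" into ["2: 3", "2: 747213"]; lines without a colon yield nothing. */
    static List<String> expand(String line) {
        List<String> out = new ArrayList<>();
        int colon = line.indexOf(':');
        if (colon < 0) {
            return out; // not in the assumed "key: values" form
        }
        String key = line.substring(0, colon).trim();
        for (String value : line.substring(colon + 1).trim().split("\\s+")) {
            if (!value.isEmpty()) {
                out.add(key + ": " + value); // one output record per value
            }
        }
        return out;
    }
}
```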





[jira] Updated: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAPREDUCE-2208:
-------------------------------------

    Attachment: CSVTextInputFormat.java
                TestCSVTextFormat.java



[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Harsh J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136765#comment-13136765 ] 

Harsh J commented on MAPREDUCE-2208:
------------------------------------

I'd suggest reusing OpenCSV instead, if possible. I believe the
license is compatible, and it is well maintained.


-- 
Harsh J

                

[jira] Commented: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966669#action_12966669 ] 

Allen Wittenauer commented on MAPREDUCE-2208:
---------------------------------------------

Any chance this could get changed to CombineFile/MultiFile instead?



[jira] Updated: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAPREDUCE-2208:
-------------------------------------

    Description: 
CSVTextInputFormat is a configurable CSV parser tuned to most of the csv-style datasets I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>. They drop the LongWritable key and parse the Text value as a CSV line. But they are all custom-coded for the format.

CSVTextInputFormat takes any csv-encoded file and rearranges the fields into the format required by a Mapper. You can drop fields and rearrange them. There is also a random sampling option to make training/test runs easier.

Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.

This is compiled against hadoop-0.0.20.



  was:
CSVTextInputFormat is a configurable CSV parser tuned to most of the csv-style datasets I've found.

Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.

This is compiled against hadoop-0.0.20.





[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069490#comment-13069490 ] 

XiaoboGu commented on MAPREDUCE-2208:
-------------------------------------

There are two scenarios:
1. A single huge CSV file with a header.
2. Many medium-sized CSV files with the same format and header.


[jira] Commented: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967080#action_12967080 ] 

Lance Norskog commented on MAPREDUCE-2208:
------------------------------------------

Artfully phrased. Ah, the virtues of the passive voice.

I only learned enough file I/O to make this work, and I only work with small development datasets, not production. So, no, it never occurred to me that it would need more to support multi-file directories. This is in hadoop-20.0.2. I work in Mahout, not Hadoop. I'm not upgrading Hadoop until Mahout makes me.

How did you envision this modification? It looks like the RecordReader would have to be public and would need a constructor matching this line:

org.apache.hadoop.mapred.lib.CombineFileRecordReader<K, V>:144
      curReader =  rrConstructor.newInstance(new Object [] 
                            {split, jc, reporter, Integer.valueOf(idx)});
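
To illustrate what "a constructor matching this line" would mean, here is a self-contained sketch of the same reflective pattern. Split, Conf and Reporter are empty stand-ins for the first three arguments in the quoted call, and CsvChunkReader is a hypothetical reader; none of these are real Hadoop classes, and the real argument types should be checked against the Hadoop source.

```java
import java.lang.reflect.Constructor;

/** Illustration only: stand-in types, not the real Hadoop classes. */
final class ReflectiveReaderSketch {
    private ReflectiveReaderSketch() {}

    static final class Split {}     // stand-in for the split argument
    static final class Conf {}      // stand-in for the job configuration argument
    static final class Reporter {}  // stand-in for the reporter argument

    /** Hypothetical reader exposing the four-argument constructor the caller looks up. */
    public static final class CsvChunkReader {
        final int index;

        public CsvChunkReader(Split split, Conf conf, Reporter reporter, Integer index) {
            this.index = index; // which chunk of the combined split this reader handles
        }
    }

    /** Looks the constructor up by signature and invokes it, as in the quoted line. */
    static CsvChunkReader open(int idx) {
        try {
            Constructor<CsvChunkReader> ctor = CsvChunkReader.class.getDeclaredConstructor(
                    Split.class, Conf.class, Reporter.class, Integer.class);
            return ctor.newInstance(new Object[] {
                    new Split(), new Conf(), new Reporter(), Integer.valueOf(idx)});
        } catch (ReflectiveOperationException e) {
            // A reader without the expected constructor fails here, at run time.
            throw new IllegalStateException("no matching constructor", e);
        }
    }
}
```

The point of the sketch is that the lookup is by exact parameter types, so a reader whose constructor differs even slightly would only be rejected when the job runs.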








