Posted to dev@crunch.apache.org by Christian Tzolov <ch...@gmail.com> on 2013/03/18 11:14:29 UTC

RFC 4180 compliant CSV format

Hi,

I am working on ETL projects that consume and produce data in the RFC4180
[1] CSV format. Although unreliable IMO, this RFC is used as an exchange
format by several Dutch government agencies.

The RFC4180 spec supports multi-line fields (i.e. fields with line breaks)
and escaping of double quotes and delimiters within fields. Because of the
multi-line feature, one can't directly use the
FileInputFormat/TextInputFormat or LineRecordReader implementations.
Furthermore, as I see it, input splitting must be disabled (I'm not sure
whether any efficient splitting strategy is possible at all).
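
To make the problem concrete, here is a minimal, self-contained sketch of
the state a reader has to carry across physical lines (the class is made up
for illustration; it is not Crunch or Hadoop code): a record is complete
only once the reader is no longer inside an open quoted field. In a Hadoop
FileInputFormat one would additionally override isSplitable(...) to return
false, since a split boundary could otherwise land in the middle of a
quoted field.

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.StringReader;
  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical helper, for illustration only.
  public class MultiLineCsvReader {

    // Joins physical lines into logical RFC4180 records. A record stays
    // open as long as an odd number of double quotes has been seen, i.e. we
    // are still inside a quoted field; escaped quotes ("") toggle twice and
    // so leave the state unchanged.
    public static List<String> readRecords(BufferedReader in) throws IOException {
      List<String> records = new ArrayList<String>();
      StringBuilder current = new StringBuilder();
      boolean inQuotes = false;
      String line;
      while ((line = in.readLine()) != null) {
        if (current.length() > 0) {
          current.append("\r\n");  // the line break is part of the quoted field
        }
        current.append(line);
        for (int i = 0; i < line.length(); i++) {
          if (line.charAt(i) == '"') {
            inQuotes = !inQuotes;
          }
        }
        if (!inQuotes) {
          records.add(current.toString());
          current.setLength(0);
        }
      }
      return records;
    }

    public static void main(String[] args) throws IOException {
      String csv = "id,comment\r\n1,\"line one\r\nline two\"\r\n2,plain\r\n";
      for (String record : readRecords(new BufferedReader(new StringReader(csv)))) {
        System.out.println("<" + record + ">");
      }
    }
  }

The point is only that the record boundary depends on parser state, which
is why LineRecordReader (one record per physical line) can't be reused
as-is.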

There are several Java libraries that provide some RFC4180 support [3]. For
Pig, a slightly modified CSVExcelStorage UDF [2] seems to do the job (I'm
not sure about the input splitting, though). The "Hadoop in Practice"
example [4] also does not support multi-line fields.

Has anyone used similar 'multi-line fields' formats? I wonder how common
this use case is.

Also shall we provide support for it in Crunch?

Cheers,
Chris

[1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
[2]  Pig CSVExcelStorage UDF -
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
[3]  jCSV, OpenCSV, SuperCSV
[4]
https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java

Re: RFC 4180 compliant CSV format

Posted by Josh Wills <jw...@cloudera.com>.
On Mon, Mar 18, 2013 at 10:07 AM, Matthias Friedrich <ma...@mafr.de> wrote:

> On Monday, 2013-03-18, Josh Wills wrote:
> > I personally try to steer people away from multi-line input formats
> > b/c of how tedious they are to write/maintain.
>
> Same here.
>
> > To me, the question of supporting
> > CSVs maps to a more general question about whether we should support some
> > kind of named Record/Row type for processing data from
> > CSV/Hive/Avro/PB/Thrift/etc. in a generic way. I could make arguments
> > either way, which I'm happy to do if folks are interested, but I'd rather
> > hear from other people first, esp. if anyone feels strongly about it.
>
> I have used something like it in aggregation and machine learning
> systems and I've grown quite fond of it. It is basically a HashMap that
> is partially immutable - once you add a value you can't change it
> anymore. You can structure your system as a sequence of rules that
> each adds fields to the record. This is quite flexible, you can work
> with changing schemas and different sets of rules easily.
>

I've been noodling on such a system for some ML tools I'm writing on top of
Crunch. I'll be happy to import the code (or whatever pieces of it seem
generally useful) if there's interest. I'm not quite ready to release it,
but I'll ping the dev list when it's published.


>
> Regards,
>   Matthias
>
> >
> > On Mon, Mar 18, 2013 at 3:14 AM, Christian Tzolov <
> > christian.tzolov@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am working on ETL projects that consume and produce data in the
> > > RFC4180 [1] CSV format. Although unreliable IMO, this RFC is used as
> > > an exchange format by several Dutch government agencies.
> > >
> > > The RFC4180 spec supports multi-line fields (i.e. fields with line
> > > breaks) and escaping of double quotes and delimiters within fields.
> > > Because of the multi-line feature, one can't directly use the
> > > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > > Furthermore, as I see it, input splitting must be disabled (I'm not
> > > sure whether any efficient splitting strategy is possible at all).
> > >
> > > There are several Java libraries that provide some RFC4180 support
> > > [3]. For Pig, a slightly modified CSVExcelStorage UDF [2] seems to do
> > > the job (I'm not sure about the input splitting, though). The "Hadoop
> > > in Practice" example [4] also does not support multi-line fields.
> > >
> > > Has anyone used similar 'multi-line fields' formats? I wonder how
> > > common this use case is.
> > >
> > > Also shall we provide support for it in Crunch?
> > >
> > > Cheers,
> > > Chris
> > >
> > > [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> > > [2]  Pig CSVExcelStorage UDF -
> > > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > > [3]  jCSV, OpenCSV, SuperCSV
> > > [4]
> > > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> > >
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: RFC 4180 compliant CSV format

Posted by Matthias Friedrich <ma...@mafr.de>.
On Monday, 2013-03-18, Josh Wills wrote:
> I personally try to steer people away from multi-line input formats b/c of
> how tedious they are to write/maintain.

Same here.

> To me, the question of supporting
> CSVs maps to a more general question about whether we should support some
> kind of named Record/Row type for processing data from
> CSV/Hive/Avro/PB/Thrift/etc. in a generic way. I could make arguments
> either way, which I'm happy to do if folks are interested, but I'd rather
> hear from other people first, esp. if anyone feels strongly about it.

I have used something like it in aggregation and machine learning
systems and I've grown quite fond of it. It is basically a HashMap that
is partially immutable - once you add a value you can't change it
anymore. You can structure your system as a sequence of rules that
each adds fields to the record. This is quite flexible, you can work
with changing schemas and different sets of rules easily.
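
A stripped-down sketch of what I mean (the names are made up, this is not
code from an existing library):

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical write-once record: fields can be added but never changed.
  public class WriteOnceRecord {

    private final Map<String, Object> fields = new HashMap<String, Object>();

    public void set(String name, Object value) {
      if (fields.containsKey(name)) {
        throw new IllegalStateException("Field already set: " + name);
      }
      fields.put(name, value);
    }

    public Object get(String name) {
      return fields.get(name);
    }

    public boolean has(String name) {
      return fields.containsKey(name);
    }
  }

  // Each processing step is a rule that reads some fields and adds new ones.
  interface Rule {
    void apply(WriteOnceRecord record);
  }

Swapping rule sets in and out then doesn't require committing to a fixed
schema up front.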

Regards,
  Matthias

> 
> On Mon, Mar 18, 2013 at 3:14 AM, Christian Tzolov <
> christian.tzolov@gmail.com> wrote:
> 
> > Hi,
> >
> > I am working on ETL projects that consume and produce data in the RFC4180
> > [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> > format by several Dutch government agencies.
> >
> > The RFC4180 spec supports multi-line fields (i.e. fields with line
> > breaks) and escaping of double quotes and delimiters within fields.
> > Because of the multi-line feature, one can't directly use the
> > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > Furthermore, as I see it, input splitting must be disabled (I'm not
> > sure whether any efficient splitting strategy is possible at all).
> >
> > There are several Java libraries that provide some RFC4180 support
> > [3]. For Pig, a slightly modified CSVExcelStorage UDF [2] seems to do
> > the job (I'm not sure about the input splitting, though). The "Hadoop
> > in Practice" example [4] also does not support multi-line fields.
> >
> > Has anyone used similar 'multi-line fields' formats? I wonder how
> > common this use case is.
> >
> > Also shall we provide support for it in Crunch?
> >
> > Cheers,
> > Chris
> >
> > [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> > [2]  Pig CSVExcelStorage UDF -
> > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > [3]  jCSV, OpenCSV, SuperCSV
> > [4]
> > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> >
> 
> 
> 
> -- 
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: RFC 4180 compliant CSV format

Posted by Josh Wills <jw...@cloudera.com>.
I personally try to steer people away from multi-line input formats b/c of
how tedious they are to write/maintain. To me, the question of supporting
CSVs maps to a more general question about whether we should support some
kind of named Record/Row type for processing data from
CSV/Hive/Avro/PB/Thrift/etc. in a generic way. I could make arguments
either way, which I'm happy to do if folks are interested, but I'd rather
hear from other people first, esp. if anyone feels strongly about it.
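
Roughly, I mean something along these lines (just a sketch to make the idea
concrete; none of these names exist in Crunch today):

  // Hypothetical: a minimal named record/row abstraction that CSV, Hive,
  // Avro, protobuf, and Thrift inputs could all be adapted to.
  public interface Record {
    int size();                      // number of fields
    String getFieldName(int index);  // field name by position
    Object get(String fieldName);    // untyped access by name
    String getString(String fieldName);
    Long getLong(String fieldName);
    Double getDouble(String fieldName);
  }

A CSV source could back this with a String[] plus a header row, an Avro
source with a GenericRecord, and so on; the processing code would only ever
see field names and values.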

J


On Mon, Mar 18, 2013 at 3:14 AM, Christian Tzolov <
christian.tzolov@gmail.com> wrote:

> Hi,
>
> I am working on ETL projects that consume and produce data in the RFC4180
> [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> format by several Dutch government agencies.
>
> The RFC4180 spec supports multi-line fields (i.e. fields with line
> breaks) and escaping of double quotes and delimiters within fields.
> Because of the multi-line feature, one can't directly use the
> FileInputFormat/TextInputFormat or LineRecordReader implementations.
> Furthermore, as I see it, input splitting must be disabled (I'm not sure
> whether any efficient splitting strategy is possible at all).
>
> There are several Java libraries that provide some RFC4180 support [3].
> For Pig, a slightly modified CSVExcelStorage UDF [2] seems to do the job
> (I'm not sure about the input splitting, though). The "Hadoop in
> Practice" example [4] also does not support multi-line fields.
>
> Has anyone used similar 'multi-line fields' formats? I wonder how common
> this use case is.
>
> Also shall we provide support for it in Crunch?
>
> Cheers,
> Chris
>
> [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> [2]  Pig CSVExcelStorage UDF -
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> [3]  jCSV, OpenCSV, SuperCSV
> [4]
> https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: RFC 4180 compliant CSV format

Posted by Josh Wills <jw...@cloudera.com>.
Inlined.

On Tue, Mar 19, 2013 at 6:54 AM, Christian Tzolov <
christian.tzolov@gmail.com> wrote:

> @Josh, most of the time I can manage to steer away from multi-line
> records, but with government organisations it is difficult to alter what
> they consider a 'standard'.


> Can you please elaborate on your idea for named records/rows?
>

Yeah, I posted a library of Crunch-based tools for machine learning that
I've been working on for the past couple of months:

https://github.com/cloudera/ml

The core module defines a Record interface that should eventually support
working with Avro records, HCatalog records, CSV files, and even Vectors:
anything that can be made to look and feel like a typed tuple of values.
The parallel module defines the associated PTypes for the various
implementations. I don't yet have the API sophistication that Matthias
mentioned (in terms of evolving immutable objects), but that is the
direction I expect to go in.
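
As a rough illustration of how I'd like that to look from the user's side
(the Record type and its PType below are hypothetical, not the API that is
in the repo today), the processing code stays the same no matter which
format the records came from:

  import org.apache.crunch.DoFn;
  import org.apache.crunch.Emitter;
  import org.apache.crunch.PCollection;
  import org.apache.crunch.types.writable.Writables;

  public class RecordExample {

    // Minimal stand-in for the hypothetical named-record interface.
    public interface Record {
      String getString(String fieldName);
    }

    // Projects one named field out of a generic record; the DoFn refers to
    // field names only, never to the underlying storage format.
    public static PCollection<String> extractNames(PCollection<Record> records) {
      return records.parallelDo(new DoFn<Record, String>() {
        @Override
        public void process(Record input, Emitter<String> emitter) {
          emitter.emit(input.getString("name"));
        }
      }, Writables.strings());
    }
  }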

J


> @Harsh, thanks for the references. I remember I had some issues with
> OpenCSV (either the iterator support or some RFC4180 limitations). But I
> will check the other sources.
>
> Thanks,
> Chris
>
>
>
> On Tue, Mar 19, 2013 at 12:44 AM, Harsh J <ha...@cloudera.com> wrote:
>
> > Does OpenCSV (http://opencsv.sourceforge.net/#what-features) support
> > your format? There's a Hive wrapper for it:
> > http://ogrodnek.github.com/csv-serde and IIRC also a newer InputFormat
> > at https://github.com/mvallebr/CSVInputFormat (via
> > https://issues.apache.org/jira/browse/MAPREDUCE-2208).
> >
> > On Mon, Mar 18, 2013 at 3:44 PM, Christian Tzolov
> > <ch...@gmail.com> wrote:
> > > Hi,
> > >
> > > I am working on ETL projects that consume and produce data in the
> > > RFC4180 [1] CSV format. Although unreliable IMO, this RFC is used as
> > > an exchange format by several Dutch government agencies.
> > >
> > > The RFC4180 spec supports multi-line fields (i.e. fields with line
> > > breaks) and escaping of double quotes and delimiters within fields.
> > > Because of the multi-line feature, one can't directly use the
> > > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > > Furthermore, as I see it, input splitting must be disabled (I'm not
> > > sure whether any efficient splitting strategy is possible at all).
> > >
> > > There are several Java libraries that provide some RFC4180 support
> > > [3]. For Pig, a slightly modified CSVExcelStorage UDF [2] seems to do
> > > the job (I'm not sure about the input splitting, though). The "Hadoop
> > > in Practice" example [4] also does not support multi-line fields.
> > >
> > > Has anyone used similar 'multi-line fields' formats? I wonder how
> > > common this use case is.
> > >
> > > Also shall we provide support for it in Crunch?
> > >
> > > Cheers,
> > > Chris
> > >
> > > [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> > > [2]  Pig CSVExcelStorage UDF -
> > > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > > [3]  jCSV, OpenCSV, SuperCSV
> > > [4]
> > > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> >
> >
> >
> > --
> > Harsh J
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: RFC 4180 compliant CSV format

Posted by Christian Tzolov <ch...@gmail.com>.
@Josh, most of the time I can manage to steer away from multi-line records,
but with government organisations it is difficult to alter what they
consider a 'standard'.

Can you please elaborate on your idea for named records/rows?

@Harsh, thanks for the references. I remember I had some issues with
OpenCSV (either the iterator support or some RFC4180 limitations). But I
will check the other sources.

Thanks,
Chris



On Tue, Mar 19, 2013 at 12:44 AM, Harsh J <ha...@cloudera.com> wrote:

> Does OpenCSV (http://opencsv.sourceforge.net/#what-features) support
> your format? There's a Hive wrapper for it:
> http://ogrodnek.github.com/csv-serde and IIRC also a newer InputFormat
> at https://github.com/mvallebr/CSVInputFormat (via
> https://issues.apache.org/jira/browse/MAPREDUCE-2208).
>
> On Mon, Mar 18, 2013 at 3:44 PM, Christian Tzolov
> <ch...@gmail.com> wrote:
> > Hi,
> >
> > I am working on ETL projects that consume and produce data in the RFC4180
> > [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> > format by several Dutch government agencies.
> >
> > The RFC4180 spec supports multi-line fields (i.e. fields with line
> > breaks) and escaping of double quotes and delimiters within fields.
> > Because of the multi-line feature, one can't directly use the
> > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > Furthermore, as I see it, input splitting must be disabled (I'm not
> > sure whether any efficient splitting strategy is possible at all).
> >
> > There are several Java libraries that provide some RFC4180 support
> > [3]. For Pig, a slightly modified CSVExcelStorage UDF [2] seems to do
> > the job (I'm not sure about the input splitting, though). The "Hadoop
> > in Practice" example [4] also does not support multi-line fields.
> >
> > Has anyone used similar 'multi-line fields' formats? I wonder how
> > common this use case is.
> >
> > Also shall we provide support for it in Crunch?
> >
> > Cheers,
> > Chris
> >
> > [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> > [2]  Pig CSVExcelStorage UDF -
> > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > [3]  jCSV, OpenCSV, SuperCSV
> > [4]
> > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
>
>
>
> --
> Harsh J
>

Re: RFC 4180 compliant CSV format

Posted by Harsh J <ha...@cloudera.com>.
Does OpenCSV (http://opencsv.sourceforge.net/#what-features) support
your format? There's a Hive wrapper for it:
http://ogrodnek.github.com/csv-serde and IIRC also a newer InputFormat
at https://github.com/mvallebr/CSVInputFormat (via
https://issues.apache.org/jira/browse/MAPREDUCE-2208).
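
For a quick local check of whether OpenCSV copes with the embedded line
breaks, something like this should do (assuming the au.com.bytecode.opencsv
package of OpenCSV 2.x; adjust the import if you are on a different
version):

  import java.io.IOException;
  import java.io.StringReader;

  import au.com.bytecode.opencsv.CSVReader;

  public class OpenCsvCheck {
    public static void main(String[] args) throws IOException {
      // One record whose second field contains an embedded CRLF.
      String csv = "id,comment\r\n1,\"first line\r\nsecond line\"\r\n";
      CSVReader reader = new CSVReader(new StringReader(csv));
      try {
        String[] row;
        while ((row = reader.readNext()) != null) {
          System.out.println(row.length + " fields, last = <" + row[row.length - 1] + ">");
        }
      } finally {
        reader.close();
      }
    }
  }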

On Mon, Mar 18, 2013 at 3:44 PM, Christian Tzolov
<ch...@gmail.com> wrote:
> Hi,
>
> I am working on ETL projects that consume and produce data in the RFC4180
> [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> format by several Dutch government agencies.
>
> The RFC4180 spec supports multi-line fields (i.e. fields with line
> breaks) and escaping of double quotes and delimiters within fields.
> Because of the multi-line feature, one can't directly use the
> FileInputFormat/TextInputFormat or LineRecordReader implementations.
> Furthermore, as I see it, input splitting must be disabled (I'm not sure
> whether any efficient splitting strategy is possible at all).
>
> There are several Java libraries that provide some RFC4180 support [3].
> For Pig, a slightly modified CSVExcelStorage UDF [2] seems to do the job
> (I'm not sure about the input splitting, though). The "Hadoop in
> Practice" example [4] also does not support multi-line fields.
>
> Has anyone used similar 'multi-line fields' formats? I wonder how common
> this use case is.
>
> Also shall we provide support for it in Crunch?
>
> Cheers,
> Chris
>
> [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> [2]  Pig CSVExcelStorage UDF -
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> [3]  jCSV, OpenCSV, SuperCSV
> [4]
> https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java



--
Harsh J