Posted to mapreduce-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2012/02/22 20:01:03 UTC

CSV files as input

It seems nearly impossible to use CSV files as Hadoop input.  I see that there is a CsvRecordInput class, but have found virtually no examples online of how to use it...and the one example I did find blatantly assumed that the CSV records were delimited by endlines, which the CSV spec (RFC 4180) does not guarantee, since quoted fields may themselves contain endlines.  Based on my analysis below, I don't see how CSV input is possible, so I don't understand how CsvRecordInput can work (and I am having trouble understanding the completely undocumented CsvRecordInput.java; it isn't clear how that class is intended to be used).  If CsvRecordInput solves all my problems, then great, but how do I use it?
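
(From skimming the source, my best guess, and it is only a guess, is that CsvRecordInput belongs to the org.apache.hadoop.record serialization layer and reads back what its sibling CsvRecordOutput wrote, i.e. Hadoop's own CSV-flavored wire format rather than arbitrary RFC 4180 files.  Something like the round trip below appears to be the intended use, which wouldn't help with pre-existing CSV data anyway:)

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.record.CsvRecordInput;
    import org.apache.hadoop.record.CsvRecordOutput;

    // Unverified guess at the intended use: a round trip through the
    // matching serializer.  Nothing here parses externally produced CSV.
    static void csvRecordRoundTrip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        CsvRecordOutput out = new CsvRecordOutput(bytes);
        out.writeString("hello", "field1");
        out.writeInt(42, "field2");

        CsvRecordInput in =
                new CsvRecordInput(new ByteArrayInputStream(bytes.toByteArray()));
        String s = in.readString("field1");  // should yield "hello"
        int i = in.readInt("field2");        // should yield 42
    }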

I need to process CSV files which will almost certainly contain quoted endlines.  I have attempted to derive my own record reader for this task and conclude that it is virtually impossible without reading from the beginning of the file.  I explain below.

Consider this: Assuming a split starts at some arbitrary point in the file, the standard record reader approach would be to initialize the record reader by reading to the end of the current mid-record and beginning the record reader at the start of the next full record...but there is no way to positively identify the end of a CSV record if you start at an arbitrary location without potentially reading to the end of the file!
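
For reference, here is roughly the initialization idiom I mean, paraphrased from memory of the stock line-oriented reader rather than copied from it (so treat the details as approximate).  It is only sound because a bare endline always terminates a plain-text record:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    // Fields of the reader, shown for context.
    private long start, end, pos;
    private LineReader in;

    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        start = split.getStart();
        end = start + split.getLength();
        FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
        FSDataInputStream fileIn = fs.open(split.getPath());
        fileIn.seek(start);
        in = new LineReader(fileIn, context.getConfiguration());
        if (start != 0) {
            // Not at the head of the file: discard the (possibly partial)
            // first line; the previous split's reader will consume it.
            start += in.readLine(new Text(), 0,
                    (int) Math.min(Integer.MAX_VALUE, end - start));
        }
        pos = start;
    }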

For example, we must consider the possibility that the split begins in the middle of a quoted string (therefore, endlines do not delimit records because they may be within a string).  We must therefore scan for a possible end-quote to close the string, but if we *didn't* begin within a string there may *be no end-quote at all* (the entire CSV file might not contain a single quoted string).  The only way to identify that we did not begin within a quoted string is to scan to the end of the CSV file (not the end of the *split* mind you).
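
Here is a toy illustration of that ambiguity (hypothetical code; it ignores doubled-quote escapes for brevity):

    // Toy model: record framing depends on the quote state at the start
    // of the window, and that state is the parity of '"' characters over
    // the entire file prefix, which nothing local can determine.
    public class QuoteAmbiguity {
        // Returns the offset of the first record-terminating endline in
        // window, or -1 if none is found before the window runs out.
        static int firstRecordEnd(String window, boolean startInsideQuotes) {
            boolean inQuotes = startInsideQuotes;
            for (int i = 0; i < window.length(); i++) {
                char c = window.charAt(i);
                if (c == '"') inQuotes = !inQuotes;        // ignoring "" escapes
                else if (c == '\n' && !inQuotes) return i; // a real record boundary
            }
            return -1; // still inside a quoted field: keep reading past the window
        }

        public static void main(String[] args) {
            String window = "z\",b\nc,d\n";  // bytes seen from the split onward
            System.out.println(firstRecordEnd(window, true));  // 4
            System.out.println(firstRecordEnd(window, false)); // -1
        }
    }

Under one starting state the first endline ends a record at offset 4; under the other, the quote opens a field that never closes inside the window, so the reader would have to keep scanning, potentially to the end of the file.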

So, initializing a CSV record reader with absolute error-free confidence potentially requires reading not only the entire split at the time of initialization (grossly inefficient in itself), but potentially requires reading the entire file, which may not even reside on the current node!

I'm at a loss.  How can Hadoop take CSV files as input?  It must be possible.  CSV is a very plain and common way to arrange textual data, which is Hadoop's forte; I'm sure people are processing CSV data with Hadoop, and it seems like a natural fit...but I can't imagine how to enable Hadoop to read it under the conditions of Hadoop file splits.

Blech.  Help!

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Re: CSV files as input

Posted by Steve Lewis <lo...@gmail.com>.
Two other points -
If you have several input files, make a custom input format whose
protected boolean isSplitable(JobContext context, Path file)
returns false, and you do not have problems starting in the middle.
If the input is not truly massive, you can simply write a piece of code to
find the longest quoted string by reading the entire file - on a single box
you can handle tens of gigs per hour.
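
For the first point, a minimal sketch (the class name is made up; any FileInputFormat subclass works the same way):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Made-up class name; the point is only the override.  With splitting
    // disabled, every reader starts at byte 0 of its own file, so the
    // quote state is always known.  Parallelism then comes from the file
    // count rather than from splits within a file.
    public class WholeFileCsvInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

For the second point, the preprocessing pass is one sequential read, along these lines (again only a sketch, ignoring doubled-quote escapes):

    import java.io.IOException;
    import java.io.InputStream;

    // One-pass scan: track the length of the current quoted run and
    // remember the longest, so a length-cap assumption can be validated
    // before any job relies on it.
    static int longestQuotedRun(InputStream in) throws IOException {
        int longest = 0, current = -1, c;  // current < 0 means outside quotes
        while ((c = in.read()) != -1) {
            if (c == '"') {
                current = (current < 0) ? 0 : -1;  // enter or leave a quoted run
            } else if (current >= 0) {
                longest = Math.max(longest, ++current);
            }
        }
        return longest;
    }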

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: CSV files as input

Posted by Keith Wiley <kw...@keithwiley.com>.
Thanks for responding.  Unfortunately, the data already exists.  I have no way of instituting limitations on the format, much less reformatting it to suit my needs.  It is true that I can make some general assumptions about the data (unrealistically long strings are unlikely to occur), but I can't write a truly robust reader on the strength of such assumptions.

The problem is that even if I impose an assumption of limited-length strings, that doesn't prescribe a method for handling the possibility of an error.  If a string really is too long and the reader fails to detect it, I'm not sure how to ensure that the reader or subsequent map task fails in a clean fashion.

If I could at least impose an assumption of this sort...and then detect and fail cleanly on violations of the assumption, that would go a long way.
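
Maybe the cleanest failure mode is simply to kill the map attempt the moment the cap is exceeded, something like this inside a custom record reader (hypothetical names, of course):

    import java.io.IOException;

    // Hypothetical guard inside a custom CSV RecordReader: once a quoted
    // field outgrows the assumed cap, the framing can no longer be
    // trusted, so fail the map attempt loudly instead of emitting
    // records that may be split in the wrong place.
    private static final int MAX_QUOTED_FIELD = 1024;  // the assumed limit

    private void checkQuotedLength(int bytesInQuotedField) throws IOException {
        if (bytesInQuotedField > MAX_QUOTED_FIELD) {
            throw new IOException("Quoted CSV field exceeds the assumed "
                    + MAX_QUOTED_FIELD + "-byte cap; aborting rather than "
                    + "emitting possibly mis-framed records");
        }
    }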

I'll think about it.

Thanks.

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________


Re: CSV files as input

Posted by Steve Lewis <lo...@gmail.com>.
It sounds like you may need to give up a little to make things work.
Suppose, for example, that you placed a limit on the length of a quoted
string, say 1024 characters - the reader can then either start at the
beginning or read back by, say, 1024 characters to see if the start is
in a quote and proceed accordingly.  If quoted strings can be of
arbitrary length, there may be no good solution.
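
A rough sketch of the cheap half of that check (hypothetical helper; it assumes no escaped quotes): if the cap holds and the bytes just before the split contain no quote character at all, the split cannot start inside a quoted string.  If a quote does appear, you still have to classify it as opening or closing:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;

    // Under the assumption that no quoted string exceeds maxQuoted bytes,
    // a quote-free window just before the split start proves the split is
    // outside any quoted string (a field spanning the whole window would
    // break the cap).  A window containing quotes stays ambiguous here.
    static boolean definitelyOutsideQuotes(FSDataInputStream in, long splitStart,
                                           int maxQuoted) throws IOException {
        long from = Math.max(0, splitStart - maxQuoted);
        byte[] window = new byte[(int) (splitStart - from)];
        in.readFully(from, window);      // positioned read; stream position unchanged
        for (byte b : window) {
            if (b == '"') return false;  // quote seen: state still ambiguous
        }
        return true;                     // no quotes in the window: outside a quote
    }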

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com