Posted to user@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2014/09/13 04:43:58 UTC

sc.textFile problem due to newlines within a CSV record

Folks,
I think this might be due to the default TextInputFormat in Hadoop. Any
pointers to solutions much appreciated.
>>
More powerfully, you can define your own *InputFormat* implementations to
format the input to your programs however you want. For example, the
default TextInputFormat reads lines of text files. The key it emits for
each record is the byte offset of the line read (as a LongWritable), and
the value is the contents of the line up to the terminating '\n' character
(as a Text object). If you have multi-line records each separated by a
$ character,
you could write your own *InputFormat* that parses files into records split
on this character instead.
>>
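
A rough Spark sketch of that idea (illustrative only, not part of the original message): newer Hadoop versions let the new-API TextInputFormat split records on a custom separator via the textinputformat.record.delimiter setting, so for a hypothetical '$'-separated file no custom InputFormat is needed. The path below is made up.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // Split records on '$' instead of '\n' (the separator used in the example above).
    conf.set("textinputformat.record.delimiter", "$")

    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/data",        // hypothetical path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf
    ).map { case (_, text) => text.toString }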

Thanks,
Mohit

Re: sc.textFile problem due to newlines within a CSV record

Posted by Mohit Jaggi <mo...@gmail.com>.
Thanks Xiangrui. This file already exists w/o escapes. I could probably try
to preprocess it and add the escaping.
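
A very rough sketch of that preprocessing idea (illustrative only; it assumes the embedded newlines occur only inside double-quoted CSV fields, that each input file fits in memory on an executor, and that the paths are made up). It reads each file whole, puts a backslash before \, \r and \n found inside quotes, i.e. the same escaping described below, and writes the result back out:

    // Read each file as one string and escape special characters inside quoted fields.
    val escaped = sc.wholeTextFiles("hdfs:///path/in")      // hypothetical input path
      .map { case (_, content) =>
        val out = new StringBuilder
        var inQuotes = false
        content.foreach {
          case '"'                                  => inQuotes = !inQuotes; out += '"'
          case c @ ('\\' | '\r' | '\n') if inQuotes => out += '\\'; out += c
          case c                                    => out += c
        }
        out.toString
      }
    escaped.saveAsTextFile("hdfs:///path/out")               // hypothetical output path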

On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng <me...@gmail.com> wrote:

> I wrote an input format for Redshift's tables unloaded via UNLOAD with the
> ESCAPE option: https://github.com/mengxr/redshift-input-format , which
> can recognize multi-line records.
>
> Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
> the delimiter character. You can apply the same escaping before
> calling saveAsTextFile, then use the input format to load them back.
>
> Xiangrui
>
> On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi <mo...@gmail.com> wrote:
> > Folks,
> > I think this might be due to the default TextInputFormat in Hadoop. Any
> > pointers to solutions much appreciated.
> >>>
> > More powerfully, you can define your own InputFormat implementations to
> > format the input to your programs however you want. For example, the
> default
> > TextInputFormat reads lines of text files. The key it emits for each
> record
> > is the byte offset of the line read (as a LongWritable), and the value is
> > the contents of the line up to the terminating '\n' character (as a Text
> > object). If you have multi-line records each separated by a $ character,
> you
> > could write your own InputFormat that parses files into records split on
> > this character instead.
> >>>
> >
> > Thanks,
> > Mohit
>

Re: sc.textFile problem due to newlines within a CSV record

Posted by Xiangrui Meng <me...@gmail.com>.
I wrote an input format for Redshift's tables unloaded via UNLOAD with the
ESCAPE option: https://github.com/mengxr/redshift-input-format , which
can recognize multi-line records.

Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
the delimiter character. You can apply the same escaping before
calling saveAsTextFile, then use the input format to load them back.
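
For illustration, a minimal sketch of that escaping step before the write (the '|' delimiter, the rows RDD and the escape helper are all assumptions here, not part of the linked input format):

    // Hypothetical helper: put a backslash before \, \r, \n and the field delimiter.
    def escape(field: String, delimiter: Char): String =
      field.flatMap { c =>
        if (c == '\\' || c == '\r' || c == '\n' || c == delimiter) s"\\$c" else c.toString
      }

    // rows: RDD[Seq[String]] is assumed; '|' is an assumed delimiter.
    rows.map(fields => fields.map(escape(_, '|')).mkString("|"))
        .saveAsTextFile("hdfs:///path/escaped")   // hypothetical output path

The input format linked above can then treat the escaped newlines as part of a record rather than as record boundaries.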

Xiangrui

On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi <mo...@gmail.com> wrote:
> Folks,
> I think this might be due to the default TextInputFormat in Hadoop. Any
> pointers to solutions much appreciated.
>>>
> More powerfully, you can define your own InputFormat implementations to
> format the input to your programs however you want. For example, the default
> TextInputFormat reads lines of text files. The key it emits for each record
> is the byte offset of the line read (as a LongWritable), and the value is
> the contents of the line up to the terminating '\n' character (as a Text
> object). If you have multi-line records each separated by a $ character, you
> could write your own InputFormat that parses files into records split on
> this character instead.
>>>
>
> Thanks,
> Mohit
