Posted to user@sqoop.apache.org by David Kincaid <ki...@gmail.com> on 2012/09/20 19:55:44 UTC

Dropping embedded newlines for csv

I'm brand new to Sqoop and am working on importing data from an Oracle database
into HDFS. It is going to solve a number of problems I've been trying to
solve, so I'm really excited about it. I have it working great right now
except for one thing. One of the columns in one of the tables has newline
characters in it. I'm importing to comma delimited files and need to strip
off those embedded newline characters since the tool I'm reading the .csv
files with isn't handling those well.

I saw the option --hive-drop-import-delims which is exactly what I want,
but I assume that only works when importing to Hive. How have others solved
this problem?

Thanks,
Dave

Re: Dropping embedded newlines for csv

Posted by David Kincaid <ki...@gmail.com>.
That is awesome! That did it. Thank you very much for a great tool.

Dave

On Thu, Sep 20, 2012 at 1:51 PM, Jarek Jarcec Cecho <ja...@apache.org> wrote:

> Well, I have to admit that it's quite confusing, especially since the user
> guide mentions this command-line parameter only in the Hive section.
>
> I would prefer not to change the parameter name for now, as that would
> break backward compatibility. Instead, we might consider adjusting our
> User Guide. Please file a JIRA if you feel the same way.
>
> Jarcec

Re: Dropping embedded newlines for csv

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Well, I have to admit that it's quite confusing, especially since the user guide mentions this command-line parameter only in the Hive section.

I would prefer not to change the parameter name for now, as that would break backward compatibility. Instead, we might consider adjusting our User Guide. Please file a JIRA if you feel the same way.

Jarcec

On Thu, Sep 20, 2012 at 02:23:35PM -0400, Chalcy wrote:
> I got that, Jarcec.  If the parameter does not need Hive, then why call
> it --hive-drop-import-delims?  It could instead be called
> --drop-import-delims, right?
> 
> Having "hive" in the name causes confusion :)  That was my point.
> 
> Sorry I did not spell your name right, Jarcec.
> 
> --Chalcy

Re: Dropping embedded newlines for csv

Posted by Chalcy <ch...@gmail.com>.
I got that, Jarcec.  If the parameter does not need Hive, then why call
it --hive-drop-import-delims?  It could instead be called
--drop-import-delims, right?

Having "hive" in the name causes confusion :)  That was my point.

Sorry I did not spell your name right, Jarcec.

--Chalcy
On Thu, Sep 20, 2012 at 2:17 PM, Jarek Jarcec Cecho <ja...@apache.org> wrote:

> Hi Chalcy,
> I'm glad that you're enjoying Sqoop a lot :-)
>
> I'm sorry for the confusion I mistakenly caused. The name of the parameter
> is --hive-drop-import-delims in all cases. What I meant is that this
> argument can be used independently of the --hive-import argument, so you
> can drop Hive delimiters (\n, \r, \01) and still import data directly
> into HDFS without any other Hive interaction. I believe you do not even
> need a Hive installation to do so. I hope this helps to clear up the
> confusion a bit.
>
> Jarcec

Re: Dropping embedded newlines for csv

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Chalcy,
I'm glad that you're enjoying Sqoop a lot :-)

I'm sorry for the confusion I mistakenly caused. The name of the parameter is --hive-drop-import-delims in all cases. What I meant is that this argument can be used independently of the --hive-import argument, so you can drop Hive delimiters (\n, \r, \01) and still import data directly into HDFS without any other Hive interaction. I believe you do not even need a Hive installation to do so. I hope this helps to clear up the confusion a bit.

Jarcec
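
For readers who want to see what dropping the delimiters amounts to, here is a minimal Python sketch of the effect on one record. It is not Sqoop's implementation; it assumes the dropped characters are \n, \r and \01 (Ctrl-A), as the Sqoop User Guide lists for --hive-drop-import-delims. The function names drop_hive_delims and to_csv_line are made up for this illustration.

```python
import csv
import io

# Characters assumed to be dropped from string fields: newline,
# carriage return and \01 (Ctrl-A, Hive's default field delimiter).
HIVE_DELIMS = ("\n", "\r", "\x01")


def drop_hive_delims(value):
    """Return the string with every Hive delimiter character removed."""
    for ch in HIVE_DELIMS:
        value = value.replace(ch, "")
    return value


def to_csv_line(fields):
    """Write one record as a comma-delimited line, cleaning string fields only."""
    cleaned = [drop_hive_delims(f) if isinstance(f, str) else f for f in fields]
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(cleaned)
    return buf.getvalue()


if __name__ == "__main__":
    # A record whose second column contains an embedded newline.
    print(to_csv_line([42, "first line\nsecond line"]), end="")
```

Note that the characters are removed, not replaced, so the two halves of a multi-line value end up joined without a space.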

On Thu, Sep 20, 2012 at 02:07:16PM -0400, Chalcy wrote:
> Hi Jarec,
> 
> I did not know that --hive-drop-import-delims works without --hive-import.  In
> that case, do we want to call this parameter just --drop-import-delims
> instead of --hive-drop-import-delims?
> 
> Thanks,
> Chalcy

Re: Dropping embedded newlines for csv

Posted by Chalcy <ch...@gmail.com>.
Hi Jarec,

I did not know that --hive-drop-import-delims works without --hive-import.  In
that case, do we want to call this parameter just --drop-import-delims
instead of --hive-drop-import-delims?

Thanks,
Chalcy

On Thu, Sep 20, 2012 at 2:04 PM, Chalcy <ch...@gmail.com> wrote:

> I use --hive-drop-import-delims for Hive imports; that was the problem
> I had to solve a year ago.  Since you want the data in HDFS, a workaround
> is to do a Hive import and then use the underlying HDFS directory, e.g.
> /user/hive/warehouse/mynewlineremoveddata.
>
> Sqoop is a great tool.  I use Sqoop for all database imports.
>
> Thanks,
> Chalcy

Re: Dropping embedded newlines for csv

Posted by Chalcy <ch...@gmail.com>.
I use --hive-drop-import-delims for Hive imports; that was the problem
I had to solve a year ago.  Since you want the data in HDFS, a workaround
is to do a Hive import and then use the underlying HDFS directory, e.g.
/user/hive/warehouse/mynewlineremoveddata.

Sqoop is a great tool.  I use Sqoop for all database imports.

Thanks,
Chalcy

On Thu, Sep 20, 2012 at 1:55 PM, David Kincaid <ki...@gmail.com> wrote:

> I'm brand new to Sqoop and am working on importing data from an Oracle database
> into HDFS. It is going to solve a number of problems I've been trying to
> solve, so I'm really excited about it. I have it working great right now
> except for one thing. One of the columns in one of the tables has
> newline characters in it. I'm importing to comma delimited files and need
> to strip off those embedded newline characters since the tool I'm reading
> the .csv files with isn't handling those well.
>
> I saw the option --hive-drop-import-delims which is exactly what I want,
> but I assume that only works when importing to Hive. How have others
> solved this problem?
>
> Thanks,
> Dave
>

Re: Dropping embedded newlines for csv

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Dave,
even though the name --hive-drop-import-delims implies that it is tied to Hive import, that's not the case. This argument should be independent of the --hive-import argument and should work normally in a non-Hive import.

Jarcec
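
As a rough illustration of the suggestion above, the following Python sketch assembles a plain HDFS import command that passes --hive-drop-import-delims without --hive-import. The JDBC URL, table name, user name and target directory are hypothetical placeholders, and build_import_command is a helper invented for this example; the flags themselves are standard Sqoop 1 import options.

```python
import shlex

# Hypothetical Oracle connection string; replace with your own.
JDBC_URL = "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL"


def build_import_command(table, target_dir, username):
    """Build a plain HDFS import (no --hive-import) that still drops
    the Hive delimiter characters from string columns."""
    return [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--username", username,
        "-P",                             # prompt for the password
        "--table", table,
        "--target-dir", target_dir,       # plain HDFS directory
        "--fields-terminated-by", ",",    # comma-delimited output
        "--hive-drop-import-delims",      # used without --hive-import
    ]


if __name__ == "__main__":
    argv = build_import_command("MY_TABLE", "/user/dave/my_table", "dave")
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(argv))
```

The printed line can be run as-is once the placeholders are filled in; nothing here depends on a Hive installation being present.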

On Thu, Sep 20, 2012 at 12:55:44PM -0500, David Kincaid wrote:
> I'm brand new to Sqoop and am working on importing data from an Oracle database
> into HDFS. It is going to solve a number of problems I've been trying to
> solve, so I'm really excited about it. I have it working great right now
> except for one thing. One of the columns in one of the tables has newline
> characters in it. I'm importing to comma delimited files and need to strip
> off those embedded newline characters since the tool I'm reading the .csv
> files with isn't handling those well.
> 
> I saw the option --hive-drop-import-delims which is exactly what I want,
> but I assume that only works when importing to Hive. How have others solved
> this problem?
> 
> Thanks,
> Dave