You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@sqoop.apache.org by Glen Hein <gl...@gmail.com> on 2015/03/10 17:34:47 UTC

having trouble with --hive-delims-replacement

I have an oracle db were some of the records contain \n and \r characters. I
m trying to use --hive-delims-replacement to convert them to spaces. I've
had mixed results. It seems that when the \n and \r appear at the beginning
of the fields data, then they are converted as expected. But when they are
embedded in the middle of the field data, then they are not being replaced.
Here's a hexdump showing the unconverted \n and \r characters:

000000c0  6e 75 6c 6c 01 48 65 6c  6c 6f 2c 0d 0a 20 20 20
|null.Hello,..   |
000000d0  20 20 20 4d 79 20 6e 61  6d 65 20 69 73 20 4d 61  |   My name is
Ma|
000000e0  72 79 20 48 75 6e 74 2c  20 49 20 61 6d 20 31 39  |ry Hunt, I am
19|
000000f0  20 79 65 61 72 73 20 6f  6c 64 20 61 6e 64 20 6c  | years old and
l|
00000100  69 76 65 20 69 6e 20 43  61 6c 69 66 6f 72 6e 69  |ive in
Californi|
00000110  61 2e 20 49 20 67 72 61  64 75 61 74 65 64 20 68  |a. I graduated
h|
00000120  69 67 68 20 73 63 68 6f  6f 6c 20 6f 6e 20 32 30  |igh school on
20|
00000130  31 31 20 61 6e 64 20 64  65 63 69 64 65 64 20 74  |11 and decided
t|
00000140  6f 20 74 61 6b 65 20 61  20 79 65 61 72 20 6f 66  |o take a year
of|
00000150  66 20 74 6f 20 73 65 65  20 69 66 20 77 68 61 74  |f to see if
what|
00000160  20 49 20 63 68 6f 73 65  20 61 73 20 61 20 66 75  | I chose as a
fu|
00000170  74 75 72 65 20 63 61 72  65 65 72 20 69 73 20 77  |ture career is
w|
00000180  68 61 74 20 49 20 72 65  61 6c 79 20 77 61 6e 74  |hat I realy
want|
00000190  65 64 20 74 6f 20 64 6f  2e 20 4d 79 20 79 65 61  |ed to do. My
yea|
000001a0  72 20 69 73 20 75 70 20  61 6e 64 20 49 20 73 74  |r is up and I
st|
000001b0  69 6c 6c 20 77 61 6e 74  20 74 6f 20 62 65 20 61  |ill want to be
a|
000001c0  20 74 65 61 63 68 65 72  2e 20 49 20 6c 6f 6f 6b  | teacher. I
look|
000001d0  20 66 6f 77 61 72 64 20  74 6f 20 6c 65 61 72 6e  | foward to
learn|
000001e0  69 6e 67 20 61 6c 6c 20  74 68 61 74 20 49 20 63  |ing all that I
c|
000001f0  61 6e 20 61 6e 64 20 62  65 69 6e 67 20 61 6e 20  |an and being
an |
00000200  65 6c 65 6d 65 6e 74 72  79 20 73 63 68 6f 6f 6c  |elementry
school|
00000210  20 74 65 61 63 68 65 72  2e 01 6e 75 6c 6c 01 31  |
teacher..null.1|

You can see in the first row of data there is a "01" that starts the field,
and then a few characters later the 0d and 0a. I'm passing the following
option to sqoop:

                <arg>--hive-delims-replacement</arg>
                <arg>ABCDEFG</arg>

I've tried several different values for the actual replacement string, but
the results are always the same.

Is there a better way to replace the unwanted characters?

Thanks,
Glen

Re: having trouble with --hive-delims-replacement

Posted by Abraham Elmahrek <ab...@cloudera.com>.
No idea if CLOB handling has changed. I'll let someone else in the
community answer that one.

On that note... here's an interesting google search:
https://www.google.com/search?q=site%3Aissues.apache.org%20sqoop%20clob

On Wed, Mar 11, 2015 at 11:42 AM, Glen Hein <gl...@gmail.com> wrote:

> I meant to say that the older system is running WITHOUT the extra mapping
> arguments.
>
>
> On Wed, Mar 11, 2015 at 10:20 AM, Glen Hein <gl...@gmail.com> wrote:
>
>> I've narrowed down the problem to CLOB fields. I did figure out that if I
>> add the option "--map-column-java BIO=String
>> " (BIO is the name of the CLOB field), then newlines are filtered as
>> expected.
>>
>> We do have an older system running Sqoop 1.3.0 that seems to filter the
>> newlines for the CLOB fields with adding the --map-column-java. Do you know
>> if 1.4.3 differs from 1.3.0 on how CLOB are handled?
>>
>> -Glen
>>
>> On Wed, Mar 11, 2015 at 10:03 AM, Abraham Elmahrek <ab...@cloudera.com>
>> wrote:
>>
>>> Hey man,
>>>
>>> Maybe regexp_replace + free-form query?
>>>
>>>
>>> http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_free_form_query_imports
>>> http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions130.htm
>>>
>>> -Abe
>>>
>>> On Tue, Mar 10, 2015 at 9:34 AM, Glen Hein <gl...@gmail.com> wrote:
>>>
>>>>
>>>> I have an oracle db were some of the records contain \n and \r
>>>> characters. I
>>>> m trying to use --hive-delims-replacement to convert them to spaces.
>>>> I've had mixed results. It seems that when the \n and \r appear at the
>>>> beginning of the fields data, then they are converted as expected. But when
>>>> they are embedded in the middle of the field data, then they are not being
>>>> replaced. Here's a hexdump showing the unconverted \n and \r characters:
>>>>
>>>> 000000c0  6e 75 6c 6c 01 48 65 6c  6c 6f 2c 0d 0a 20 20 20
>>>> |null.Hello,..   |
>>>> 000000d0  20 20 20 4d 79 20 6e 61  6d 65 20 69 73 20 4d 61  |   My name
>>>> is Ma|
>>>> 000000e0  72 79 20 48 75 6e 74 2c  20 49 20 61 6d 20 31 39  |ry Hunt, I
>>>> am 19|
>>>> 000000f0  20 79 65 61 72 73 20 6f  6c 64 20 61 6e 64 20 6c  | years old
>>>> and l|
>>>> 00000100  69 76 65 20 69 6e 20 43  61 6c 69 66 6f 72 6e 69  |ive in
>>>> Californi|
>>>> 00000110  61 2e 20 49 20 67 72 61  64 75 61 74 65 64 20 68  |a. I
>>>> graduated h|
>>>> 00000120  69 67 68 20 73 63 68 6f  6f 6c 20 6f 6e 20 32 30  |igh school
>>>> on 20|
>>>> 00000130  31 31 20 61 6e 64 20 64  65 63 69 64 65 64 20 74  |11 and
>>>> decided t|
>>>> 00000140  6f 20 74 61 6b 65 20 61  20 79 65 61 72 20 6f 66  |o take a
>>>> year of|
>>>> 00000150  66 20 74 6f 20 73 65 65  20 69 66 20 77 68 61 74  |f to see
>>>> if what|
>>>> 00000160  20 49 20 63 68 6f 73 65  20 61 73 20 61 20 66 75  | I chose
>>>> as a fu|
>>>> 00000170  74 75 72 65 20 63 61 72  65 65 72 20 69 73 20 77  |ture
>>>> career is w|
>>>> 00000180  68 61 74 20 49 20 72 65  61 6c 79 20 77 61 6e 74  |hat I
>>>> realy want|
>>>> 00000190  65 64 20 74 6f 20 64 6f  2e 20 4d 79 20 79 65 61  |ed to do.
>>>> My yea|
>>>> 000001a0  72 20 69 73 20 75 70 20  61 6e 64 20 49 20 73 74  |r is up
>>>> and I st|
>>>> 000001b0  69 6c 6c 20 77 61 6e 74  20 74 6f 20 62 65 20 61  |ill want
>>>> to be a|
>>>> 000001c0  20 74 65 61 63 68 65 72  2e 20 49 20 6c 6f 6f 6b  | teacher.
>>>> I look|
>>>> 000001d0  20 66 6f 77 61 72 64 20  74 6f 20 6c 65 61 72 6e  | foward to
>>>> learn|
>>>> 000001e0  69 6e 67 20 61 6c 6c 20  74 68 61 74 20 49 20 63  |ing all
>>>> that I c|
>>>> 000001f0  61 6e 20 61 6e 64 20 62  65 69 6e 67 20 61 6e 20  |an and
>>>> being an |
>>>> 00000200  65 6c 65 6d 65 6e 74 72  79 20 73 63 68 6f 6f 6c  |elementry
>>>> school|
>>>> 00000210  20 74 65 61 63 68 65 72  2e 01 6e 75 6c 6c 01 31  |
>>>> teacher..null.1|
>>>>
>>>> You can see in the first row of data there is a "01" that starts the
>>>> field, and then a few characters later the 0d and 0a. I'm passing the
>>>> following option to sqoop:
>>>>
>>>>                 <arg>--hive-delims-replacement</arg>
>>>>                 <arg>ABCDEFG</arg>
>>>>
>>>> I've tried several different values for the actual replacement string,
>>>> but the results are always the same.
>>>>
>>>> Is there a better way to replace the unwanted characters?
>>>>
>>>> Thanks,
>>>> Glen
>>>>
>>>>
>>>
>>
>

Re: having trouble with --hive-delims-replacement

Posted by Glen Hein <gl...@gmail.com>.
I meant to say that the older system is running WITHOUT the extra mapping
arguments.


On Wed, Mar 11, 2015 at 10:20 AM, Glen Hein <gl...@gmail.com> wrote:

> I've narrowed down the problem to CLOB fields. I did figure out that if I
> add the option "--map-column-java BIO=String
> " (BIO is the name of the CLOB field), then newlines are filtered as
> expected.
>
> We do have an older system running Sqoop 1.3.0 that seems to filter the
> newlines for the CLOB fields with adding the --map-column-java. Do you know
> if 1.4.3 differs from 1.3.0 on how CLOB are handled?
>
> -Glen
>
> On Wed, Mar 11, 2015 at 10:03 AM, Abraham Elmahrek <ab...@cloudera.com>
> wrote:
>
>> Hey man,
>>
>> Maybe regexp_replace + free-form query?
>>
>>
>> http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_free_form_query_imports
>> http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions130.htm
>>
>> -Abe
>>
>> On Tue, Mar 10, 2015 at 9:34 AM, Glen Hein <gl...@gmail.com> wrote:
>>
>>>
>>> I have an oracle db were some of the records contain \n and \r
>>> characters. I
>>> m trying to use --hive-delims-replacement to convert them to spaces.
>>> I've had mixed results. It seems that when the \n and \r appear at the
>>> beginning of the fields data, then they are converted as expected. But when
>>> they are embedded in the middle of the field data, then they are not being
>>> replaced. Here's a hexdump showing the unconverted \n and \r characters:
>>>
>>> 000000c0  6e 75 6c 6c 01 48 65 6c  6c 6f 2c 0d 0a 20 20 20
>>> |null.Hello,..   |
>>> 000000d0  20 20 20 4d 79 20 6e 61  6d 65 20 69 73 20 4d 61  |   My name
>>> is Ma|
>>> 000000e0  72 79 20 48 75 6e 74 2c  20 49 20 61 6d 20 31 39  |ry Hunt, I
>>> am 19|
>>> 000000f0  20 79 65 61 72 73 20 6f  6c 64 20 61 6e 64 20 6c  | years old
>>> and l|
>>> 00000100  69 76 65 20 69 6e 20 43  61 6c 69 66 6f 72 6e 69  |ive in
>>> Californi|
>>> 00000110  61 2e 20 49 20 67 72 61  64 75 61 74 65 64 20 68  |a. I
>>> graduated h|
>>> 00000120  69 67 68 20 73 63 68 6f  6f 6c 20 6f 6e 20 32 30  |igh school
>>> on 20|
>>> 00000130  31 31 20 61 6e 64 20 64  65 63 69 64 65 64 20 74  |11 and
>>> decided t|
>>> 00000140  6f 20 74 61 6b 65 20 61  20 79 65 61 72 20 6f 66  |o take a
>>> year of|
>>> 00000150  66 20 74 6f 20 73 65 65  20 69 66 20 77 68 61 74  |f to see if
>>> what|
>>> 00000160  20 49 20 63 68 6f 73 65  20 61 73 20 61 20 66 75  | I chose as
>>> a fu|
>>> 00000170  74 75 72 65 20 63 61 72  65 65 72 20 69 73 20 77  |ture career
>>> is w|
>>> 00000180  68 61 74 20 49 20 72 65  61 6c 79 20 77 61 6e 74  |hat I realy
>>> want|
>>> 00000190  65 64 20 74 6f 20 64 6f  2e 20 4d 79 20 79 65 61  |ed to do.
>>> My yea|
>>> 000001a0  72 20 69 73 20 75 70 20  61 6e 64 20 49 20 73 74  |r is up and
>>> I st|
>>> 000001b0  69 6c 6c 20 77 61 6e 74  20 74 6f 20 62 65 20 61  |ill want to
>>> be a|
>>> 000001c0  20 74 65 61 63 68 65 72  2e 20 49 20 6c 6f 6f 6b  | teacher. I
>>> look|
>>> 000001d0  20 66 6f 77 61 72 64 20  74 6f 20 6c 65 61 72 6e  | foward to
>>> learn|
>>> 000001e0  69 6e 67 20 61 6c 6c 20  74 68 61 74 20 49 20 63  |ing all
>>> that I c|
>>> 000001f0  61 6e 20 61 6e 64 20 62  65 69 6e 67 20 61 6e 20  |an and
>>> being an |
>>> 00000200  65 6c 65 6d 65 6e 74 72  79 20 73 63 68 6f 6f 6c  |elementry
>>> school|
>>> 00000210  20 74 65 61 63 68 65 72  2e 01 6e 75 6c 6c 01 31  |
>>> teacher..null.1|
>>>
>>> You can see in the first row of data there is a "01" that starts the
>>> field, and then a few characters later the 0d and 0a. I'm passing the
>>> following option to sqoop:
>>>
>>>                 <arg>--hive-delims-replacement</arg>
>>>                 <arg>ABCDEFG</arg>
>>>
>>> I've tried several different values for the actual replacement string,
>>> but the results are always the same.
>>>
>>> Is there a better way to replace the unwanted characters?
>>>
>>> Thanks,
>>> Glen
>>>
>>>
>>
>

Re: having trouble with --hive-delims-replacement

Posted by Glen Hein <gl...@gmail.com>.
I've narrowed down the problem to CLOB fields. I did figure out that if I
add the option "--map-column-java BIO=String
" (BIO is the name of the CLOB field), then newlines are filtered as
expected.

We do have an older system running Sqoop 1.3.0 that seems to filter the
newlines for the CLOB fields with adding the --map-column-java. Do you know
if 1.4.3 differs from 1.3.0 on how CLOB are handled?

-Glen

On Wed, Mar 11, 2015 at 10:03 AM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> Hey man,
>
> Maybe regexp_replace + free-form query?
>
>
> http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_free_form_query_imports
> http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions130.htm
>
> -Abe
>
> On Tue, Mar 10, 2015 at 9:34 AM, Glen Hein <gl...@gmail.com> wrote:
>
>>
>> I have an oracle db were some of the records contain \n and \r
>> characters. I
>> m trying to use --hive-delims-replacement to convert them to spaces. I've
>> had mixed results. It seems that when the \n and \r appear at the beginning
>> of the fields data, then they are converted as expected. But when they are
>> embedded in the middle of the field data, then they are not being replaced.
>> Here's a hexdump showing the unconverted \n and \r characters:
>>
>> 000000c0  6e 75 6c 6c 01 48 65 6c  6c 6f 2c 0d 0a 20 20 20
>> |null.Hello,..   |
>> 000000d0  20 20 20 4d 79 20 6e 61  6d 65 20 69 73 20 4d 61  |   My name
>> is Ma|
>> 000000e0  72 79 20 48 75 6e 74 2c  20 49 20 61 6d 20 31 39  |ry Hunt, I
>> am 19|
>> 000000f0  20 79 65 61 72 73 20 6f  6c 64 20 61 6e 64 20 6c  | years old
>> and l|
>> 00000100  69 76 65 20 69 6e 20 43  61 6c 69 66 6f 72 6e 69  |ive in
>> Californi|
>> 00000110  61 2e 20 49 20 67 72 61  64 75 61 74 65 64 20 68  |a. I
>> graduated h|
>> 00000120  69 67 68 20 73 63 68 6f  6f 6c 20 6f 6e 20 32 30  |igh school
>> on 20|
>> 00000130  31 31 20 61 6e 64 20 64  65 63 69 64 65 64 20 74  |11 and
>> decided t|
>> 00000140  6f 20 74 61 6b 65 20 61  20 79 65 61 72 20 6f 66  |o take a
>> year of|
>> 00000150  66 20 74 6f 20 73 65 65  20 69 66 20 77 68 61 74  |f to see if
>> what|
>> 00000160  20 49 20 63 68 6f 73 65  20 61 73 20 61 20 66 75  | I chose as
>> a fu|
>> 00000170  74 75 72 65 20 63 61 72  65 65 72 20 69 73 20 77  |ture career
>> is w|
>> 00000180  68 61 74 20 49 20 72 65  61 6c 79 20 77 61 6e 74  |hat I realy
>> want|
>> 00000190  65 64 20 74 6f 20 64 6f  2e 20 4d 79 20 79 65 61  |ed to do. My
>> yea|
>> 000001a0  72 20 69 73 20 75 70 20  61 6e 64 20 49 20 73 74  |r is up and
>> I st|
>> 000001b0  69 6c 6c 20 77 61 6e 74  20 74 6f 20 62 65 20 61  |ill want to
>> be a|
>> 000001c0  20 74 65 61 63 68 65 72  2e 20 49 20 6c 6f 6f 6b  | teacher. I
>> look|
>> 000001d0  20 66 6f 77 61 72 64 20  74 6f 20 6c 65 61 72 6e  | foward to
>> learn|
>> 000001e0  69 6e 67 20 61 6c 6c 20  74 68 61 74 20 49 20 63  |ing all that
>> I c|
>> 000001f0  61 6e 20 61 6e 64 20 62  65 69 6e 67 20 61 6e 20  |an and being
>> an |
>> 00000200  65 6c 65 6d 65 6e 74 72  79 20 73 63 68 6f 6f 6c  |elementry
>> school|
>> 00000210  20 74 65 61 63 68 65 72  2e 01 6e 75 6c 6c 01 31  |
>> teacher..null.1|
>>
>> You can see in the first row of data there is a "01" that starts the
>> field, and then a few characters later the 0d and 0a. I'm passing the
>> following option to sqoop:
>>
>>                 <arg>--hive-delims-replacement</arg>
>>                 <arg>ABCDEFG</arg>
>>
>> I've tried several different values for the actual replacement string,
>> but the results are always the same.
>>
>> Is there a better way to replace the unwanted characters?
>>
>> Thanks,
>> Glen
>>
>>
>

Re: having trouble with --hive-delims-replacement

Posted by Abraham Elmahrek <ab...@cloudera.com>.
Hey man,

Maybe regexp_replace + free-form query?

http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_free_form_query_imports
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions130.htm

-Abe

On Tue, Mar 10, 2015 at 9:34 AM, Glen Hein <gl...@gmail.com> wrote:

>
> I have an oracle db were some of the records contain \n and \r characters.
> I
> m trying to use --hive-delims-replacement to convert them to spaces. I've
> had mixed results. It seems that when the \n and \r appear at the beginning
> of the fields data, then they are converted as expected. But when they are
> embedded in the middle of the field data, then they are not being replaced.
> Here's a hexdump showing the unconverted \n and \r characters:
>
> 000000c0  6e 75 6c 6c 01 48 65 6c  6c 6f 2c 0d 0a 20 20 20
> |null.Hello,..   |
> 000000d0  20 20 20 4d 79 20 6e 61  6d 65 20 69 73 20 4d 61  |   My name is
> Ma|
> 000000e0  72 79 20 48 75 6e 74 2c  20 49 20 61 6d 20 31 39  |ry Hunt, I am
> 19|
> 000000f0  20 79 65 61 72 73 20 6f  6c 64 20 61 6e 64 20 6c  | years old
> and l|
> 00000100  69 76 65 20 69 6e 20 43  61 6c 69 66 6f 72 6e 69  |ive in
> Californi|
> 00000110  61 2e 20 49 20 67 72 61  64 75 61 74 65 64 20 68  |a. I
> graduated h|
> 00000120  69 67 68 20 73 63 68 6f  6f 6c 20 6f 6e 20 32 30  |igh school on
> 20|
> 00000130  31 31 20 61 6e 64 20 64  65 63 69 64 65 64 20 74  |11 and
> decided t|
> 00000140  6f 20 74 61 6b 65 20 61  20 79 65 61 72 20 6f 66  |o take a year
> of|
> 00000150  66 20 74 6f 20 73 65 65  20 69 66 20 77 68 61 74  |f to see if
> what|
> 00000160  20 49 20 63 68 6f 73 65  20 61 73 20 61 20 66 75  | I chose as a
> fu|
> 00000170  74 75 72 65 20 63 61 72  65 65 72 20 69 73 20 77  |ture career
> is w|
> 00000180  68 61 74 20 49 20 72 65  61 6c 79 20 77 61 6e 74  |hat I realy
> want|
> 00000190  65 64 20 74 6f 20 64 6f  2e 20 4d 79 20 79 65 61  |ed to do. My
> yea|
> 000001a0  72 20 69 73 20 75 70 20  61 6e 64 20 49 20 73 74  |r is up and I
> st|
> 000001b0  69 6c 6c 20 77 61 6e 74  20 74 6f 20 62 65 20 61  |ill want to
> be a|
> 000001c0  20 74 65 61 63 68 65 72  2e 20 49 20 6c 6f 6f 6b  | teacher. I
> look|
> 000001d0  20 66 6f 77 61 72 64 20  74 6f 20 6c 65 61 72 6e  | foward to
> learn|
> 000001e0  69 6e 67 20 61 6c 6c 20  74 68 61 74 20 49 20 63  |ing all that
> I c|
> 000001f0  61 6e 20 61 6e 64 20 62  65 69 6e 67 20 61 6e 20  |an and being
> an |
> 00000200  65 6c 65 6d 65 6e 74 72  79 20 73 63 68 6f 6f 6c  |elementry
> school|
> 00000210  20 74 65 61 63 68 65 72  2e 01 6e 75 6c 6c 01 31  |
> teacher..null.1|
>
> You can see in the first row of data there is a "01" that starts the
> field, and then a few characters later the 0d and 0a. I'm passing the
> following option to sqoop:
>
>                 <arg>--hive-delims-replacement</arg>
>                 <arg>ABCDEFG</arg>
>
> I've tried several different values for the actual replacement string, but
> the results are always the same.
>
> Is there a better way to replace the unwanted characters?
>
> Thanks,
> Glen
>
>