Posted to user@sqoop.apache.org by Nikolay Christov <nm...@gmail.com> on 2015/07/21 21:43:54 UTC

Sqoop pgsql import utf-8 error

Hello,

I just upgraded from postgresql 9.3 to 9.4 and started seeing errors during
imports:
org.postgresql.util.PSQLException: ERROR: invalid byte sequence for
encoding "UTF8": 0xae

The database uses SQL_ASCII encoding, and in the past that wasn't an issue. I
upgraded to the 9.4 pg driver, but that didn't solve the problem either.

Has anyone experienced similar issues?

Thanks,
-Nicky

Re: Sqoop pgsql import utf-8 error

Posted by Abraham Elmahrek <ab...@cloudera.com>.
Thanks for getting back to the mailing list and figuring this out!

-Abe

On Tue, Jul 21, 2015 at 7:32 PM, Nikolay Christov <nm...@gmail.com>
wrote:

> Hey Abe, thanks for the response. I figured out what the issue was and
> will post it here in case someone else bumps into this or has a better
> solution.
>
> The data I was reading contained badly formed UTF-8 (not valid UTF-8).
> PostgreSQL 9.3 didn't handle this correctly and returned stripped or
> altered characters, depending on the pg client's encoding. PostgreSQL 9.4
> fixed that: when the data is not valid UTF-8 but the client encoding is
> set to UTF8, the server now throws an error. To work around it, I cast
> the bad column to a byte array and convert it to base64 during the Sqoop
> import:
> encode(name::bytea, 'base64')
> Since Hive has native support for base64, I can use unbase64() to get my
> original non-UTF-8 string back. It would be nice to skip this encoding
> yoga and dump those non-UTF-8 strings directly into Hive.
>
> I am on pgsql 9.4
> sqoop 1.4.5.2.2.0.0-2041
> byte representation of bad UTF8: \x695465787420ae2c2061204a4156412d504446206c696272617279
> (iText ®, a JAVA-PDF library)
> command: sqoop import --verbose --username xxxxx --password xxxxx
> --connect "jdbc:postgresql://xxxxx:5432/db_name" --split-by id --target-dir
> /tmp/oozietemp --num-mappers 20 --direct --hive-table=metadata
> --hive-import --hive-overwrite --query "select name, id from table where
> \$CONDITIONS"
>
> some stack trace:
> 15/07/21 17:43:44 INFO mapreduce.Job: Task Id :
> attempt_1433455729125_6011_m_000011_2, Status : FAILED
> Error: java.io.IOException: SQLException in nextKeyValue
>         at
> org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
>         at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:553)
>         at
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>         at
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at
> org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: org.postgresql.util.PSQLException: ERROR: invalid regular
> expression: invalid escape \ sequence
>         at
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2270)
>         at
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1998)
>         at
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
>         at
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:570)
>         at
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:420)
>         at
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:305)
>         at
> org.apache.sqoop.mapreduce.db.DBRecordReader.executeQuery(DBRecordReader.java:111)
>         at
> org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:235)
>         ... 12 more
>
> On Tue, Jul 21, 2015 at 4:49 PM, Abraham Elmahrek <ab...@cloudera.com>
> wrote:
>
>> Hey man,
>>
>> Could you send over the following:
>>
>>    - Command that you're running
>>    - Version of Sqoop you're using
>>    - Run the command with --verbose and send the full output?
>>
>> -Abe
>>
>> On Tue, Jul 21, 2015 at 12:43 PM, Nikolay Christov <nm...@gmail.com>
>> wrote:
>>
>>> co
>>
>>
>>
>>
>

Re: Sqoop pgsql import utf-8 error

Posted by Nikolay Christov <nm...@gmail.com>.
Hey Abe, thanks for the response. I figured out what the issue was and
will post it here in case someone else bumps into this or has a better
solution.

The data I was reading contained badly formed UTF-8 (not valid UTF-8).
PostgreSQL 9.3 didn't handle this correctly and returned stripped or
altered characters, depending on the pg client's encoding. PostgreSQL 9.4
fixed that: when the data is not valid UTF-8 but the client encoding is
set to UTF8, the server now throws an error. To work around it, I cast
the bad column to a byte array and convert it to base64 during the Sqoop
import:
encode(name::bytea, 'base64')
Since Hive has native support for base64, I can use unbase64() to get my
original non-UTF-8 string back. It would be nice to skip this encoding
yoga and dump those non-UTF-8 strings directly into Hive.
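You can sanity-check why base64 survives where UTF-8 decoding fails without
touching the database; here is a minimal Python sketch using the exact byte
sequence quoted below (the 0xae byte is the offending one):

```python
import base64

# Byte representation of the bad value from this thread:
# "iText <0xae>, a JAVA-PDF library" -- 0xae alone is not valid UTF-8.
raw = bytes.fromhex("695465787420ae2c2061204a4156412d504446206c696272617279")

# Decoding as UTF-8 fails, which is what PostgreSQL 9.4 now reports
# instead of silently stripping or altering the character.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Base64 round-trips the raw bytes losslessly, which is the same idea as
# encode(name::bytea, 'base64') on the Postgres side and unbase64() in Hive.
b64 = base64.b64encode(raw).decode("ascii")
roundtrip = base64.b64decode(b64)
```

This is only a demonstration of the encoding round trip, not of Sqoop or
Hive behavior themselves.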

I am on pgsql 9.4
sqoop 1.4.5.2.2.0.0-2041
byte representation of bad UTF8:
\x695465787420ae2c2061204a4156412d504446206c696272617279
(iText ®, a JAVA-PDF library)
command: sqoop import --verbose --username xxxxx --password xxxxx --connect
"jdbc:postgresql://xxxxx:5432/db_name" --split-by id --target-dir
/tmp/oozietemp --num-mappers 20 --direct --hive-table=metadata
--hive-import --hive-overwrite --query "select name, id from table where
\$CONDITIONS"
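
To be explicit, the workaround just wraps the affected column inside the
free-form query; roughly like this (a sketch with the same placeholders as
above, not the exact command I ran):

```shell
sqoop import --verbose --username xxxxx --password xxxxx \
  --connect "jdbc:postgresql://xxxxx:5432/db_name" \
  --split-by id --target-dir /tmp/oozietemp --num-mappers 20 --direct \
  --hive-table=metadata --hive-import --hive-overwrite \
  --query "select encode(name::bytea, 'base64') as name, id from table where \$CONDITIONS"
```

On the Hive side, unbase64(name) then recovers the original bytes.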

some stack trace:
15/07/21 17:43:44 INFO mapreduce.Job: Task Id :
attempt_1433455729125_6011_m_000011_2, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
        at
org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
        at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:553)
        at
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at
org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.postgresql.util.PSQLException: ERROR: invalid regular
expression: invalid escape \ sequence
        at
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2270)
        at
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1998)
        at
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:570)
        at
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:420)
        at
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:305)
        at
org.apache.sqoop.mapreduce.db.DBRecordReader.executeQuery(DBRecordReader.java:111)
        at
org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:235)
        ... 12 more

On Tue, Jul 21, 2015 at 4:49 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:

> Hey man,
>
> Could you send over the following:
>
>    - Command that you're running
>    - Version of Sqoop you're using
>    - Run the command with --verbose and send the full output?
>
> -Abe
>
> On Tue, Jul 21, 2015 at 12:43 PM, Nikolay Christov <nm...@gmail.com>
> wrote:
>
>> co
>
>
>
>

Re: Sqoop pgsql import utf-8 error

Posted by Abraham Elmahrek <ab...@cloudera.com>.
Hey man,

Could you send over the following:

   - Command that you're running
   - Version of Sqoop you're using
   - Run the command with --verbose and send the full output?

-Abe

On Tue, Jul 21, 2015 at 12:43 PM, Nikolay Christov <nm...@gmail.com>
wrote:

> co