Posted to user@sqoop.apache.org by Jason Rosenberg <jb...@squareup.com> on 2014/03/18 22:23:36 UTC

direct mode for mysql export vs. default jdbc

Hi,

I'm wondering if there are expected performance increases from using the
--direct flag for exporting from Hive to MySQL.  If so, how much speedup?

Also, I've been getting lock contention errors during export, and I'm
wondering whether these are less likely in --direct mode?  E.g. I'm getting
these sorts of exceptions on the Sqoop console:

14/03/18 14:44:15 INFO mapred.JobClient: Task Id : attempt_201403180842_0202_m_000002_1, Status : FAILED
java.io.IOException: Can't export data, please check failed map task logs
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
        at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.IOException: java.sql.BatchUpdateException: Deadlock found when trying to get lock; try restarting transaction
        at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWr


Thanks,

Jason

Re: direct mode for mysql export vs. default jdbc

Posted by Kathleen Ting <ka...@apache.org>.
Hi Jason, unfortunately that is a limitation of the mysqldump utility itself.

Did reducing the number of mappers help?

Thanks,
Kate

Re: direct mode for mysql export vs. default jdbc

Posted by Jason Rosenberg <jb...@squareup.com>.
Thanks Kate,

Very helpful.  Is there any chance of the null substitution handling being
supported in direct mode in the future?  Or is it a fundamental issue with
Hive itself?

I'll try reducing the number of mappers further....

Jason

Re: direct mode for mysql export vs. default jdbc

Posted by Kathleen Ting <ka...@apache.org>.
Hi Jason,

Sorry for the delay.

Generally speaking, you'll want to decrease num-mappers (default is 4)
to lessen the load on the db (but ingest rate will decline) and you'll
want to increase num-mappers to improve the ingest rate (but db load
will be negatively impacted). Beyond that, we can't give any
recommendation with regard to the number of mappers, as every
environment is different. The main bottleneck in a Sqoop job is the
shared database system. Sqoop can scale only to the extent that is
allowed by the particular database. Furthermore, this can differ from
table to table even within a single database system depending on what
disks the particular import table uses.
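
As a rough illustration, throttling an export to two concurrent writers
is just a matter of lowering --num-mappers; the connection and table
names below are made up:

# Fewer mappers means fewer concurrent transactions against MySQL,
# which should reduce deadlock retries at the cost of throughput.
sqoop export \
  --connect jdbc:mysql://db.example.com/analytics \
  --username etl_user \
  --table events \
  --export-dir /user/hive/warehouse/analytics.db/events \
  --num-mappers 2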

The MySQL direct connector uses a native utility called mysqldump to
perform a highly efficient data transfer between the MySQL server and
Hadoop cluster. This utility unfortunately does not support using
custom NULL substitution strings and will always import missing values
as the string constant "NULL". This is very confusing on the Hive side, as
the Hive shell will display the value as NULL as well. It won't be
perceived as a missing value, but as a valid string constant. You need
to turn off direct mode (by omitting the --direct option) in order to
override the default NULL substitution string.
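
Concretely, a JDBC-mode import (no --direct) can honor a custom NULL
substitution string via the --null-string options, so missing values land
in Hive as true NULLs; a minimal sketch, with made-up connection and
table names:

# Hive encodes SQL NULL as \N in its text files; these options make
# Sqoop write \N instead of the literal string "NULL".
sqoop import \
  --connect jdbc:mysql://db.example.com/analytics \
  --username etl_user \
  --table events \
  --hive-import \
  --null-string '\\N' \
  --null-non-string '\\N'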

Regards,
Kate

Re: direct mode for mysql export vs. default jdbc

Posted by Jason Rosenberg <jb...@squareup.com>.
Ping!

Re: direct mode for mysql export vs. default jdbc

Posted by Jason Rosenberg <jb...@squareup.com>.
Hi Kathleen, et al.,

Any thoughts on my parameters above?  Would they likely work in --direct
mode?

Thanks,

Jason

Re: direct mode for mysql export vs. default jdbc

Posted by Jason Rosenberg <jb...@squareup.com>.
Hi Kathleen,

The sqoop command looks like this (a bash function that takes a few args
naming the source/dest table names, update key, and dest columns):

# Args: $1 = Hive table dir, $2 = MySQL table, $3 = update key, $4 = column list
sqoop_table () {
  sqoop export \
    --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
    --username $MYSQL_USER \
    --outdir $LOCAL_TEMP \
    --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
    --table $2 \
    --update-key $3 \
    --update-mode allowinsert \
    --columns $4 \
    --input-fields-terminated-by "\t" \
    --input-null-string '\\N' \
    --input-null-non-string '\\N' \
    --batch \
    --num-mappers 8
}

The lock contention exceptions seem to be retried, up to a point.  If they
happen too often, the Hadoop JobTracker eventually decides to kill the
job.  Even when the job succeeds, there are usually a few of these
exceptions.  I originally had many more mappers configured, but here I've
reduced it to 8, and this seems to help (although it's not a guarantee).
Reducing the num-mappers also makes it less likely that the target MySQL
db machine will get overloaded (e.g. it was getting maxed out on CPU usage
with 64 mappers, etc.).
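
I may also try shrinking each export transaction, independent of the
mapper count; a hedged sketch using what I understand to be Sqoop 1's
export tuning properties (table name made up, and I haven't tested how
these interact with --batch):

# Smaller per-transaction batches hold fewer InnoDB row locks at a time.
# records.per.statement = rows per generated multi-row INSERT;
# statements.per.transaction = statements between commits.
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=1 \
  --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
  --username $MYSQL_USER \
  --table my_table \
  --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/my_table \
  --num-mappers 8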

The number of records being sqooped is on the order of 1-5M at a time.

Sounds like I should try --direct mode (but not sure if the null/delimiters
we're using will work with it?).

Jason

Re: direct mode for mysql export vs. default jdbc

Posted by Kathleen Ting <ka...@apache.org>.
Hi Jason,

Rather than using the JDBC interface for transferring data, the direct
mode delegates the job of transferring data to the native utilities
provided by the database vendor. In the case of MySQL, the mysqldump
and mysqlimport utilities are used for retrieving data from the database
server or moving data back. Using native utilities will greatly
improve performance, as they are optimized to provide the best
possible transfer speed while putting less burden on the database
server. That said, there are several limitations that come with this
faster import. In the case of MySQL, each node hosting a TaskTracker
service needs to have both mysqldump and mysqlimport utilities
installed. Another limitation of the direct mode is that not all
parameters are supported. As the native utilities usually produce text
output, binary formats like SequenceFile or Avro won't work. Also,
parameters that customize the escape characters, type mapping, column
and row delimiters, or the NULL substitution string might not be
supported in all cases.
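
For comparison, enabling direct mode is just the --direct flag on an
otherwise ordinary invocation; a minimal sketch with made-up names
(mysqldump/mysqlimport must be installed on every TaskTracker node):

# In direct mode the map tasks invoke mysqlimport for the export, so
# the delimiter/NULL-substitution options discussed above may not apply.
sqoop export --direct \
  --connect jdbc:mysql://db.example.com/analytics \
  --username etl_user \
  --table events \
  --export-dir /user/hive/warehouse/analytics.db/events \
  --num-mappers 4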

Can you share your entire Sqoop command and the contents of failed
task attempt attempt_201403180842_0202_m_000002_1?

Thanks,
Kate

Re: direct mode for mysql export vs. default jdbc

Posted by Jason Rosenberg <jb...@squareup.com>.
Thoughts anyone?

Thanks,

Jason