Posted to common-user@hadoop.apache.org by Nick Jones <ni...@amd.com> on 2010/01/28 20:39:05 UTC

DBOutputFormat Speed Issues

Hi all,
I have a use case for collecting several rows of compressed/unstructured 
data from MySQL (n rows), expanding the data set, and storing the 
expanded results back into a MySQL DB (100,000n rows). DBInputFormat 
seems to perform reasonably well, but DBOutputFormat is inserting rows 
one by one.  How can I take advantage of MySQL's support for multi-row 
INSERTs, i.e. fewer insert statements with more values in each one?
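
For illustration, a minimal JDBC sketch of the kind of multi-row INSERT 
meant here; the table, columns, and connection details are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Minimal sketch of a multi-valued MySQL INSERT over JDBC: one statement,
// one round trip, three rows. Table, columns and connection details are
// placeholders for illustration only.
public class MultiRowInsertSketch {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/expanded_db", "user", "pass");
    String sql =
        "INSERT INTO expanded (id, payload) VALUES (?, ?), (?, ?), (?, ?)";
    PreparedStatement stmt = conn.prepareStatement(sql);
    for (int row = 0; row < 3; row++) {
      stmt.setLong(2 * row + 1, row);              // id parameter for this row
      stmt.setString(2 * row + 2, "value-" + row); // payload parameter
    }
    stmt.executeUpdate();  // all three rows inserted by a single statement
    stmt.close();
    conn.close();
  }
}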

Thanks.
-- 
Nick Jones


RE: DBOutputFormat Speed Issues

Posted by "Jones, Nick" <ni...@amd.com>.
Sonal,
I'll try to put together something soon.  I haven't located documentation
on DataDrivenDBInputFormat yet, which I'd need in order to implement it. :)

Nick Jones
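
For reference, a rough sketch of how DataDrivenDBInputFormat can be wired 
up with the new mapreduce API, since documentation is sparse; the 
connection details, table, and column names below are placeholders, and 
MyRecord is just an illustrative record type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class DataDrivenInputSketch {

  // Placeholder record type: one (id, payload) row from the source table.
  public static class MyRecord implements DBWritable, Writable {
    long id;
    String payload;
    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      payload = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, payload);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      payload = Text.readString(in);
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      Text.writeString(out, payload);
    }
  }

  public static Job configureInput() throws Exception {
    Configuration conf = new Configuration();
    // JDBC driver, connection URL, user and password for the source DB.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/source_db", "user", "pass");

    Job job = new Job(conf, "expand-rows");
    job.setInputFormatClass(DataDrivenDBInputFormat.class);
    // Splits are derived from MIN/MAX of the split-by column ("id" here),
    // so each map task reads a disjoint key range in parallel.
    DataDrivenDBInputFormat.setInput(job, MyRecord.class,
        "compressed_rows",          // table name
        null,                       // optional WHERE conditions
        "id",                       // split-by column
        "id", "payload");           // columns to select
    return job;
  }
}

Because splits come from the min/max range of the split-by column, a 
roughly uniformly distributed numeric or date column works best.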

-----Original Message-----
From: Sonal Goyal [mailto:sonalgoyal4@gmail.com] 
Sent: Monday, February 01, 2010 10:03 AM
To: common-user@hadoop.apache.org
Subject: Re: DBOutputFormat Speed Issues

Hi Nick,

If you don't mind, can you please share your performance benchmarks of using
DataDrivenDBInputFormat/DBInputFormat with MySQL?

Thanks and Regards,
Sonal


On Mon, Feb 1, 2010 at 3:33 AM, Aaron Kimball <aa...@cloudera.com> wrote:

> Nick,
>
> I'm afraid that right now the only available OutputFormat for JDBC is that
> one. You'll note that DBOutputFormat doesn't really include much support
> for
> special-casing to MySQL or other targets.
>
> Your best bet is probably to copy the code from DBOutputFormat and
> DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and
> modify the code in the RecordWriter to generate PreparedStatements
> containing batched insert statements.
>
> If you arrive at a solution which is pretty general-purpose/robust, please
> consider contributing it back to the Hadoop project :) If you do so, send
> me
> an email off-list; I'm happy to help with advice on developing better DB
> integration code, reviewing your work, etc.
>
> Also on the input side, you should really be using DataDrivenDBInputFormat
> instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 /
> CDH 0.20) has pretty good support for parallel imports, and uses this
> InputFormat instead.
>
> - Aaron
>
> On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <ni...@amd.com> wrote:
>
> > Hi all,
> > I have a use case for collecting several rows of compressed/unstructured
> > data from MySQL (n rows), expanding the data set, and storing the
> > expanded results back into a MySQL DB (100,000n rows). DBInputFormat
> > seems to perform reasonably well, but DBOutputFormat is inserting rows
> > one by one.  How can I take advantage of MySQL's support for multi-row
> > INSERTs, i.e. fewer insert statements with more values in each one?
> >
> > Thanks.
> > --
> > Nick Jones
> >
> >
>

Re: DBOutputFormat Speed Issues

Posted by Sonal Goyal <so...@gmail.com>.
Hi Nick,

If you don't mind, can you please share your performance benchmarks of using
DataDrivenDBInputFormat/DBInputFormat with MySQL?

Thanks and Regards,
Sonal


On Mon, Feb 1, 2010 at 3:33 AM, Aaron Kimball <aa...@cloudera.com> wrote:

> Nick,
>
> I'm afraid that right now the only available OutputFormat for JDBC is that
> one. You'll note that DBOutputFormat doesn't really include much support
> for
> special-casing to MySQL or other targets.
>
> Your best bet is probably to copy the code from DBOutputFormat and
> DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify
> the code in the RecordWriter to generate PreparedStatements containing
> batched insert statements.
>
> If you arrive at a solution which is pretty general-purpose/robust, please
> consider contributing it back to the Hadoop project :) If you do so, send
> me
> an email off-list; I'm happy to help with advice on developing better DB
> integration code, reviewing your work, etc.
>
> Also on the input side, you should really be using DataDrivenDBInputFormat
> instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 /
> CDH 0.20) has pretty good support for parallel imports, and uses this
> InputFormat instead.
>
> - Aaron
>
> On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <ni...@amd.com> wrote:
>
> > Hi all,
> > I have a use case for collecting several rows of compressed/unstructured
> > data from MySQL (n rows), expanding the data set, and storing the
> > expanded results back into a MySQL DB (100,000n rows). DBInputFormat
> > seems to perform reasonably well, but DBOutputFormat is inserting rows
> > one by one.  How can I take advantage of MySQL's support for multi-row
> > INSERTs, i.e. fewer insert statements with more values in each one?
> >
> > Thanks.
> > --
> > Nick Jones
> >
> >
>

Re: DBOutputFormat Speed Issues

Posted by Aaron Kimball <aa...@cloudera.com>.
Nick,

I'm afraid that right now the only available OutputFormat for JDBC is that
one. You'll note that DBOutputFormat doesn't really include much support for
special-casing to MySQL or other targets.

Your best bet is probably to copy the code from DBOutputFormat and
DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify
the code in the RecordWriter to generate PreparedStatements containing
batched insert statements.
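
Roughly, the core of that change might look like the sketch below; the 
class name, constructor, and batch size are illustrative only, not 
anything that ships with Hadoop:

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Sketch of the core change inside a copied DBOutputFormat: queue each
// record on a JDBC batch instead of executing one INSERT per write(),
// then flush every BATCH_SIZE records and once more at close().
public class MySQLDBRecordWriter<K extends DBWritable, V>
    extends RecordWriter<K, V> {

  private static final int BATCH_SIZE = 1000;  // illustrative; tune as needed

  private final Connection connection;
  private final PreparedStatement statement;   // the shared INSERT statement
  private int pending = 0;

  public MySQLDBRecordWriter(Connection connection, PreparedStatement statement) {
    this.connection = connection;
    this.statement = statement;
  }

  @Override
  public void write(K key, V value) throws IOException {
    try {
      key.write(statement);        // bind this record's fields to the INSERT
      statement.addBatch();        // queue it rather than executing it now
      if (++pending >= BATCH_SIZE) {
        statement.executeBatch();  // one round trip for the whole batch
        pending = 0;
      }
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException {
    try {
      if (pending > 0) {
        statement.executeBatch();  // flush the tail of the batch
      }
      connection.commit();
      statement.close();
      connection.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}

The copied getRecordWriter() would still build the Connection and the 
INSERT PreparedStatement the same way DBOutputFormat does today and hand 
them to this writer; with MySQL Connector/J, adding 
rewriteBatchedStatements=true to the JDBC URL additionally lets the driver 
collapse each batch into multi-valued INSERT statements.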

If you arrive at a solution which is pretty general-purpose/robust, please
consider contributing it back to the Hadoop project :) If you do so, send me
an email off-list; I'm happy to help with advice on developing better DB
integration code, reviewing your work, etc.

Also on the input side, you should really be using DataDrivenDBInputFormat
instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 /
CDH 0.20) has pretty good support for parallel imports, and uses this
InputFormat instead.

- Aaron

On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <ni...@amd.com> wrote:

> Hi all,
> I have a use case for collecting several rows of compressed/unstructured
> data from MySQL (n rows), expanding the data set, and storing the
> expanded results back into a MySQL DB (100,000n rows). DBInputFormat
> seems to perform reasonably well, but DBOutputFormat is inserting rows
> one by one.  How can I take advantage of MySQL's support for multi-row
> INSERTs, i.e. fewer insert statements with more values in each one?
>
> Thanks.
> --
> Nick Jones
>
>