Posted to user@crunch.apache.org by Martijn Lenderink <ma...@gmail.com> on 2013/03/18 14:50:01 UTC

JDBC parallel

Hello,

I have a working JDBC-connection to get data from an MSSQL source.
It all works great, except that my cluster only opens one connection to the
MSSQL server.

I have multiple nodes running, but the data gets pulled from only one node
and then gets sent to the other nodes for processing.

I'm using code similar to the following:
https://github.com/apache/incubator-crunch/blob/master/crunch-contrib/src/it/java/org/apache/crunch/contrib/io/jdbc/DataBaseSourceIT.java

The only difference is that I'm using the DataDrivenDBInputFormat.

When I debug the source code, the query gets split into multiple queries, but
they only get executed on one machine.
Why isn't this executed in parallel with multiple connections to the MSSQL
server?
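
To make "the query gets split into multiple queries" concrete, here is a
rough, self-contained sketch of range-based splitting on a numeric column.
It is an illustration only and not Hadoop's actual splitter code: the class
RangeSplitSketch, the method splitClauses, and the column name "id" are all
hypothetical, and the real DataDrivenDBInputFormat may pick different
boundaries. The idea is that each WHERE clause becomes its own input split,
and each split can then be read by a separate map task over its own
connection, provided the cluster actually schedules those tasks in parallel.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not Hadoop or Crunch API.
class RangeSplitSketch {

    // Builds one WHERE clause per split, covering [min, max] on the column.
    // Note: span * i can overflow for very large ranges; fine for a sketch.
    public static List<String> splitClauses(String column, long min, long max,
                                            int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        if (max < min) {
            throw new IllegalArgumentException("max must be >= min");
        }
        long span = max - min;

        // Compute split boundaries, skipping duplicates for narrow ranges.
        List<Long> bounds = new ArrayList<>();
        for (int i = 0; i <= numSplits; i++) {
            long b = (i == numSplits) ? max : min + span * i / numSplits;
            if (bounds.isEmpty() || bounds.get(bounds.size() - 1).longValue() != b) {
                bounds.add(b);
            }
        }

        List<String> clauses = new ArrayList<>();
        if (bounds.size() == 1) {
            // min == max: a single split covers the only value.
            clauses.add(column + " >= " + min + " AND " + column + " <= " + max);
            return clauses;
        }
        for (int i = 0; i + 1 < bounds.size(); i++) {
            // Only the last split includes its upper bound.
            boolean last = (i + 2 == bounds.size());
            clauses.add(column + " >= " + bounds.get(i) + " AND " + column
                    + (last ? " <= " : " < ") + bounds.get(i + 1));
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Assumed example: ids from 0 to 99, read by four map tasks.
        for (String clause : splitClauses("id", 0, 99, 4)) {
            System.out.println("SELECT ... WHERE " + clause);
        }
    }
}
```

Producing several clauses only shows that the work can be divided. Whether
the resulting tasks run on different nodes at the same time is decided by
the cluster's scheduler and configuration, not by the splitting step.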

Greetings,
Martijn Lenderink

Re: JDBC parallel

Posted by Martijn Lenderink <ma...@gmail.com>.
Hello,

Thanks for the response.
I found out the problem was an error in my YARN config, not in Crunch.
It's fixed now.

Greetings,
Martijn Lenderink

-- 

Kind regards,
Martijn Lenderink

Re: JDBC parallel

Posted by Matthias Friedrich <ma...@mafr.de>.
Hi,

IIRC, the code in Crunch is inherently sequential and meant for
small(ish) amounts of data. After all, a distributed read with Hadoop
from an RDBMS is often considered a DDoS attack :)

Regards,
  Matthias


Re: JDBC parallel

Posted by Josh Wills <jw...@cloudera.com>.
Hey Martijn,

I don't have any intuition on this one-- is this code that you could post
as a gist or something so I could play with it and see if I see anything
amiss? The trick will be figuring out if the problem is in Crunch, the
underlying DB library, or the config.

J


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>