Posted to user@pig.apache.org by Dexin Wang <wa...@gmail.com> on 2011/02/18 22:41:53 UTC

Pig optimization getting in the way?

I ran into a problem that I have spent quite some time on, and I'm starting to
think it's probably an optimization Pig is doing that makes this hard.

This is my pseudo code:

raw = LOAD ...

then some crazy stuff like
filter
join
group
UDF
etc

A = the result from above operation
STORE A INTO 'dummy' USING myJDBC(write to table1);

This works fine and I have 4 map-red jobs.

Then I add this after that:

B = FILTER A BY col1="xyz";
STORE B INTO 'dummy2' USING myJDBC(write to table2);

Basically I filter A and write the result to another table through JDBC.

Then the jobs started failing with "PSQLException: This statement has been
closed".

My workaround for now is to add "EXEC;" before the B line so that the two writes
to the DB happen in sequence. This works, but it runs the same map-red jobs
twice - I ended up with 8 jobs.

I think the reason for the failure without the EXEC line is that Pig tries to
do the two STOREs in the same reducer (or maybe mapper), since B only involves
a FILTER, which doesn't require a separate map-red job, and things then get
confused.

Is there a way for this to work without having to duplicate the jobs? Thanks
a lot!

Re: Pig optimization getting in the way?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I was suggesting just a connection per record writer, not a connection per
(jdbc, table) pair. That way you are safe even if you are writing into the same
table from two streams in the same JVM.
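
Roughly something like this, as an untested sketch (not the actual myJDBC code;
the constructor arguments and SQL handling are made up, and it assumes the
Hadoop mapreduce RecordWriter API):

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.pig.data.Tuple;

    public class PerWriterJdbcRecordWriter extends RecordWriter<WritableComparable, Tuple> {

        // Owned by this writer instance -- nothing static, so two STOREs running in
        // the same task/JVM cannot close each other's connection or statements.
        private final Connection conn;
        private final PreparedStatement insert;

        public PerWriterJdbcRecordWriter(String jdbcUrl, String user, String pass,
                                         String insertSql) throws Exception {
            conn = DriverManager.getConnection(jdbcUrl, user, pass);
            insert = conn.prepareStatement(insertSql);
        }

        @Override
        public void write(WritableComparable key, Tuple value) throws IOException {
            try {
                // Bind each tuple field to the corresponding INSERT parameter.
                for (int i = 0; i < value.size(); i++) {
                    insert.setObject(i + 1, value.get(i));
                }
                insert.executeUpdate();
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            try {
                insert.close();
                conn.close();  // only tears down this writer's own connection
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
    }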

D

On Tue, Feb 22, 2011 at 10:18 AM, Dexin Wang <wa...@gmail.com> wrote:

> So I can create multiple db connections for each (jdbc_url, table) pairs
> and
> map each pair to its own connection for record writer. Is that what you are
> suggesting? Sounds like a good plan. Thanks.

Re: Pig optimization getting in the way?

Posted by Dexin Wang <wa...@gmail.com>.
So I can create a separate DB connection for each (jdbc_url, table) pair and
map each pair to its own connection in the record writer. Is that what you are
suggesting? Sounds like a good plan. Thanks.
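
Something like this is what I have in mind, as a rough, untested sketch (the
class name and key format are just illustrative):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    public class ConnectionRegistry {

        // One connection per (jdbc_url, table) pair, instead of a single shared
        // static connection for everything.
        private static final Map<String, Connection> CONNECTIONS =
                new HashMap<String, Connection>();

        public static synchronized Connection forTable(String jdbcUrl, String table)
                throws SQLException {
            String key = jdbcUrl + "|" + table;
            Connection conn = CONNECTIONS.get(key);
            if (conn == null || conn.isClosed()) {
                conn = DriverManager.getConnection(jdbcUrl);
                CONNECTIONS.put(key, conn);
            }
            return conn;
        }
    }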

On Fri, Feb 18, 2011 at 5:31 PM, Thejas M Nair <te...@yahoo-inc.com> wrote:

>  As you are suspecting, both store functions are probably running in the
> same map or reduce task. This is a result of multi-query optimization.
> Try pig –e ‘explain –script yourscript.pig’ to see the query plan, and you
> will be able to verify if the store is happening the same map/reduce task.
>
> Can you can make the db connection a member of the store function/ record
> writer?
> You can also use  "-no_multiquery" to prevent multi-query optimization from
> happening, but that will also result in the MR job being executed again for
> other output.
>
> Thanks,
> Thejas

Re: Pig optimization getting in the way?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Open a new connection per Storage instance?
Better yet, use a connection pool?
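
(For the pool option, a minimal sketch assuming something like Commons DBCP is
on the classpath; the driver, URL, credentials, and class name below are made up:)

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import org.apache.commons.dbcp.BasicDataSource;

    public class PooledJdbcSink {

        // One pool per JVM; each storage instance borrows its own connection and
        // returns it when done, instead of closing a shared static connection.
        private static final BasicDataSource POOL = new BasicDataSource();
        static {
            POOL.setDriverClassName("org.postgresql.Driver");
            POOL.setUrl("jdbc:postgresql://dbhost/mydb");
            POOL.setUsername("user");
            POOL.setPassword("secret");
            POOL.setMaxActive(4);
        }

        public void writeRow(String insertSql, Object[] fields) throws SQLException {
            Connection conn = POOL.getConnection();
            try {
                PreparedStatement stmt = conn.prepareStatement(insertSql);
                try {
                    for (int i = 0; i < fields.length; i++) {
                        stmt.setObject(i + 1, fields[i]);
                    }
                    stmt.executeUpdate();
                } finally {
                    stmt.close();
                }
            } finally {
                conn.close();  // returns the connection to the pool, does not destroy it
            }
        }
    }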

On Fri, Feb 18, 2011 at 4:48 PM, Dexin Wang <wa...@gmail.com> wrote:

> I hope that's the case. But
>
>  *mapred.job.reuse.jvm.num.tasks* 1
> However it does seem to be doing the write to two DB tables in the same job
> so although it's not re-using jvm, it is already in one jvm since it's the
> same task!
>
> And since the DB connection is static/singleton as you mentioned, and table
> name (which is the only thing that's different) is not part of connection
> URL, they share the same DB connection, and one of them will close the
> connection when it's done.
>
> Hmm, any suggestions how we can handle this? Thanks.

Re: Pig optimization getting in the way?

Posted by Thejas M Nair <te...@yahoo-inc.com>.
As you suspect, both store functions are probably running in the same map or reduce task. This is a result of the multi-query optimization.
Try pig -e 'explain -script yourscript.pig' to see the query plan; that will let you verify whether both stores happen in the same map/reduce task.

Can you make the db connection a member of the store function / record writer?
You can also use "-no_multiquery" to prevent the multi-query optimization from happening, but that will also result in the shared MR jobs being executed again for the other output.

Thanks,
Thejas



On 2/18/11 4:48 PM, "Dexin Wang" <wa...@gmail.com> wrote:

I hope that's the case. But

 *mapred.job.reuse.jvm.num.tasks* 1
However it does seem to be doing the write to two DB tables in the same job
so although it's not re-using jvm, it is already in one jvm since it's the
same task!

And since the DB connection is static/singleton as you mentioned, and table
name (which is the only thing that's different) is not part of connection
URL, they share the same DB connection, and one of them will close the
connection when it's done.

Hmm, any suggestions how we can handle this? Thanks.




Re: Pig optimization getting in the way?

Posted by Dexin Wang <wa...@gmail.com>.
I hope that's the case. But:

 *mapred.job.reuse.jvm.num.tasks* is set to 1.

However, it does seem to be doing the writes to the two DB tables in the same
job, so although it's not reusing the JVM, both stores already run in one JVM
since it's the same task!

And since the DB connection is static/singleton as you mentioned, and the table
name (which is the only thing that differs) is not part of the connection URL,
the two stores share the same DB connection, and one of them closes the
connection when it's done.
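
In other words, inside one task it effectively plays out like this simplified,
made-up sketch (not our actual myJDBC code):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class SharedConnectionProblem {

        // Single static connection shared by every store function in the task's JVM.
        private static Connection shared;

        static synchronized Connection get(String url) throws SQLException {
            if (shared == null) {
                shared = DriverManager.getConnection(url);
            }
            return shared;
        }

        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://dbhost/mydb";  // made-up URL

            PreparedStatement toTable1 = get(url).prepareStatement("INSERT INTO table1 VALUES (?)");
            PreparedStatement toTable2 = get(url).prepareStatement("INSERT INTO table2 VALUES (?)");

            toTable1.setString(1, "abc");
            toTable1.executeUpdate();
            get(url).close();   // the first store finishes and closes "its" connection

            // The second store's statement now fails, e.g. with
            // "This statement has been closed".
            toTable2.setString(1, "xyz");
            toTable2.executeUpdate();
        }
    }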

Hmm, any suggestions how we can handle this? Thanks.

On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Let me guess -- you have a static JDBC connection that you open in myJDBC,
> and you have jvm reuse turned on.

Re: Pig optimization getting in the way?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Let me guess -- you have a static JDBC connection that you open in myJDBC,
and you have jvm reuse turned on.

On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <wa...@gmail.com> wrote:

> I ran into a problem that I have spent quite some time on and start to
> think
> it's probably pig's doing something optimization that makes this thing
> hard.
>
> This is my pseudo code:
>
> raw = LOAD ...
>
> then some crazy stuff like
> filter
> join
> group
> UDF
> etc
>
> A = the result from above operation
> STORE A INTO 'dummy' USING myJDBC(write to table1);
>
> This works fine and I have 4 map-red jobs.
>
> Then I add this after that:
>
> B = FILTER A BY col1="xyz";
> STORE B INTO 'dummy2' USING myJDBC(write to table2);
>
> basically I do some filtering of A and write it to another table thru JDBC.
>
> Then I had the problem of jobs failing and saying "PSQLException: This
> statement has been closed".
>
> My workaround now is to add "EXEC;" before B line and make them write to DB
> in sequence. This works but now it would run the same map-red jobs twice -
> I
> ended up with 8 jobs.
>
> I think the reason for the failure without EXEC line is because pig tries
> to
> do the two STORE in the same reducer (or mapper maybe) since B only
> involves
> FILTER which doesn't require a separate map-red job and then got confused.
>
> Is there a way for this to work without having to duplicate the jobs?
> Thanks
> a lot!
>