Posted to user@pig.apache.org by Dragos Munteanu <dm...@sdl.com> on 2011/02/18 21:26:13 UTC

Pig script only works with no_multiquery

Hi all,

I have a Pig script that only runs if I turn on "-no_multiquery".
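
(For reference, the two invocations look roughly like this; the script file
name is just a placeholder:)

  pig -no_multiquery rule_stats.pig    # each group-by runs as its own job; works
  pig rule_stats.pig                   # multiquery is on by default; fails as described below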

What the script does is this:
- read from disk a relation where each tuple has 10 fields, one of which is
a count
- take each non-count field in turn, group by it, and sum the counts for
each group.
The full code is included at the end of the email.

With "-no_multiquery" each of the groups is processed individually, and
things work just fine.
Without that option, I get a bunch of
java.lang.OutOfMemoryError: GC overhead limit exceeded
And the failed job message says:
JobId   Alias   Feature Message Outputs
job_201101201235_0287   merged_rules,statL_rules,statL_rules_grouped,statL_totals,statLt_rules,statLt_rules_grouped,statLt_totals,statR_rules,statR_rules_grouped,statR_totals,statRt_rules,statRt_rules_grouped,statRt_totals,statT_rules,statT_rules_grouped,statT_totals    MULTI_QUERY,COMBINER    Message: Job failed!

I'm running pig 0.8.0, on hadoop 0.20.2 and java 1.6.0_06.

My questions are:
- is it expected that Pig's multiquery execution would create enough of an
overhead that the execution should fail?
- can someone explain (or point me to an explanation) of where the
multiquery overhead comes from? I'd really like to understand it
- is there a better way to write the pig code to do that computation? Maybe
I can re-structure my computation, or configure my cluster differently (one
example of the kind of setting I mean is sketched right after these
questions)? Or am I stuck with a no_multiquery execution?
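
(To make the cluster-configuration part of that question concrete, the kind
of knob I have in mind is the per-task heap in mapred-site.xml; the value
below is just an illustration, I have not verified that it helps:)

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>  <!-- heap for each map/reduce task JVM -->
  </property>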

Many thanks,
Dragos Munteanu


CODE:
merged_rules = LOAD 'RuleProcess.xTCxi/rules' AS (pruneType:int, dbkey:chararray, root:chararray, lhs:chararray, lhsTokens:chararray, rhs:chararray, rhsTokens:chararray, align:chararray, count:long, features:chararray);
-- compute stats: root
statT_rules = FOREACH merged_rules GENERATE root, count;
statT_rules_grouped = GROUP statT_rules BY root PARALLEL 30;
statT_totals = FOREACH statT_rules_grouped GENERATE FLATTEN(group), SUM(statT_rules.count) AS total;
STORE statT_totals INTO 'RuleProcess.xTCxi.4/stats.root' using PigStorage;
-- compute stats: lhs
statL_rules = FOREACH merged_rules GENERATE lhs, count;
statL_rules_grouped = GROUP statL_rules BY lhs PARALLEL 30;
statL_totals = FOREACH statL_rules_grouped GENERATE FLATTEN(group), SUM(statL_rules.count) AS total;
STORE statL_totals INTO 'RuleProcess.xTCxi.4/stats.lhs' using PigStorage;
-- compute stats: lhsTokens
statLt_rules = FOREACH merged_rules GENERATE lhsTokens, count;
statLt_rules_grouped = GROUP statLt_rules BY lhsTokens PARALLEL 30;
statLt_totals = FOREACH statLt_rules_grouped GENERATE FLATTEN(group), SUM(statLt_rules.count) AS total;
STORE statLt_totals INTO 'RuleProcess.xTCxi.4/stats.lhsTokens' using PigStorage;
-- compute stats: rhs
statR_rules = FOREACH merged_rules GENERATE rhs, count;
statR_rules_grouped = GROUP statR_rules BY rhs PARALLEL 30;
statR_totals = FOREACH statR_rules_grouped GENERATE FLATTEN(group), SUM(statR_rules.count) AS total;
STORE statR_totals INTO 'RuleProcess.xTCxi.4/stats.rhs' using PigStorage;
-- compute stats: rhsTokens
statRt_rules = FOREACH merged_rules GENERATE rhsTokens, count;
statRt_rules_grouped = GROUP statRt_rules BY rhsTokens PARALLEL 30;
statRt_totals = FOREACH statRt_rules_grouped GENERATE FLATTEN(group), SUM(statRt_rules.count) AS total;
STORE statRt_totals INTO 'RuleProcess.xTCxi.4/stats.rhsTokens' using PigStorage;


Re: Pig script only works with no_multiquery

Posted by Thejas M Nair <te...@yahoo-inc.com>.
Can you please open a jira with this information? - https://issues.apache.org/jira/browse/PIG
If you are able to create a sample script/data that can reproduce this issue, that would also be very useful.

As a workaround, you can probably split the query into independent queries, each having a smaller number of group-by-and-sums.
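
For example, something along these lines -- just a sketch that reuses the LOAD and a couple of the branches from your script; each file is run as a separate pig invocation, so each gets its own, smaller merged plan:

-- stats_part1.pig: root and lhs branches only
merged_rules = LOAD 'RuleProcess.xTCxi/rules' AS (pruneType:int, dbkey:chararray, root:chararray, lhs:chararray, lhsTokens:chararray, rhs:chararray, rhsTokens:chararray, align:chararray, count:long, features:chararray);
statT_rules = FOREACH merged_rules GENERATE root, count;
statT_rules_grouped = GROUP statT_rules BY root PARALLEL 30;
statT_totals = FOREACH statT_rules_grouped GENERATE FLATTEN(group), SUM(statT_rules.count) AS total;
STORE statT_totals INTO 'RuleProcess.xTCxi.4/stats.root' using PigStorage;
-- ... the lhs branch goes here ...

-- stats_part2.pig: the same LOAD again, followed by the lhsTokens/rhs/rhsTokens branches

Each script re-reads the input, so this trades extra scans of the data for smaller plans per job.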

Thanks,
Thejas




On 2/22/11 3:53 PM, "Dragos Munteanu" <dm...@sdl.com> wrote:

Thanks Thejas!

I tried the patch you mentioned, the one that fixes https://issues.apache.org/jira/browse/PIG-1815.
It helps a little, but as I try more complex scripts, the multiquery failures come back. Below are the details.

I'm running pig compiled from http://svn.apache.org/repos/asf/pig/branches/branch-0.8
checked out on Feb. 18, compiled with jdk1.6.0_24

My script does the following:
- read from disk a relation where each tuple has 10 fields, one of which is
a count
- take each non-count field in turn, group by it, and sum the counts for
each group.

Initially my script computed 5 such group-by-and-sum, which failed on the non-patched pig-0.8.
With the patch, this script worked just fine.
I then ran a script that does 15 group-by-and-sum (grouping also by pairs of fields). In this run, a couple of reducer attempts failed (Map output copy failure : java.lang.OutOfMemoryError: Java heap space) but the job as a whole succeeded.

I then ran a script that also does 15 group-bys, but for each group it performs a more complex computation (I provide a code example below). This time the job fails, and quickly. Just like above, a bunch of reducers fail with the "Java heap space" error; and the log of the entire job says:

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201101201235_0292   merged_rules,statLR_ccounts,statLR_rules,statLR_rules_grouped,statLR_tcounts,statLR_tcounts_grouped,statLR_totals,statL_ccounts,statL_rules,statL_rules_grouped,statL_tcounts,statL_tcounts_grouped,statL_totals,statLt_ccounts,statLt_rules,statLt_rules_grouped,statLt_tcounts,statLt_tcounts_grouped,statLt_totals,statR_ccounts,statR_rules,statR_rules_grouped,statR_tcounts,statR_tcounts_grouped,statR_totals,statRt_ccounts,statRt_rules,statRt_rules_grouped,statRt_tcounts,statRt_tcounts_grouped,statRt_totals,statTLR_ccounts,statTLR_rules,statTLR_rules_grouped,statTLR_tcounts,statTLR_tcounts_grouped,statTLR_totals,statTL_ccounts,statTL_rules,statTL_rules_grouped,statTL_tcounts,statTL_tcounts_grouped,statTL_totals,statTLtRt_ccounts,statTLtRt_rules,statTLtRt_rules_grouped,statTLtRt_tcounts,statTLtRt_tcounts_grouped,statTLtRt_totals,statTLt_ccounts,statTLt_rules,statTLt_rules_grouped,statTLt_tcounts,statTLt_tcounts_grouped,statTLt_totals,statTR_ccounts,statTR_rules,statTR_rules_grouped,statTR_tcounts,statTR_tcounts_grouped,statTR_totals,statTRt_ccounts,statTRt_rules,statTRt_rules_grouped,statTRt_tcounts,statTRt_tcounts_grouped,statTRt_totals,statT_ccounts,statT_rules,statT_rules_grouped,statT_tcounts,statT_tcounts_grouped,statT_totals      MULTI_QUERY,COMBINER    Message: Job failed!

Input(s):
Failed to read data from "hdfs://pig1/user/dmunteanu/RuleProcess.xTCxi/rules"

Output(s):

Counters:
Blah-blah

I'm not sure why it complains about "failed to read data"; my best guess is that it's because the job fails even before all mappers could be run.
The exact same script runs just fine with "no_multiquery", so the problem has to come from the multiquery optimization.

Below is a sample of my script.
Basically, it groups by something and then:

 *   sums up all the counts for the members of the group
 *   computes, for all members of the group, counts-of-counts (i.e. how many tuples in the group have the same count as the current tuple)

The example shows the computation for one group; this code is then repeated (with different relation names) for the other groups.

-- compute totals
statT_rules = FOREACH merged_rules GENERATE root, count;
statT_rules_grouped = GROUP statT_rules BY root PARALLEL 30;
statT_totals = FOREACH statT_rules_grouped GENERATE FLATTEN(group), SUM(statT_rules.count) AS total;

-- compute truncated counts (tcounts) and counts-of-counts (ccounts)
statT_tcounts = FOREACH statT_rules GENERATE root, count, (count >= 5 ? 5 : count) as tcount;
statT_tcounts_grouped = GROUP statT_tcounts BY (root,tcount) PARALLEL 30;
statT_ccounts = FOREACH statT_tcounts_grouped GENERATE FLATTEN(group), COUNT(statT_tcounts) AS ccount;

-- join and print
statT_joined = JOIN statT_totals BY group, statT_ccounts BY root;
-- the join caused the root to appear twice (WHADJP,1L,WHADJP,1L,1L), get rid of the second
statT_joined_filtered = FOREACH statT_joined GENERATE statT_totals::group AS root, statT_totals::total AS total, statT_ccounts::group::tcount AS tcount, statT_ccounts::ccount AS ccount;
statT_joined_grouped = GROUP statT_joined_filtered BY (root,total) PARALLEL 30;
statT_joined_print = FOREACH statT_joined_grouped GENERATE FLATTEN(group), statT_joined_filtered.(tcount,ccount);

STORE statT_joined_print INTO 'RuleProcess.xTCxi.2/stats.root' using PigStorage;


Many thanks,
Dragos


On 2/18/11 1:08 PM, "Thejas M Nair" <te...@yahoo-inc.com> wrote:

> Hi Dragos,
> You might be facing this issue -
> https://issues.apache.org/jira/browse/PIG-1815, it has been resolved in pig
> 0.8 branch after the official release.
> We are likely to release a new 0.8 patch (pending discussion) with the
> fixes. Does your pig jar have this fix ?
> If not , can you please try building with
> http://svn.apache.org/repos/asf/pig/branches/branch-0.8 and try again with
> the new jar?
>
>
>
>
> On 2/18/11 12:26 PM, "Dragos Munteanu" <dm...@sdl.com> wrote:
>
>> Hi all,
>>
>> I have a Pig script that only runs if I turn on "-no_multiquery".
>
>
>
>>
>> My questions are:
>> - is it expected that Pig's multiquery execution would create enough of an
>> overhead that the execution should fail?
>
> It is not expected to fail.
>
>> - can someone explain (or point me to an explanation) of where the
>> multiquery overhead comes from? I'd really like to understand it
>
> In case of multi-query you end up doing more computation per task, so an
> issue such as one PIG-1815 might not be causing failures in the non
> multiquery case. Also PIG-1815 is caused by physical plan copies not being
> freed and multi-query physical plan will be larger.
>
>> - is there a better way to write the pig code to do that computation? Maybe
>> I can re-structure my computation, or configure my cluster differently? Or
>> am I stuck with a no_multiquery execution?
>
> If your query does not work with latest from 0.8 branch, please let us know.
> -Thejas
>







Re: Pig script only works with no_multiquery

Posted by Dragos Munteanu <dm...@sdl.com>.
Thanks Thejas!

I tried the patch you mentioned, the one that fixes
https://issues.apache.org/jira/browse/PIG-1815.
It helps a little, but as I try more complex scripts, the multiquery
failures come back. Below are the details.

I'm running pig compiled from
http://svn.apache.org/repos/asf/pig/branches/branch-0.8
checked out on Feb. 18, compiled with jdk1.6.0_24

My script does the following:
- read from disk a relation where each tuple has 10 fields, one of which is
a count
- take each non-count field in turn, group by it, and sum the counts for
each group.

Initially my script computed 5 such group-by-and-sum, which failed on the
non-patched pig-0.8.
With the patch, this script worked just fine.
I then ran a script that does 15 group-by-and-sum (grouping also by pairs of
fields). In this run, a couple of reducer attempts failed (Map output copy
failure : java.lang.OutOfMemoryError: Java heap space) but the job as a
whole succeeded.

I then ran a script that also does 15 group-bys, but for each group it
performs a more complex computation (I provide a code example below). This
time the job fails, and quickly. Just like above, a bunch of reducers fail
with the "Java heap space" error; and the log of the entire job says:

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201101201235_0292   merged_rules,statLR_ccounts,statLR_rules,statLR_rules_grouped,statLR_tcounts,statLR_tcounts_grouped,statLR_totals,statL_ccounts,statL_rules,statL_rules_grouped,statL_tcounts,statL_tcounts_grouped,statL_totals,statLt_ccounts,statLt_rules,statLt_rules_grouped,statLt_tcounts,statLt_tcounts_grouped,statLt_totals,statR_ccounts,statR_rules,statR_rules_grouped,statR_tcounts,statR_tcounts_grouped,statR_totals,statRt_ccounts,statRt_rules,statRt_rules_grouped,statRt_tcounts,statRt_tcounts_grouped,statRt_totals,statTLR_ccounts,statTLR_rules,statTLR_rules_grouped,statTLR_tcounts,statTLR_tcounts_grouped,statTLR_totals,statTL_ccounts,statTL_rules,statTL_rules_grouped,statTL_tcounts,statTL_tcounts_grouped,statTL_totals,statTLtRt_ccounts,statTLtRt_rules,statTLtRt_rules_grouped,statTLtRt_tcounts,statTLtRt_tcounts_grouped,statTLtRt_totals,statTLt_ccounts,statTLt_rules,statTLt_rules_grouped,statTLt_tcounts,statTLt_tcounts_grouped,statTLt_totals,statTR_ccounts,statTR_rules,statTR_rules_grouped,statTR_tcounts,statTR_tcounts_grouped,statTR_totals,statTRt_ccounts,statTRt_rules,statTRt_rules_grouped,statTRt_tcounts,statTRt_tcounts_grouped,statTRt_totals,statT_ccounts,statT_rules,statT_rules_grouped,statT_tcounts,statT_tcounts_grouped,statT_totals      MULTI_QUERY,COMBINER    Message: Job failed!

Input(s):
Failed to read data from
"hdfs://pig1/user/dmunteanu/RuleProcess.xTCxi/rules"

Output(s):

Counters:
Blah-blah

I'm not sure why it complains about "failed to read data"; my best guess is
that it's because the job fails even before all mappers could be run.
The exact same script runs just fine with "no_multiquery", so the problem
has to come from the multiquery optimization.

Below is a sample of my script.
Basically, it groups by something and then:
* sums up all the counts for the members of the group
* computes, for all members of the group, counts-of-counts (i.e. how many
tuples in the group have the same count as the current tuple)
The example shows the computation for one group; this code is then
repeated (with different relation names) for the other groups.

-- compute totals
statT_rules = FOREACH merged_rules GENERATE root, count;
statT_rules_grouped = GROUP statT_rules BY root PARALLEL 30;
statT_totals = FOREACH statT_rules_grouped GENERATE FLATTEN(group), SUM(statT_rules.count) AS total;

-- compute truncated counts (tcounts) and counts-of-counts (ccounts)
statT_tcounts = FOREACH statT_rules GENERATE root, count, (count >= 5 ? 5 : count) as tcount;
statT_tcounts_grouped = GROUP statT_tcounts BY (root,tcount) PARALLEL 30;
statT_ccounts = FOREACH statT_tcounts_grouped GENERATE FLATTEN(group), COUNT(statT_tcounts) AS ccount;

-- join and print
statT_joined = JOIN statT_totals BY group, statT_ccounts BY root;
-- the join caused the root to appear twice (WHADJP,1L,WHADJP,1L,1L), get rid of the second
statT_joined_filtered = FOREACH statT_joined GENERATE statT_totals::group AS root, statT_totals::total AS total, statT_ccounts::group::tcount AS tcount, statT_ccounts::ccount AS ccount;
statT_joined_grouped = GROUP statT_joined_filtered BY (root,total) PARALLEL 30;
statT_joined_print = FOREACH statT_joined_grouped GENERATE FLATTEN(group), statT_joined_filtered.(tcount,ccount);

STORE statT_joined_print INTO 'RuleProcess.xTCxi.2/stats.root' using PigStorage;


Many thanks,
Dragos


On 2/18/11 1:08 PM, "Thejas M Nair" <te...@yahoo-inc.com> wrote:

> Hi Dragos,
> You might be facing this issue -
> https://issues.apache.org/jira/browse/PIG-1815, it has been resolved in pig
> 0.8 branch after the official release.
> We are likely to release a new 0.8 patch (pending discussion) with the
> fixes. Does your pig jar have this fix ?
> If not , can you please try building with
> http://svn.apache.org/repos/asf/pig/branches/branch-0.8 and try again with
> the new jar?
> 
> 
> 
> 
> On 2/18/11 12:26 PM, "Dragos Munteanu" <dm...@sdl.com> wrote:
> 
>> Hi all,
>> 
>> I have a Pig script that only runs if I turn on "-no_multiquery".
> 
> 
>  
>> 
>> My questions are:
>> - is it expected that Pig's multiquery execution would create enough of an
>> overhead that the execution should fail?
> 
> It is not expected to fail.
> 
>> - can someone explain (or point me to an explanation) of where the
>> multiquery overhead comes from? I'd really like to understand it
> 
> In case of multi-query you end up doing more computation per task, so an
> issue such as one PIG-1815 might not be causing failures in the non
> multiquery case. Also PIG-1815 is caused by physical plan copies not being
> freed and multi-query physical plan will be larger.
> 
>> - is there a better way to write the pig code to do that computation? Maybe
>> I can re-structure my computation, or configure my cluster differently? Or
>> am I stuck with a no_multiquery execution?
> 
> If your query does not work with latest from 0.8 branch, please let us know.
> -Thejas
> 


Re: Pig script only works with no_multiquery

Posted by Thejas M Nair <te...@yahoo-inc.com>.
Hi Dragos,
You might be facing this issue: https://issues.apache.org/jira/browse/PIG-1815.
It has been resolved on the pig 0.8 branch after the official release.
We are likely to release a new 0.8 patch release (pending discussion) with the
fixes. Does your pig jar have this fix?
If not, can you please try building from
http://svn.apache.org/repos/asf/pig/branches/branch-0.8 and try again with
the new jar?




On 2/18/11 12:26 PM, "Dragos Munteanu" <dm...@sdl.com> wrote:

> Hi all,
> 
> I have a Pig script that only runs if I turn on "-no_multiquery".


 
> 
> My questions are:
> - is it expected that Pig's multiquery execution would create enough of an
> overhead that the execution should fail?

It is not expected to fail.

> - can someone explain (or point me to an explanation) of where the
> multiquery overhead comes from? I'd really like to understand it

In the multi-query case you end up doing more computation per task, so an
issue such as the one in PIG-1815 might not cause failures in the
non-multiquery case. Also, PIG-1815 is caused by physical plan copies not
being freed, and the multi-query physical plan is larger.
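
If you want to see how much larger the merged plan is, you can also look at the
explain output for the whole script and compare it with a run under
-no_multiquery. Something like the following (I am going from memory on the
grunt options, and 'myscript.pig' is a placeholder):

grunt> explain -script myscript.pig -out /tmp/mq_plan;
-- start grunt with 'pig -no_multiquery' and run the same explain to see the un-merged plans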

> - is there a better way to write the pig code to do that computation? Maybe
> I can re-structure my computation, or configure my cluster differently? Or
> am I stuck with a no_multiquery execution?

If your query does not work with the latest from the 0.8 branch, please let us know.
-Thejas