Posted to user@pig.apache.org by Greg Langmead <gl...@sdl.com> on 2011/01/26 23:49:13 UTC

Perusing the input data to a failed map task

Pig 0.8 executes my script by running six jobs. One of them is identified as
"MAP_ONLY" and it always fails, with the innermost error I can find either
saying "GC overhead limit exceeded" or "Java heap space". I suspect I have a
piece that is too large. How can I get my hands on the actual data it was
processing, so I can ascertain the cause? The task log says "Input records
from tmp1872359169". Can I view that data?

Thanks,

Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310 437 7300

www.sdl.com/innovate

SDL PLC confidential, all rights reserved.
If you are not the intended recipient of this mail SDL requests and requires that you delete it without acting upon or copying any of its contents, and we further request that you advise us.
SDL PLC is a public limited company registered in England and Wales. Registered number: 02675207.
Registered address: Globe House, Clivemont Road, Maidenhead, Berkshire SL6 7DY, UK.

Re: Perusing the input data to a failed map task

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
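How about passing the input through a logging UDF (a basic UDF that just
echoes its input to stderr)? The stderr log of each task attempt would then
show the records that attempt received.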
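
A minimal sketch of such a logging UDF, assuming the Jython (Python) UDF support that arrived in Pig 0.8. The function name log_and_echo and the file name echo_udf.py are made up for illustration, and the code sticks to syntax that older Jython releases accept. It writes each value to stderr and returns it unchanged, so the stderr log of a task attempt should end up listing what that attempt received:

```python
import sys


def log_and_echo(value):
    """Echo one input value to stderr and hand it back unchanged."""
    # %r shows the value unambiguously (quotes, tabs, nested tuples).
    sys.stderr.write("INPUT: %r\n" % (value,))
    return value
```

Under those assumptions, the script would register it with something like register 'echo_udf.py' using jython as dbg; and wrap a field as dbg.log_and_echo(field) in a FOREACH ... GENERATE. Without a declared output schema Pig is likely to treat the result as a bytearray, so a cast may be needed downstream. Logging every record is expensive, so this is best left in only while debugging.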

On Thu, Jan 27, 2011 at 8:46 AM, Greg Langmead <gl...@sdl.com> wrote:

> Thank you, Dmitriy. I do see which relations my map-only job was working on,
> but how do I see which subset of the data a given piece of that Map job was
> working on, e.g. attempt_201101201235_0064_m_000243_0
>
> If I save the input data by storing it before the Map job runs, I will still
> have the conundrum of identifying which subset of it went to piece 243,
> unless I'm misunderstanding.
>
> Greg

Re: Perusing the input data to a failed map task

Posted by Greg Langmead <gl...@sdl.com>.
Thank you, Dmitriy. I do see which relations my map-only job was working on,
but how do I see which subset of the data a given piece of that Map job was
working on, e.g. attempt_201101201235_0064_m_000243_0?

If I save the input data by storing it before the Map job runs, I will still
have the conundrum of identifying which subset of it went to piece 243,
unless I'm misunderstanding.

Greg

On 1/26/11 6:23 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

> Greg,
> Pig 8 tells you which job is responsible for which set of operators; you can
> save all the inputs to the map only job by inserting intermediate stores,
> and debug just the map-only job.
> 
> D


Re: Perusing the input data to a failed map task

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Greg,
Pig 0.8 tells you which job is responsible for which set of operators; you can
save all the inputs to the map-only job by inserting intermediate stores,
and debug just the map-only job.

D
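
Once the inputs to the map-only job have been stored, a quick scan for unusually large records can test the "piece that is too large" theory. A minimal sketch, assuming the intermediate store was written as text (one record per line) and copied to the local filesystem; largest_records and the file name below are hypothetical:

```python
import heapq


def largest_records(lines, n=5):
    """Return (length, line_number) pairs for the n longest lines, longest first."""
    # Lengths are in characters and include the trailing newline, if any.
    sized = ((len(line), number) for number, line in enumerate(lines, start=1))
    return heapq.nlargest(n, sized)
```

For example, largest_records(open("stored_input.txt")) would report the five longest lines with their line numbers, without holding the whole file in memory.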

On Wed, Jan 26, 2011 at 2:49 PM, Greg Langmead <gl...@sdl.com> wrote:

> Pig 0.8 executes my script by running six jobs. One of them is identified as
> "MAP_ONLY" and it always fails, with the innermost error I can find either
> saying "GC overhead limit exceeded" or "Java heap space". I suspect I have a
> piece that is too large. How can I get my hands on the actual data it was
> processing, so I can ascertain the cause? The task log says "Input records
> from tmp1872359169" can I view that data?