Posted to user@pig.apache.org by Raghu Rajagopalan <ra...@gmail.com> on 2008/07/01 21:13:04 UTC

Map tasks don't complete

Hi,
I wrote a small Pig script with a couple of functions, and it works
fine in local mode.
However, when I run it on a Hadoop cluster against a 4 GB file (an
Apache access log), the job is submitted successfully and the input is
split into 66 map tasks (64 MB chunk size). On my cluster of 10
machines, the first 10 maps start, but they do not seem to terminate
(progress goes to 1200% on the Hadoop MapReduce tasks). I don't see
anything untoward in the logs either.
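
As a quick sanity check on those numbers (plain Java; the exact file
size is an assumption, since the post only says roughly 4 GB):

    public class SplitCountCheck {
        public static void main(String[] args) {
            // Assumed figures: a file of about 4.2 GB and the 64 MB DFS block size.
            long fileSizeBytes = 4200L * 1024 * 1024;   // ~4.2 GB (assumed)
            long blockSizeBytes = 64L * 1024 * 1024;    // 64 MB
            long splits = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
            System.out.println("Expected map tasks: " + splits);  // prints 66
        }
    }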

On the command line, Pig's progress output continues indefinitely.

The Pig script and the functions it refers to are attached. I'm
wondering if anyone has seen anything similar, and what steps are
needed to fix this.

CsvLogStorage.java - load function that uses opencsv to parse the Apache access log
REGEX.java - regex splitter that applies a given regex and outputs a tuple (a rough, hypothetical sketch of this kind of splitting follows this list)
SPLITDATE.java - parses a date and outputs a tuple with the given date parts
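
The attached sources aren't reproduced here, so purely as a hedged
illustration of the kind of regex splitting described above, here is a
plain-Java sketch (the class, method, and sample log line are
hypothetical, not the attached code, and the Pig UDF wrapper is
omitted):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Hypothetical stand-in for the regex-splitting logic described above. */
    public class RegexSplitSketch {
        /** Applies the given pattern and returns its capture groups as the would-be tuple fields. */
        static List<String> split(String line, String regex) {
            Matcher m = Pattern.compile(regex).matcher(line);
            List<String> fields = new ArrayList<String>();
            if (m.matches()) {
                for (int i = 1; i <= m.groupCount(); i++) {
                    fields.add(m.group(i));   // one field per capture group
                }
            }
            return fields;                    // empty when the line does not match
        }

        public static void main(String[] args) {
            // Split a common-log-format entry into IP, timestamp, request, status and size.
            String entry = "127.0.0.1 - - [01/Jul/2008:12:00:00 -0700] \"GET /index.html HTTP/1.1\" 200 2326";
            System.out.println(split(entry,
                    "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$"));
        }
    }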

My guess is that there's something wrong with the way the custom load
function is written.
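
One thing worth double-checking in a hand-written load function is the
split-boundary handling: if getNext() keeps returning records after the
reader has moved past the end of its split, every map task effectively
re-reads its neighbours' data, and since Hadoop reports a file split's
progress as bytes read over split length, that is also one way to end
up with figures like 1200%. The sketch below shows only that boundary
rule, in plain Java with made-up names; it is not wired to Pig's actual
LoadFunc interface:

    import java.io.BufferedReader;
    import java.io.IOException;

    /**
     * Hypothetical illustration (not the attached CsvLogStorage): the boundary
     * rule a split-aware reader needs so adjacent map tasks don't re-read each
     * other's data.
     */
    public class SplitBoundarySketch {
        private final BufferedReader in; // assumed positioned at the start of this split
        private final long end;          // byte offset where this split ends
        private long pos;                // current offset within the file

        public SplitBoundarySketch(BufferedReader in, long start, long end) {
            this.in = in;
            this.pos = start;
            this.end = end;
        }

        /**
         * Returns the next line, or null once the reader has crossed the split
         * boundary. A record that starts at or before 'end' is still read in
         * full; the task simply stops asking for more records afterwards.
         */
        public String getNext() throws IOException {
            if (pos > end) {
                return null;             // past the split: signal end-of-input so the map can finish
            }
            String line = in.readLine();
            if (line == null) {
                return null;             // real end of file
            }
            pos += line.length() + 1;    // +1 for the newline; a rough byte count, fine for a sketch
            return line;
        }
    }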

My setup:
Hadoop 0.17
Pig.jar from the pigtutorial.tar.gz on the wiki.

Thanks for looking.
Raghu

Re: Map tasks don't complete

Posted by Raghu Rajagopalan <ra...@gmail.com>.
Okay, I tried some more tests with portions of the same input file:

1. With a 10 MB input file - 1 map task (based on the DFS block size of
64 MB) - Pig + Hadoop ran successfully. All map tasks completed when
they reached 100%. Normal, expected behavior.

2. With a 300 MB input file - 5 map tasks - the job ran successfully,
BUT in many instances the map tasks wouldn't complete on reaching
100%; left to continue for some time, though, they were eventually
marked complete. On the console, Pig's progress output appears stuck
for quite some time and then eventually moves forward as the map tasks
complete. Overall job execution took about 20 minutes.

I'm quite okay with digging through the source to find out what gives,
but I don't know where to start poking. Any help getting me off the
ground would be great.

Thanks!
Raghu

On Tue, Jul 1, 2008 at 12:13 PM, Raghu Rajagopalan
<ra...@gmail.com> wrote:
> Hi,
> I wrote a small Pig script with a couple of functions, and it works
> fine in local mode.
> However, when I run it on a Hadoop cluster against a 4 GB file (an
> Apache access log), the job is submitted successfully and the input is
> split into 66 map tasks (64 MB chunk size). On my cluster of 10
> machines, the first 10 maps start, but they do not seem to terminate
> (progress goes to 1200% on the Hadoop MapReduce tasks). I don't see
> anything untoward in the logs either.
>
> On the command line, Pig's progress output continues indefinitely.
>
> The Pig script and the functions it refers to are attached. I'm
> wondering if anyone has seen anything similar, and what steps are
> needed to fix this.
>
> CsvLogStorage.java - load function that uses opencsv to parse the Apache access log
> REGEX.java - regex splitter that applies a given regex and outputs a tuple
> SPLITDATE.java - parses a date and outputs a tuple with the given date parts
>
> My guess is that there's something wrong with the way the custom load
> function is written.
>
> My setup:
> Hadoop 0.17
> Pig.jar from the pigtutorial.tar.gz on the wiki.
>
> Thanks for looking.
> Raghu
>