Posted to user@pig.apache.org by "Wickham, Jeremy" <JW...@weather.com> on 2010/01/15 16:43:17 UTC

FW: Pig script very slow to start actual processing


Hello fellow Pig users.  I am new to both Hadoop and Pig, with a
background in relational databases and Perl scripting.  Yesterday I ran a
fairly simple Pig script that finished in around 45 minutes on our new
10-node cluster, processing approximately 630 GB of raw data.  Around
1 to 2 minutes after submitting the job, I could see the map/reduce
processes running on the data node machines, and the % done count began
to increment in the grunt shell.  Today, I ran the exact same Pig script
against the exact same dataset.  However, this time I saw no activity on
the data nodes for over 50 minutes.  The script sat at 0% complete for
those 50 minutes, and only then did I see processes on the data nodes.
From that point, the script completed in around 45 minutes, just as it
did the day before.  I am the only user of the system, and no other jobs
were running at the time.  I have also noticed that a simple 'ls'
on a directory from grunt takes much, much longer (as in many orders of
magnitude) to return the list of files than 'hadoop fs -ls' on
the same directory.
 
The only thing that changed between yesterday and today was that I loaded
additional data into HDFS (another 680 GB), but that data was not
processed by the Pig script in question, as it was loaded into a
different directory path.  I have seen this same 'sit and wait' behavior
from Pig on a 4-node test cluster I was using prior to the current
cluster.  Any ideas what is going on here?  I am using Hadoop 0.20.1 and
Pig 0.5.

PS.  My emails keep getting rejected by the list server as 'spam', and I
have to keep editing them until one finally goes through.  Can anything
be done about that?





Re: Pig script very slow to start actual processing

Posted by Alan Gates <ga...@yahoo-inc.com>.
Jeremy,

Usually the mails get bounced when the sender isn't a subscriber to
pig-user.

We usually see this sit-and-wait behavior when other jobs are running
and there are no slots open on the cluster.  If you see this behavior
again, can you look at the job tracker GUI?  It will tell you how many
slots you have open for jobs at the moment and show any currently
running jobs.  It's possible your cluster was having network issues and
thus had no open slots, or had gone into a wait mode for some reason.

Alan.
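
To put numbers on the gap between 'ls' in grunt and 'hadoop fs -ls'
described in the original message, a small timing helper may be useful.
The following is only a sketch: the function names time_command and
compare are made up for illustration, and the Pig invocation shown in
the trailing comment assumes that the installed Pig build accepts -e to
run a single grunt command, which should be checked against 'pig -help'
first.

```python
import subprocess
import time


def time_command(argv, timeout=None):
    """Run argv, discard its output, and return (elapsed_seconds, exit_code)."""
    start = time.monotonic()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return time.monotonic() - start, completed.returncode


def compare(commands):
    """Time each labelled command once; return {label: (seconds, exit_code)}."""
    return {label: time_command(argv) for label, argv in commands.items()}


# Hypothetical usage on a machine with the Hadoop and Pig clients installed
# (the directory path is a placeholder, and "pig -e" is an assumption):
#
#   results = compare({
#       "hadoop fs -ls": ["hadoop", "fs", "-ls", "/some/dir"],
#       "pig -e ls": ["pig", "-e", "ls /some/dir"],
#   })
#   for label, (seconds, code) in results.items():
#       print(f"{label}: {seconds:.1f}s (exit {code})")
```

Running the same comparison before and after loading more data into HDFS
would show whether the delay grows with the amount of data stored, which
is one thing the original report hints at but does not establish.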
