Posted to user@pig.apache.org by jr <jo...@io-consulting.net> on 2010/03/26 12:43:14 UTC

starting a job

Hello everybody,
I've noticed that when I run some Pig scripts, the creation of the
actual Hadoop jobs takes quite a while, sometimes more than 15 minutes
until the first map/reduce job starts.
How can I speed this up? Which machine does that work, and what do I
have to throw at it? Is it the Pig client machine that needs more beef?
Thanks for your answers,
Johannes


Re: starting a job

Posted by Jeff Zhang <zj...@gmail.com>.
Something must be wrong with your environment; 15 minutes is far too long.



On Fri, Mar 26, 2010 at 7:43 PM, jr <jo...@io-consulting.net> wrote:

> Hello everybody,
> I've noticed that when i run some pig scripts, the creation of the
> actual hadoop jobs takes quite a while, sometimes more than 15 minutes
> until the first map/reduce job starts.
> How can I accelerate this? Which machine does that and what do i have to
> throw at it? Is it the pig client machine that needs more beef?
> Thanks for your answers,
> Johannes
>
>


-- 
Best Regards

Jeff Zhang

Re: starting a job

Posted by Scott Carey <sc...@richrelevance.com>.
If you are in a situation where the system seems to be sitting there for a long time, get a stack dump and check the CPU and I/O stats of the main components.
For Java apps you can get a stack dump with 'kill -3' on the process (it won't stop the process), and the dump will go to stdout.  Alternatively, you can use the 'jstack' tool in the JDK, but that tends to be less reliable and fails in some environments.

The stack dump on the Pig client will tell us what it is doing in that time frame -- for example, whether it is waiting on the namenode or jobtracker.

Additionally, during such a wait, was there high CPU use or I/O on the client, namenode, or jobtracker?
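
As a rough sketch of the steps above (assuming the JDK's jps/jstack are on the PATH; the "Main" main-class name used to find the Pig client JVM is an assumption and may differ per install):

```shell
# Find the Pig client JVM's pid; jps lists running JVMs by main class.
PIG_PID=$(jps 2>/dev/null | awk '/Main/ {print $1; exit}')

# SIGQUIT (kill -3) makes the JVM print a full thread dump to *its* stdout
# and keeps the process running:
#   kill -3 "$PIG_PID"
# jstack writes the same dump to your stdout instead (less reliable on
# some environments):
#   jstack "$PIG_PID" > pig-threads.txt

# If the "main" thread's frames sit in Hadoop's RPC client classes, the
# Pig client is blocked waiting on the namenode or jobtracker:
count_ipc_frames() { grep -c 'org\.apache\.hadoop\.ipc' "$1"; }
```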



On Mar 26, 2010, at 10:25 AM, jr wrote:

> Hello Dmitriy,
> In those 15 minutes nothing happens, the job doesn't show up in the job
> tracker at all, but once it does, processing starts immediately.
> Unfortunately i don't have the logs (the ec2 cluster was terminated
> after the job was finished.
> I attached the pig script to this mail.
> I'll probably have to run the script in the next week again, if i do
> i'll keep the logs and send it to this list :)
> Thanks!
> Johannes
> 
> 
> On Friday, 2010-03-26 at 10:04 -0700, Dmitriy Ryaboy wrote:
>> One more possibility is that although the cluster is underloaded, your
>> scheduler constraints prevent you from starting new jobs (there can be a cap
>> of # of jobs per user or per work pool). However, if you start a new cluster
>> just for this task, that shouldn't be the problem.  What is happening in
>> those 15 minutes -- does the job show up on the job tracker console with all
>> the tasks "pending", or does it not even show up there?
>> Can you share a script that reproduces the issue, and the corresponding
>> logs?
>> 
>> -D
> 
> <group.pig>


Re: starting a job

Posted by Benjamin Reed <br...@yahoo-inc.com>.
There are two big sources of delay when submitting a map/reduce job:

1) generating the locality information for the input splits. This
involves contacting the namenode and can take some time if the input
is really large or made up of many files.
2) uploading the job jar file.

Both of these are done at the client. I'm guessing you might be
hitting 2), since you have a couple of jars you are including with your
job. Check your bandwidth between the client and EC2 (or is your client
part of EC2?) and the size of the jar files you are uploading.

ben
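
To put a rough number on 2), a back-of-the-envelope sketch (the bandwidth figure is yours to fill in; nothing here is Pig-specific):

```shell
# Rough seconds to ship a job jar, given its size in bytes and the uplink
# bandwidth to EC2 in Mbit/s. Integer arithmetic, so a coarse estimate.
upload_secs() {
  bytes=$1
  mbits_per_sec=$2
  echo $(( (bytes * 8) / (mbits_per_sec * 1000000) ))
}

# To see what you are actually shipping:
#   wc -c *.jar
```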

On 03/26/2010 10:40 AM, Dmitriy Ryaboy wrote:
> That script is very straightforward. I don't see any reason this would be
> hanging before getting to hadoop. Could you have been experiencing some
> network issues or something of that sort? I am not very experienced with EC2
> issues, and this sounds more like an underlying system problem than a Pig
> problem. More logs may of course prove me wrong :).
>
> -D
>
> On Fri, Mar 26, 2010 at 10:25 AM, jr <jo...@io-consulting.net> wrote:
>
>> Hello Dmitriy,
>> In those 15 minutes nothing happens, the job doesn't show up in the job
>> tracker at all, but once it does, processing starts immediately.
>> Unfortunately i don't have the logs (the ec2 cluster was terminated
>> after the job was finished.
>> I attached the pig script to this mail.
>> I'll probably have to run the script in the next week again, if i do
>> i'll keep the logs and send it to this list :)
>> Thanks!
>> Johannes
>>
>> On Friday, 2010-03-26 at 10:04 -0700, Dmitriy Ryaboy wrote:
>>> One more possibility is that although the cluster is underloaded, your
>>> scheduler constraints prevent you from starting new jobs (there can be a cap
>>> of # of jobs per user or per work pool). However, if you start a new cluster
>>> just for this task, that shouldn't be the problem.  What is happening in
>>> those 15 minutes -- does the job show up on the job tracker console with all
>>> the tasks "pending", or does it not even show up there?
>>> Can you share a script that reproduces the issue, and the corresponding
>>> logs?
>>>
>>> -D


Re: starting a job

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
That script is very straightforward. I don't see any reason this would be
hanging before getting to hadoop. Could you have been experiencing some
network issues or something of that sort? I am not very experienced with EC2
issues, and this sounds more like an underlying system problem than a Pig
problem. More logs may of course prove me wrong :).

-D

On Fri, Mar 26, 2010 at 10:25 AM, jr <jo...@io-consulting.net> wrote:

> Hello Dmitriy,
> In those 15 minutes nothing happens, the job doesn't show up in the job
> tracker at all, but once it does, processing starts immediately.
> Unfortunately i don't have the logs (the ec2 cluster was terminated
> after the job was finished.
> I attached the pig script to this mail.
> I'll probably have to run the script in the next week again, if i do
> i'll keep the logs and send it to this list :)
> Thanks!
> Johannes
>
>
> On Friday, 2010-03-26 at 10:04 -0700, Dmitriy Ryaboy wrote:
> > One more possibility is that although the cluster is underloaded, your
> > scheduler constraints prevent you from starting new jobs (there can be a
> cap
> > of # of jobs per user or per work pool). However, if you start a new
> cluster
> > just for this task, that shouldn't be the problem.  What is happening in
> > those 15 minutes -- does the job show up on the job tracker console with
> all
> > the tasks "pending", or does it not even show up there?
> > Can you share a script that reproduces the issue, and the corresponding
> > logs?
> >
> > -D
>
>

Re: starting a job

Posted by jr <jo...@io-consulting.net>.
Hello Dmitriy,
In those 15 minutes nothing happens; the job doesn't show up in the job
tracker at all, but once it does, processing starts immediately.
Unfortunately I don't have the logs (the EC2 cluster was terminated
after the job was finished).
I attached the Pig script to this mail.
I'll probably have to run the script again next week; if I do, I'll
keep the logs and send them to this list :)
Thanks!
Johannes


On Friday, 2010-03-26 at 10:04 -0700, Dmitriy Ryaboy wrote:
> One more possibility is that although the cluster is underloaded, your
> scheduler constraints prevent you from starting new jobs (there can be a cap
> of # of jobs per user or per work pool). However, if you start a new cluster
> just for this task, that shouldn't be the problem.  What is happening in
> those 15 minutes -- does the job show up on the job tracker console with all
> the tasks "pending", or does it not even show up there?
> Can you share a script that reproduces the issue, and the corresponding
> logs?
> 
> -D


Re: starting a job

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
One more possibility is that although the cluster is underloaded, your
scheduler constraints prevent you from starting new jobs (there can be a cap
of # of jobs per user or per work pool). However, if you start a new cluster
just for this task, that shouldn't be the problem.  What is happening in
those 15 minutes -- does the job show up on the job tracker console with all
the tasks "pending", or does it not even show up there?
Can you share a script that reproduces the issue, and the corresponding
logs?

-D

On Fri, Mar 26, 2010 at 8:53 AM, jr <jo...@io-consulting.net> wrote:

> Hi Ashutosh!
> Thanks for the very verbose answer,
> I'll have to say, neither of this is true!
> The pig script is a really, really simple one, about 30 lines of pig
> script.
> the cluster is started one amazon ec2 only for this job, so it's
> actually sitting there, idle.
> the dataset is not "really large", at least not in hadoop/pig terms.
> It's only about 20GB of gzip compressed data in about 140 files. it was
> about 300 maps that were created.
> I've read a few times that hadoop is much happier with big files, so
> I'll try to cat them all together and try again. Could this be the
> issue?
> Thanks!
> Johannes
>
>
> On Friday, 2010-03-26 at 07:52 -0700, Ashutosh Chauhan wrote:
> > Between the point you submit a script to Pig  to the point where MR
> > job starts executing on a cluster, following are three things that may
> > take a while depending on whats affecting you.
> >
> > 1) Your cluster is heavily loaded and job tracker is busy dealing with
> > other jobs. In which case jobtracker wont schedule job just submitted
> > right away. This can be alleviated by tweaking the scheduling policies
> > of job tracker.
> >
> > 2) You are working with really large datasets (tens of thousands of
> > splits). In this case input split calculation which happens on client
> > machine may take a long while.
> >
> > 3) Your pig script is quite large (tens of thousands of lines).
> > Currently Pig takes a bit of time to compile very large scripts.
> >
> > Depending on your situation, you might be hitting one of these issues.
> > Or, there is some new issue which we will discover now :)
> >
> > Ashutosh
> >
> > On Fri, Mar 26, 2010 at 04:43, jr <jo...@io-consulting.net>
> wrote:
> > > Hello everybody,
> > > I've noticed that when i run some pig scripts, the creation of the
> > > actual hadoop jobs takes quite a while, sometimes more than 15 minutes
> > > until the first map/reduce job starts.
> > > How can I accelerate this? Which machine does that and what do i have
> to
> > > throw at it? Is it the pig client machine that needs more beef?
> > > Thanks for your answers,
> > > Johannes
> > >
> > >
>
>

Re: starting a job

Posted by jr <jo...@io-consulting.net>.
Hi Ashutosh!
Thanks for the very detailed answer.
I have to say, none of these is true!
The Pig script is a really, really simple one, about 30 lines of Pig.
The cluster is started on Amazon EC2 only for this job, so it's
actually sitting there, idle.
The dataset is not "really large", at least not in Hadoop/Pig terms.
It's only about 20GB of gzip-compressed data in about 140 files; about
300 maps were created.
I've read a few times that Hadoop is much happier with big files, so
I'll try to cat them all together and try again. Could this be the
issue?
Thanks!
Johannes
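
For what it's worth, catting the files together is safe for gzip specifically: the format allows multiple members back to back, so concatenating .gz files yields one valid .gz stream without recompressing. A sketch (whether fewer, larger gzip files actually help the job is the thing to test):

```shell
# Merge many small gzip part files into one larger, still-valid gzip file.
# Usage: merge_gz OUT.gz IN1.gz IN2.gz ...
merge_gz() {
  out=$1; shift
  cat "$@" > "$out"
}
```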


On Friday, 2010-03-26 at 07:52 -0700, Ashutosh Chauhan wrote:
> Between the point you submit a script to Pig  to the point where MR
> job starts executing on a cluster, following are three things that may
> take a while depending on whats affecting you.
> 
> 1) Your cluster is heavily loaded and job tracker is busy dealing with
> other jobs. In which case jobtracker wont schedule job just submitted
> right away. This can be alleviated by tweaking the scheduling policies
> of job tracker.
> 
> 2) You are working with really large datasets (tens of thousands of
> splits). In this case input split calculation which happens on client
> machine may take a long while.
> 
> 3) Your pig script is quite large (tens of thousands of lines).
> Currently Pig takes a bit of time to compile very large scripts.
> 
> Depending on your situation, you might be hitting one of these issues.
> Or, there is some new issue which we will discover now :)
> 
> Ashutosh
> 
> On Fri, Mar 26, 2010 at 04:43, jr <jo...@io-consulting.net> wrote:
> > Hello everybody,
> > I've noticed that when i run some pig scripts, the creation of the
> > actual hadoop jobs takes quite a while, sometimes more than 15 minutes
> > until the first map/reduce job starts.
> > How can I accelerate this? Which machine does that and what do i have to
> > throw at it? Is it the pig client machine that needs more beef?
> > Thanks for your answers,
> > Johannes
> >
> >


Re: starting a job

Posted by Ashutosh Chauhan <as...@gmail.com>.
Between the point you submit a script to Pig and the point where the MR
job starts executing on a cluster, the following three things may take
a while, depending on what's affecting you.

1) Your cluster is heavily loaded and the job tracker is busy dealing with
other jobs, in which case the jobtracker won't schedule a just-submitted
job right away. This can be alleviated by tweaking the scheduling policies
of the job tracker.

2) You are working with really large datasets (tens of thousands of
splits). In this case the input split calculation, which happens on the
client machine, may take a long while.

3) Your Pig script is quite large (tens of thousands of lines).
Currently Pig takes a bit of time to compile very large scripts.

Depending on your situation, you might be hitting one of these issues.
Or there is some new issue which we will discover now :)
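
For 1), as a sketch of the scheduling-policy tweak (property names are from the 0.20-era fair scheduler contrib; verify against your Hadoop version), switching the jobtracker to the fair scheduler in mapred-site.xml looks like:

```xml
<!-- mapred-site.xml: use the fair scheduler so small jobs are not starved
     behind long-running ones -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```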

Ashutosh

On Fri, Mar 26, 2010 at 04:43, jr <jo...@io-consulting.net> wrote:
> Hello everybody,
> I've noticed that when i run some pig scripts, the creation of the
> actual hadoop jobs takes quite a while, sometimes more than 15 minutes
> until the first map/reduce job starts.
> How can I accelerate this? Which machine does that and what do i have to
> throw at it? Is it the pig client machine that needs more beef?
> Thanks for your answers,
> Johannes
>
>