Posted to yarn-dev@hadoop.apache.org by lohit <lo...@gmail.com> on 2013/01/05 21:44:49 UTC

BIG jobs on YARN

Hi Devs,

Has anyone seen issues when running big jobs on YARN?
I am trying a 10 TB terasort where the input is 3-way replicated. This
generates job.split and job.splitmetainfo files of more than 10 MB. I see
that the first container launched crashes without leaving any error files.
Debugging a little, I see that the job.jar symlink is not created properly,
which is strange.
If I try the same 10 TB terasort with the input replicated one way, the job
runs fine. job.split and job.splitmetainfo are much smaller in that case,
which makes me believe there is some kind of limit I might be hitting.
I tried to set mapreduce.job.split.metainfo.maxsize to 100M, but that did
not help.
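For reference, this is roughly the invocation (the example jar name and
HDFS paths are placeholders for my actual setup; 104857600 is 100 MB in
bytes):

  # sketch only: jar name and paths are illustrative
  hadoop jar hadoop-mapreduce-examples.jar terasort \
      -Dmapreduce.job.split.metainfo.maxsize=104857600 \
      /terasort/input /terasort/output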
Any experience running big jobs, and any related configs you use?

-- 
Have a Nice Day!
Lohit

Re: BIG jobs on YARN

Posted by lohit <lo...@gmail.com>.
Digging a little further, I saw that the problem was with the config
mapreduce.jobtracker.split.metainfo.maxsize.
In the 2.0 documentation that config is listed as
mapreduce.*job*.split.metainfo.maxsize,
while the code refers to mapreduce.jobtracker.split.metainfo.maxsize.
After setting mapreduce.jobtracker.split.metainfo.maxsize to a higher value
I could get the job running.
I will open a JIRA for this.
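For anyone else who hits this, a sketch of the workaround (the value is
illustrative; the property takes a size in bytes, and as far as I can tell
any value <= 0, e.g. -1, disables the check entirely):

  # workaround sketch: use the jobtracker-prefixed name the code actually reads
  hadoop jar hadoop-mapreduce-examples.jar terasort \
      -Dmapreduce.jobtracker.split.metainfo.maxsize=104857600 \
      /terasort/input /terasort/output

  # or disable the limit altogether with:
  #   -Dmapreduce.jobtracker.split.metainfo.maxsize=-1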

-- 
Have a Nice Day!
Lohit

Re: BIG jobs on YARN

Posted by Lohit <lo...@gmail.com>.
It is easily reproducible. Generate 10 TB of input data using teragen (replication 3) and try to run terasort on that input. The first container fails without any information in the logs, and the job fails.
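A sketch of the repro (jar name and paths are illustrative; teragen writes
100-byte rows, so 100,000,000,000 rows is 10 TB):

  # generate 10 TB of input; dfs.replication=3 just makes the usual
  # HDFS default replication factor explicit
  hadoop jar hadoop-mapreduce-examples.jar teragen \
      -Ddfs.replication=3 100000000000 /terasort/input

  # sorting it is what kills the first (AM) container
  hadoop jar hadoop-mapreduce-examples.jar terasort \
      /terasort/input /terasort/output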

Lohit

Re: BIG jobs on YARN

Posted by Robert Evans <ev...@yahoo-inc.com>.
We have run some very large jobs on top of YARN, but have not run into
this issue yet.  The fact that the job.jar was not symlinked correctly
makes me think this is a YARN distributed cache issue and not really an
input split issue.  How reproducible is this?  Does it happen every time
you run the job, or did it just happen once?  Could you take a look at the
node manager logs to see if anything shows issues while launching?  Sadly
the node manager does not log everything when downloading the application
and private distributed caches, so there could be an error in there where
it did not create the symlink and failed to fail :).
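Something along these lines on the node that launched the failed container
might turn something up (the log location varies by install, so treat the
path as a placeholder):

  # look for localization/symlink trouble in the node manager log
  grep -iE 'localiz|job\.jar|symlink' \
      /var/log/hadoop-yarn/yarn-*-nodemanager-*.log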

--Bobby
