Posted to user@pig.apache.org by jagaran das <ja...@yahoo.co.in> on 2012/02/13 06:36:19 UTC

Fw: Hadoop Cluster Question



----- Forwarded Message -----
From: jagaran das <ja...@yahoo.co.in>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org> 
Sent: Sunday, 12 February 2012 9:33 PM
Subject: Hadoop Cluster Question
 

Hi,
A. If one of the slave nodes' local disk space is full in a cluster:

1. Would an already running Pig job fail?
2. Would any newly started Pig job fail?
3. How would the Hadoop cluster behave? Would that become a dead node?

B. In our production cluster we are seeing one of the slave nodes being
more utilized than the others.
By utilization I mean its %DFS used is always higher. How can we balance it?

Thanks,
Jagaran

Re: Fw: Hadoop Cluster Question

Posted by Prashant Kommireddi <pr...@gmail.com>.
Wasn't your initial requirement different? You mentioned "seconddir" had a
different schema from "firstdir", in which case simply loading both
together and grouping by (A,B,C,D) will produce unexpected results.

If you can make sure both datasets have the same schema, yes THAT would be
better.
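
For instance, a minimal sketch of that schema-aligned approach, assuming
tab-delimited input and that seconddir's fields really do arrive in the
order A,C,D,E,B (the relation names here are only illustrative):

   firstdata  = load '/home/hadoop/test/firstdir' using PigStorage('\t') as (a,b,c,d,e);
   seconddata = load '/home/hadoop/test/seconddir' using PigStorage('\t') as (a,c,d,e,b);
   -- reorder seconddir's fields by name so both relations share one schema
   aligned    = foreach seconddata generate a, b, c, d, e;
   combined   = union firstdata, aligned;
   grouped    = group combined by (a,b,c,d);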

On Mon, Feb 13, 2012 at 11:54 AM, jagaran das <ja...@yahoo.co.in> wrote:

> Thanks
>
> Best would be then:
> A = load '/home/hadoop/{test/firstdir,test/seconddir}' using
> PigStorage('\t') as (A,B,C,D);
> B = group A by (A,B,C,D);
>
> Ignore E while loading, and make sure both the first and second
> directories have the fields in the same order: A B C D.
>
> Thanks
> Jagaran
>   ------------------------------
> From: Prashant Kommireddi <pr...@gmail.com>
> To: jagaran das <ja...@yahoo.co.in>
> Sent: Monday, 13 February 2012 11:36 AM
>
> Subject: Re: Fw: Hadoop Cluster Question
>
> I can suggest a dirty hack for this:
>
>    A = load 'firstdir' as (a,b,c,d,e);
>    B = load 'seconddir'; -- fields arrive in the order a,c,d,e,b
>    -- reorder seconddir's columns by position to match firstdir's schema
>    C = foreach B generate $0 as a, $4 as b, $1 as c, $2 as d, $3 as e;
>    D = union A, C;
>    E = group D by (a,b,c,d);
>
> Thanks,
> Prashant
>
>
> On Mon, Feb 13, 2012 at 11:22 AM, jagaran das <ja...@yahoo.co.in> wrote:
>
> Hi,
>
> I have a requirement in Pig where I have to read from two different
> directories, but the ordering of the fields is different.
>
> A = load '/home/hadoop/{test/firstdir,test/seconddir}' using
> PigStorage('\t') as (A,B,C,D,E);
> B = group A by (A,B,C,D);
>
> Now firstdir has the fields in order A B C D E, but the second dir has
> the data in order A,C,D,E,B.
>
> Is there any way to read them together, given that my group by clause
> contains (A,B,C,D)?
>
> Thanks
> Jagaran

Re: Fw: Hadoop Cluster Question

Posted by Prashant Kommireddi <pr...@gmail.com>.
   1. Yes, the job would fail.
   2. Yes, any new job would fail until local disk space is made available.
   3. If there are too many failures from a particular node, that node
   would be blacklisted.

Is that slave node being more utilized due to a particular job, or is it
just a general phenomenon?

Take a look at
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_user_guide.html#Rebalancer
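
For reference, on 0.20-era clusters the rebalancer is typically started
from the Hadoop home directory along these lines (the 10% threshold is
only an illustrative value; tune it for your cluster):

   bin/start-balancer.sh -threshold 10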

Thanks,

Prashant


Re: Fw: Hadoop Cluster Question

Posted by Prashant Kommireddi <pr...@gmail.com>.
Apologies, I overlooked the "One" in Part A of the question, and answered
it for the case when the cluster is out of disk space.

Part A

   1. The job would not fail if there are other nodes where the task can
   run (Hadoop places tasks on other nodes when a particular node goes down).
   2. Similarly, a new job would use the other nodes.
   3. That node will be blacklisted after a few failures (this depends on
   mapred.max.tracker.blacklists; the default is 4).
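
For completeness, a sketch of how that property would be set in
mapred-site.xml on the JobTracker (the value shown is simply the default,
included for illustration):

   <property>
     <name>mapred.max.tracker.blacklists</name>
     <value>4</value>
   </property>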


Re: Fw: Hadoop Cluster Question

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
If just one (of many) nodes is full, the job won't fail, though individual
tasks might, and those will get re-run elsewhere.
Obviously that introduces unhappiness into your cluster, so avoid it.
It's *really* bad for HBase.

D
