Posted to mapreduce-user@hadoop.apache.org by "Satish Setty (HCL Financial Services)" <Sa...@hcl.com> on 2012/01/05 18:37:30 UTC

hadoop

Hello,

We are trying to use Hadoop-0.20.203.0rc1 for parallel computation. Below are our queries.

Assume a single high-end node with 8 cores and 8 GB of memory.

(a) How do we know the number of map tasks spawned? Can this be controlled? We notice only 4 JVMs running on a single node - namenode, datanode, jobtracker, tasktracker. As we understand it, one map task is spawned per input split, so we should see a corresponding increase in JVMs.

(b) Our mapper class has to perform complex computations and has plenty of dependent jars, so how do we add all the jars to the classpath while running the application? Since we need to perform parallel computations, we need many map tasks running in parallel on different data, all on the same machine in different JVMs.

(c) How does data splitting happen? JobClient does not mention data splits. As we understand it, we format the distributed file system, run start-all.sh and then "hadoop fs -put". Does this write data to all datanodes? We are unable to see the physical location. How does the split happen from this HDFS source?

(d) Can we control the number of reduce tasks? Does each run in a separate JVM? How are the optimal numbers of map and reduce tasks determined?

(e) Any good documentation/links that explain the namenode, datanode, jobtracker and tasktracker?

Kindly help.

Thanks


Re: hadoop

Posted by Thamizhannal Paramasivam <th...@gmail.com>.
Hi,

For (a) & (d), refer to http://wiki.apache.org/hadoop/HowManyMapsAndReduces

For (b), package your job as a .jar and invoke the hadoop command as below; the jar is
distributed to the nodes that run your tasks.
E.g. $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
/usr/joe/wordcount/input /usr/joe/wordcount/output

For (c), as soon as you put files into HDFS they are split into blocks and stored on the
data nodes. You need not worry about the physical location of the data. You can verify
your input files with the hadoop fs -ls and -cat commands.
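
For instance (the paths here are illustrative, not from your setup):

$ hadoop fs -put localdata.txt /user/satish/input/
$ hadoop fs -ls /user/satish/input
$ hadoop fs -cat /user/satish/input/localdata.txt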

For (e)
http://wiki.apache.org/hadoop/
http://www.cloudera.com/resources/

thanks,
Thamizh


Re: hadoop

Posted by be...@gmail.com.
Hi Satish
       After changing dfs.block.size to 40, did you recopy the files? Changing dfs.block.size won't affect the files already in HDFS; it applies only to new files you copy in. In short, with dfs.block.size=40,
mapred.min.split.size=0 and mapred.max.split.size=40 in place, do a copyFromLocal and try executing your job on this newly copied data.
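
For example (the paths and the jar/class names are illustrative, not from your job; the -D options are picked up only if your driver goes through ToolRunner/GenericOptionsParser, otherwise set the values in the JobConf or in mapred-site.xml):

$ hadoop fs -copyFromLocal input.txt /user/satish/input/
$ hadoop jar simulation.jar org.myorg.SimulationDriver \
    -D mapred.min.split.size=0 -D mapred.max.split.size=40 \
    /user/satish/input /user/satish/output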


Regards
Bejoy K S

-----Original Message-----
From: "Satish Setty (HCL Financial Services)" <Sa...@hcl.com>
Date: Tue, 10 Jan 2012 08:57:37 
To: Bejoy Ks<be...@gmail.com>
Cc: mapreduce-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: hadoop

  
Hi Bejoy, 
  

 Thanks for the help. I changed the values to mapred.min.split.size=0 and mapred.max.split.size=40, but the job counters do not reflect any change.
For posting, kindly let me know the correct link/mail-id - at present I am sending directly to your account [Bejoy Ks, bejoy.hadoop@gmail.com], which has been a great help to me.

Posting to the group account mapreduce-user@hadoop.apache.org bounces back.
  
 
 
 Counter group               Counter                             Map   Reduce    Total
 File Input Format Counters  Bytes Read                           61        0       61
 Job Counters                SLOTS_MILLIS_MAPS                     0        0    3,886
                             Launched map tasks                    0        0        2
                             Data-local map tasks                  0        0        2
 FileSystemCounters          HDFS_BYTES_READ                     267        0      267
                             FILE_BYTES_WRITTEN               58,134        0   58,134
 Map-Reduce Framework        Map output materialized bytes         0        0        0
                             Combine output records                0        0        0
                             Map input records                     9        0        9
                             Spilled Records                       0        0        0
                             Map output bytes                     70        0       70
                             Map input bytes                      54        0       54
                             SPLIT_RAW_BYTES                     206        0      206
                             Map output records                    7        0        7
                             Combine input records                 0        0        0
 
----------------
 From: Bejoy Ks [bejoy.hadoop@gmail.com]
 Sent: Monday, January 09, 2012 11:13 PM
 To: Satish Setty (HCL Financial Services)
 Cc: mapreduce-user@hadoop.apache.org
 Subject: Re: hadoop
 
 
 
Hi Satish
       It would be good if you don't cross post your queries. Just post it once on the right list.
 
       What is your value for mapred.max.split.size? Try setting these values as well 
 mapred.min.split.size=0 (it is the default value)
 mapred.max.split.size=40
 
 Try executing your job once you have applied these changes on top of the others you made. 
 
 Regards
 Bejoy.K.S
 
 
On Mon, Jan 9, 2012 at 5:09 PM, Satish Setty (HCL Financial Services) <Satish.Setty@hcl.com <ma...@hcl.com> > wrote:
 
 
Hi Bejoy, 
  
Even with the settings below, map tasks never go beyond 2. Is there any way to make this spawn 10 tasks? Basically it should work like a compute grid - computation in parallel. 
  
<property>
   <name>io.bytes.per.checksum</name>
   <value>30</value>
   <description>The number of bytes per checksum.  Must not be larger than
   io.file.buffer.size.</description>
 </property> 

 <property>
   <name>dfs.block.size</name>
    <value>30</value>
   <description>The default block size for new files.</description>
 </property>
 
<property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>10</value>
   <description>The maximum number of map tasks that will be run
   simultaneously by a task tracker.
   </description>
 </property>
 
  
 
----------------
 
From: Satish Setty (HCL Financial Services)
 Sent: Monday, January 09, 2012 1:21 PM 
 
 

 To: Bejoy Ks
 Cc: mapreduce-user@hadoop.apache.org <ma...@hadoop.apache.org> 
 Subject: RE: hadoop
 
 
 
 
 
 
 
Hi Bejoy, 
  
In HDFS I have set the block size to 40 bytes. The input data set is as below: 
data1   (5*8=40 bytes) 
data2 
...... 
data10 
  
  
But I still see only 2 map tasks spawned; there should have been at least 10. Not sure how this works internally. Splitting on line feeds does not work [as you have explained below]. 
  
Thanks 
 
----------------
 From: Satish Setty (HCL Financial Services)
 Sent: Saturday, January 07, 2012 9:17 PM
 To: Bejoy Ks
 Cc: mapreduce-user@hadoop.apache.org <ma...@hadoop.apache.org> 
 Subject: RE: hadoop
 
 
 
 
Thanks Bejoy - great information - will try out. 
  
For the problem below I meant a single node with a high-end configuration -> 8 CPUs and 8 GB memory. Hence I am taking an example of 10 data items separated by line feeds. We want to utilize the full power of the machine - hence we want at least 10 map tasks, each performing a highly complex mathematical simulation. At present it looks like the split size (in bytes) of the file data is the only way to specify the number of map tasks - but I would prefer some criterion like a line feed or similar. 
  
In the example below, 'data1' corresponds to 5*8=40 bytes; if I have data1 ... data10, in theory I should see 10 map tasks with a split size of 40 bytes. 
  
How do I perform logging - where is the log (Apache logger) data written? System.out output may not appear since the tasks run as background processes. 
  
Regards 
  
  
 
----------------
 From: Bejoy Ks [bejoy.hadoop@gmail.com <ma...@gmail.com> ]
 Sent: Saturday, January 07, 2012 7:35 PM
 To: Satish Setty (HCL Financial Services)
 Cc: mapreduce-user@hadoop.apache.org <ma...@hadoop.apache.org> 
 Subject: Re: hadoop
 
 
 
Hi Satish
       Please find some pointers inline
 
 Problem - As per the documentation, file splits correspond to the number of map tasks. The file split is governed by the block size - 64 MB in hadoop-0.20.203.0. Where can I find the default settings for various parameters like block size and the number of map/reduce tasks?
 
 [Bejoy] I'd rather state it the other way round: the number of map tasks triggered by an MR job is determined by the number of input splits (and the input format). If you use TextInputFormat with default settings, the number of input splits is equal to the number of HDFS blocks occupied by the input. The size of an input split is equal to the HDFS block size by default (64 MB). If you want more splits for one HDFS block itself, you need to set a value less than 64 MB for mapred.max.split.size. 
 
 You can find pretty much all default configuration values from the downloaded .tar at
 hadoop-0.20.*/src/mapred/mapred-default.xml
 hadoop-0.20.*/src/hdfs/hdfs-default.xml
 hadoop-0.20.*/src/core/core-default.xml
 
 If you want to alter some of these values then you can provide the same in 
 $HADOOP_HOME/conf/mapred-site.xml
 $HADOOP_HOME/conf/hdfs-site.xml
 $HADOOP_HOME/conf/core-site.xml
 
 The values provided in *-site.xml override the values in *-default.xml for the corresponding configuration parameter, as long as the parameter is not marked final.
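
 For example, a mapred-site.xml entry for the split-size setting discussed above could look like this (the value 40 is just the figure used in this thread):

 <property>
   <name>mapred.max.split.size</name>
   <value>40</value>
   <description>Maximum size, in bytes, of an input split.</description>
 </property>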
 
 I require at least 10 map tasks, which is the same as the number of "line feeds". Each corresponds to a complex calculation to be done by a map task, so I can have optimal CPU utilization - 8 CPUs.
 
 [Bejoy] Hadoop is a good choice for processing large amounts of data. It is not wise to choose one mapper per record/line in a file, as creating a map task is itself expensive, with JVM spawning and so on. Currently you may have 10 records in your input, but I believe you are just testing Hadoop in a dev environment; in production that wouldn't be the case - there could be n files with m records each, and this m can be in the millions (just assuming, based on my experience). On larger data sets you may not need to split on line boundaries. There can be multiple lines in a file, and if you use TextInputFormat it is just one line processed by a map task at an instant. If you have n map tasks, then n lines could be getting processed at any instant of the map execution time frame, one by each map task. On larger data volumes map tasks are spawned on specific nodes primarily based on data locality, then on available task slots on the data-local node, and so on. It is possible that if you have a 10-node cluster and 10 HDFS blocks corresponding to an input file, and all the blocks happen to be present on only 8 nodes with sufficient task slots available on all 8, then the tasks for your job may be executed on those 8 nodes alone instead of 10. So there are chances that there won't be 100% balanced CPU utilization across the nodes in a cluster. 
                I'm not really sure how you can spawn map tasks based on line feeds in a file. Let us wait for others to comment on this. 
            Also, if you are using MapReduce for parallel computation alone, make sure you set the number of reducers to zero; with that you can save a lot of time that would otherwise be spent on the sort and shuffle phases. 
 (-D  mapred.reduce.tasks=0)
 
 
The behaviour of map tasks looks strange to me: sometimes if I set the number of map tasks in the program via jobconf.set(...) it takes 2 or 8.  
 
 [Bejoy] There is no default value for the number of map tasks; it is determined by the input splits and the input format used by your job. You cannot control the number of map tasks even if you set mapred.map.tasks at the job level - it is not considered. But you can definitely specify the number of reduce tasks at the job level with job.setNumReduceTasks(n) or mapred.reduce.tasks; if not set, the default value for reduce tasks from the conf files is used.
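
 As an illustration, a minimal map-only driver using the old mapred API could look like the sketch below (the class names and the per-record computation are placeholders, not something from this thread):

 import java.io.IOException;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.*;

 public class SimulationDriver {

   // Each call to map() receives one line of the input split.
   public static class SimulationMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, Text> {
     public void map(LongWritable offset, Text line,
                     OutputCollector<Text, Text> output, Reporter reporter)
         throws IOException {
       // The complex computation would go here; this sketch just tags the record.
       output.collect(line, new Text("done"));
     }
   }

   public static void main(String[] args) throws IOException {
     JobConf conf = new JobConf(SimulationDriver.class);
     conf.setJobName("simulation");

     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);
     conf.setMapperClass(SimulationMapper.class);
     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(Text.class);

     conf.setNumReduceTasks(0);                // map-only job: no sort/shuffle phase
     conf.set("mapred.max.split.size", "40");  // split-size value suggested earlier in the thread

     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
     JobClient.runJob(conf);
   }
 }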
 
 
 I see some files like part-00001... 
Are they partitions? 
 [Bejoy] The part-000* files correspond to reducers. You'd have n files if you have n reducers, as each reducer produces one output file.
 
 Hope it helps!..
 
 Regards
 Bejoy.KS
 
 
 
On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services) <Satish.Setty@hcl.com <ma...@hcl.com> > wrote:
 
 
Hi Bejoy, 
 
 
 
 
  
Just finished installation and tested sample applications. 
  
Problem - As per the documentation, file splits correspond to the number of map tasks. The file split is governed by the block size - 64 MB in hadoop-0.20.203.0. Where can I find the default settings for various parameters like block size and the number of map/reduce tasks? 
  
Is it possible to control the file split by "line feed - \n"? I tried giving the sample input below -> jobconf -> TextInputFormat 
  
date1   
date2 
date3 
....... 
...... 
date10 
  
But when I run I see the number of map tasks = 2 or 1. 
I require at least 10 map tasks, which is the same as the number of "line feeds". Each corresponds to a complex calculation to be done by a map task, so I can have optimal CPU utilization - 8 CPUs. 
  
The behaviour of map tasks looks strange to me: sometimes if I set the number of map tasks in the program via jobconf.set(...) it takes 2 or 8. I also see some files like part-00001... 
Are they partitions? 
  
Thanks 
 
----------------
 From: Satish Setty (HCL Financial Services)
 Sent: Friday, January 06, 2012 12:29 PM
 To: bejoy.hadoop@gmail.com <ma...@gmail.com> 
 Subject: FW: hadoop
 
 
 
 
  
 
 
 
 
 
Thanks Bejoy. Extremely useful information. We will try it and come back. Does the web application [jobtracker web UI] require separate deployment, or does an application server container come built in with Hadoop? 
  
Regards 
  
 
----------------
 From: Bejoy Ks [bejoy.hadoop@gmail.com <ma...@gmail.com> ]
 Sent: Friday, January 06, 2012 12:54 AM
 To: mapreduce-user@hadoop.apache.org <ma...@hadoop.apache.org> 
 Subject: Re: hadoop
 
 
 
 
 
 
Hi Satish
         Please find some pointers in line
 
 (a) How do we know the number of map tasks spawned? Can this be controlled? We notice only 4 JVMs running on a single node - namenode, datanode, jobtracker, tasktracker. As we understand it, one map task is spawned per input split, so we should see a corresponding increase in JVMs.
 
 [Bejoy] The namenode, datanode, jobtracker, tasktracker and secondaryNameNode are the default Hadoop processes; they do not depend on your tasks. Your map and reduce tasks are launched in separate JVMs. You can control the maximum number of mappers running on each tasktracker at an instant by setting mapred.tasktracker.map.tasks.maximum. By default every task (map or reduce) executes in its own JVM, and once the task completes the JVM is destroyed. You are right, by default one map task is launched per input split.
 Just check the jobtracker web UI (http://nameNodeHostName:50030/jobtracker.jsp); it gives you all the details of the job, including the number of map tasks spawned. If you want to run multiple tasktracker and datanode instances on the same machine, you need to ensure that there are no port conflicts.
 
 (b) Our mapper class has to perform complex computations and has plenty of dependent jars, so how do we add all the jars to the classpath while running the application? Since we need to perform parallel computations, we need many map tasks running in parallel on different data, all on the same machine in different JVMs.
 
 [Bejoy] If these dependent jars are used by almost all your applications, include them in the classpath of all your nodes (in your case just one node). Alternatively you can use the -libjars option while submitting your job. For more details refer to
 http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
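
 For example (the jar names are illustrative; -libjars is honoured when the job is submitted through ToolRunner/GenericOptionsParser):

 $ hadoop jar simulation.jar org.myorg.SimulationDriver \
     -libjars /path/to/math-lib.jar,/path/to/other-dep.jar \
     /user/satish/input /user/satish/output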
 
 (c) How does data splitting happen? JobClient does not mention data splits. As we understand it, we format the distributed file system, run start-all.sh and then "hadoop fs -put". Does this write data to all datanodes? We are unable to see the physical location. How does the split happen from this HDFS source?
 
 [Bejoy] Input files are split into blocks during the copy into HDFS itself; the size of each block is determined by the Hadoop configuration of your cluster. The namenode decides which datanodes each block and its replicas should be placed on, and these details are passed on to the client. The client copies each block to one datanode, and from that datanode the block is replicated to the other datanodes. The splitting of a file happens at the HDFS API level.
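
 If you do want to see where the blocks of a file physically live, hadoop fsck can report it, for example (the path is illustrative):

 $ hadoop fsck /user/satish/input/input.txt -files -blocks -locations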
 
 thanks 

 
 
 
 

Re: hadoop

Posted by Bejoy Ks <be...@gmail.com>.
Hi Satish
      It would be good if you don't cross post your queries. Just post it
once on the right list.

      What is your value for mapred.max.split.size? Try setting these
values as well
mapred.min.split.size=0 (it is the default value)
mapred.max.split.size=40

Try executing your job once you apply these changes on top of others you
did.

Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 5:09 PM, Satish Setty (HCL Financial Services) <
Satish.Setty@hcl.com> wrote:

>  Hi Bejoy,
>
> Even with below settings map tasks never go beyound 2, any way to make
> this spawn 10 tasks. Basically it should look like compute grid -
> computation in parallel.
>
> <property>
>   <name>io.bytes.per.checksum</name>
>   <value>30</value>
>   <description>The number of bytes per checksum.  Must not be larger than
>   io.file.buffer.size.</description>
> </property>
>
> <property>
>   <name>dfs.block.size</name>
>    <value>30</value>
>   <description>The default block size for new files.</description>
> </property>
>  <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>10</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
>
>  ------------------------------
> *From:* Satish Setty (HCL Financial Services)
> *Sent:* Monday, January 09, 2012 1:21 PM
>
> *To:* Bejoy Ks
> *Cc:* mapreduce-user@hadoop.apache.org
> *Subject:* RE: hadoop
>
>   Hi Bejoy,
>
> In hdfs I have set block size - 40bytes . Input Data set is as below
> data1   (5*8=40 bytes)
> data2
> ......
> data10
>
>
> But still I see only 2 map tasks spawned, should have been atleast 10 map
> tasks. Not sure how works internally. Line feed does not work [as you have
> explained below]
>
> Thanks
>  ------------------------------
> *From:* Satish Setty (HCL Financial Services)
> *Sent:* Saturday, January 07, 2012 9:17 PM
> *To:* Bejoy Ks
> *Cc:* mapreduce-user@hadoop.apache.org
> *Subject:* RE: hadoop
>
>   Thanks Bejoy - great information - will try out.
>
> I meant for below problem single node with high configuration -> 8 cpus
> and 8gb memory. Hence taking an example of 10 data items with line feeds.
> We want to utilize full power of machine - hence want at least 10 map tasks
> - each task needs to perform highly complex mathematical simulation.  At
> present it looks like file data is the only way to specify number of map
> tasks via splitsize (in bytes) - but I prefer some criteria like line feed
> or whatever.
>
> In below example - 'data1' corresponds to 5*8=40bytes, if I have data1
> .... data10 in theory I need to see 10 map tasks with split size of 40bytes.
>
> How do I perform logging - where is the log (apache logger) data written?
> system outs may not come as it is background process.
>
> Regards
>
>
>  ------------------------------
> *From:* Bejoy Ks [bejoy.hadoop@gmail.com]
> *Sent:* Saturday, January 07, 2012 7:35 PM
> *To:* Satish Setty (HCL Financial Services)
> *Cc:* mapreduce-user@hadoop.apache.org
> *Subject:* Re: hadoop
>
>  Hi Satish
>       Please find some pointers inline
>
> Problem - As per documentation filesplits corresponds to number of map
> tasks.  File split is governed  by bock size - 64mb in hadoop-0.20.203.0.
> Where can I find default settings for variour parameters like block size,
> number of map/reduce tasks.
>
> [Bejoy] I'd rather state it other way round, the number of map tasks
> triggered by a MR job is determined by number of input splits (and input
> format). If you use TextInputFormat with default settings the number of
> input splits is equal to the no of hdfs blocks occupied by the input. Size
> of an input split is equal to hdfs block size in default(64Mb). If you want
> to have more splits for one hdfs block itself you need to set a value less
> than 64 Mb for mapred.max.split.size.
>
> You can find pretty much all default configuration values from the
> downloaded .tar at
> hadoop-0.20.*/src/mapred/mapred-default.xml
> hadoop-0.20.*/src/hdfs/hdfs-default.xml
> hadoop-0.20.*/src/core/core-default.xml
>
> If you want to alter some of these values then you can provide the same in
> $HADOOP_HOME/conf/mapred-site.xml
> $HADOOP_HOME/conf/hdfs-site.xml
> $HADOOP_HOME/conf/core-site.xml
>
> These values provided in *-site.xml would be taken into account only if
> they are not marked in *-default.xml. If not final, the values provided in
> *-site.xml overrides the values in *-default.xml for corresponding
> configuration parameter.
>
> I require atleast  10 map taks which is same as number of "line feeds".
> Each corresponds to complex calculation to be done by map task. So I can
> have optimal cpu utilization - 8 cpus.
>
> [Bejoy] Hadoop is a good choice processing large amounts of data. It is
> not wise to choose one mapper for one record/line in a file, as creation of
> a map task itself is expensive with jvm spanning and all. Currently you may
> have 10 records in your input but I believe you are just testing Hadoop in
> dev env and in production that wouldn't be the case there could be n files
> having m records each and this m can be in millions.(Just assuming based on
> my experience). On larger data sets you may not need to split on line
> boundaries. There can be multiple lines in a file and if you use
> TextInputFormat it is just one line processed by a map task at an instant.
> If you have n map tasks then n lines could be getting processed at an
> instant of map task execution time frame one by each map task. In larger
> data volumes map tasks are spanned in specific nodes primarily based on
> data locality, then on available tasks slots on data local node and so on.
> It is possible that if you have a 10 node cluster, 10 hdfs blocks
> corresponding to a input file and assume that all the blocks are present
> only on 8 nodes and there are sufficient task slots available on all 8 ,
> then tasks for your job may be executed in 8 nodes alone instead of 10. So
> there are chances that there won't be 100% balanced CPU utilization across
> nodes in a cluster.
>                I'm not really sure how you can spawn map tasks based on
> line feeds in a file .Let us wait for others  to comment on this.
>            Also if your using map reduce for parallel computation alone
> the make sure you set the number of reducers to zero, with that you can
> save a lot of time that would be other wise spend on sort and shuffle
> phases.
> (-D  mapred.reduce.tasks=0)
>
>  Behaviour of maptasks looks strange to be as some times if I give in
> program jobconf.set(num map tasks) it takes 2 or 8.
>
> [Bejoy]There is no default value for number of map tasks, it is determined
> by input splits and  input format used by your job. You cannot set the
> number of map tasks even if you set them at your job level, it is not
> considered. (mapred.map.tasks) . But you can definitely specify the number
> of reduce tasks at your job level  by job.setNumReduceTasks(n) or
> mapred.reduce.tasks. If not set it would take the default value for reduce
> tasks specified in conf files.
>
>
> I see some files like part-00001...
> Are they partitions?
>
> [Bejoy] The part-000* files corresponds to reducers. You'd have n files if
> you have n reducers as one reducer produces one output file.
>
> Hope it helps!..
>
> Regards
> Bejoy.KS
>
>
> On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services) <
> Satish.Setty@hcl.com> wrote:
>
>>  Hi Bijoy,
>>
>> Just finished installation and tested sample applications.
>>
>> Problem - As per documentation filesplits corresponds to number of map
>> tasks.  File split is governed  by bock size - 64mb in hadoop-0.20.203.0.
>> Where can I find default settings for variour parameters like block size,
>> number of map/reduce tasks.
>>
>> Is it possible to control filesplit by "line feed - \n". I tried giving
>> sample input -> jobconf -> TextInputFormat
>>
>> date1
>> date2
>> date3
>> .......
>> ......
>> date10
>>
>> But when I run I see number of maptasks=2 or 1.
>> I require atleast  10 map taks which is same as number of "line feeds".
>> Each corresponds to complex calculation to be done by map task. So I can
>> have optimal cpu utilization - 8 cpus.
>>
>> Behaviour of maptasks looks strange to be as some times if I give in
>> program jobconf.set(num map tasks) it takes 2 or 8.  I see some files like
>> part-00001...
>> Are they partitions?
>>
>> Thanks
>>  ------------------------------
>> *From:* Satish Setty (HCL Financial Services)
>> *Sent:* Friday, January 06, 2012 12:29 PM
>> *To:* bejoy.hadoop@gmail.com
>> *Subject:* FW: hadoop
>>
>>
>>    Thanks Bejoy. Extremely useful information. We will try and come
>> back. WebApp application [jobtracker web UI ] does this require
>> deployment or application server container comes inbuilt with hadoop?
>>
>> Regards
>>
>>  ------------------------------
>> *From:* Bejoy Ks [bejoy.hadoop@gmail.com]
>> *Sent:* Friday, January 06, 2012 12:54 AM
>> *To:* mapreduce-user@hadoop.apache.org
>> *Subject:* Re: hadoop
>>
>>     Hi Satish
>>         Please find some pointers in line
>>
>> (a) How do we know number of  map tasks spawned?  Can this be controlled?
>> We notice only 4 jvms running on a single node - namenode, datanode,
>> jobtracker, tasktracker. As we understand depending on number of splits
>> that many map tasks are spawned - so we should see that many increase in
>> jvms.
>>
>> [Bejoy] namenode, datanode, jobtracker, tasktracker, secondaryNameNode
>> are the default process on hadoop it is not dependent on your tasks and
>> your tasks are custom tasks are launched in separate jvms. You can control
>> the maximum number of mappers on each tasktracker at an instance by setting
>> mapred.tasktracker.map.tasks.maximum. In default all the tasks (map or
>> reduce) are executed on individual jvms and once the task is completed the
>> jvms are destroyed. You are right, in default one map task is launched per
>> input split.
>> Just check the jobtracker web UI (
>> http://nameNodeHostName:50030/jobtracker.jsp), it would give you you all
>> details on the job including the number of map tasks spanned by a job. If
>> you want to run multiple task tracker and data node instances on the same
>> machine you need to ensure that there are no port conflicts.
>>
>> (b) Our mapper class should perform complex computations - it has plenty
>> of dependent jars so how do we add all jars in class path  while running
>> application? Since we require to perform parallel computations - we need
>> many map tasks running in parallel with different data. All are in same
>> machine with different jvms.
>>
>> [Bejoy] If these dependent jars are used by almost all of your applications,
>> include them in the classpath of all your nodes (in your case just one
>> node). Alternatively you can use the -libjars option while submitting your
>> job. For more details refer to
>>
>> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>>
>> (c) How does data split happen?  JobClient does not talk about data
>> splits? As we understand we create format for distributed file system,
>> start-all.sh and then "hadoop fs -put". Do this write data to all
>> datanodes? But we are unable to see physical location? How does split
>> happen from this hdfs source?
>>
>> [Bejoy] Input files are split into blocks during the copy into hdfs itself;
>> the size of each block is determined by the hadoop configuration of your
>> cluster. The name node decides which datanodes each block and its replicas
>> are to be placed on, and these details are passed on to the client. The
>> client copies each block to one datanode, and from that datanode the block
>> is replicated to the other datanodes. The splitting of a file happens at
>> the HDFS API level.
>>
>>  thanks
>>
>>
>
>

Re: hadoop

Posted by Bejoy Ks <be...@gmail.com>.
Hi Satish
      Please find some pointers inline

Problem - As per the documentation, file splits correspond to the number of
map tasks.  File splitting is governed by block size - 64 MB in
hadoop-0.20.203.0. Where can I find the default settings for various
parameters like block size and number of map/reduce tasks?

[Bejoy] I'd rather state it the other way round: the number of map tasks
triggered by an MR job is determined by the number of input splits (and the
input format). If you use TextInputFormat with default settings, the number of
input splits is equal to the number of HDFS blocks occupied by the input. The
size of an input split is equal to the HDFS block size by default (64 MB). If
you want more than one split per HDFS block, set mapred.max.split.size to a
value smaller than 64 MB.
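
Just as a rough, untested sketch (new org.apache.hadoop.mapreduce API; the
input/output paths and job name below are only placeholders, and the split
size is honoured only by split-size-aware input formats), it could be set at
job level like this:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Cap each input split at 16 MB, so one 64 MB block yields ~4 splits
      // and hence ~4 map tasks.
      conf.setLong("mapred.max.split.size", 16L * 1024 * 1024);
      Job job = new Job(conf, "split-size-demo");
      job.setJarByClass(SplitSizeDemo.class);
      job.setInputFormatClass(TextInputFormat.class);
      // Identity mapper/reducer by default; paths are placeholders.
      FileInputFormat.addInputPath(job, new Path("/user/satish/input"));
      FileOutputFormat.setOutputPath(job, new Path("/user/satish/output"));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }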

You can find pretty much all default configuration values from the
downloaded .tar at
hadoop-0.20.*/src/mapred/mapred-default.xml
hadoop-0.20.*/src/hdfs/hdfs-default.xml
hadoop-0.20.*/src/core/core-default.xml

If you want to alter some of these values then you can provide the same in
$HADOOP_HOME/conf/mapred-site.xml
$HADOOP_HOME/conf/hdfs-site.xml
$HADOOP_HOME/conf/core-site.xml

The values provided in *-site.xml are taken into account only if the
corresponding parameter is not marked final in *-default.xml. If it is not
final, the value provided in *-site.xml overrides the value in *-default.xml
for that configuration parameter.
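
For instance, an hdfs-site.xml override could look like this (the values are
purely illustrative):

  <?xml version="1.0"?>
  <!-- $HADOOP_HOME/conf/hdfs-site.xml : illustrative values only -->
  <configuration>
    <property>
      <name>dfs.block.size</name>
      <value>33554432</value>      <!-- 32 MB blocks instead of the 64 MB default -->
    </property>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <final>true</final>          <!-- marked final: cannot be overridden later -->
    </property>
  </configuration>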

I require at least 10 map tasks, which is the same as the number of "line
feeds". Each corresponds to a complex calculation to be done by a map task, so
I can have optimal CPU utilization - 8 CPUs.

[Bejoy] Hadoop is a good choice for processing large amounts of data. It is
not wise to use one mapper per record/line in a file, as creating a map task
is itself expensive (JVM spawning and so on). Currently you may have 10
records in your input, but I believe you are just testing Hadoop in a dev
environment; in production that wouldn't be the case - there could be n files
with m records each, and m can be in the millions (just assuming, based on my
experience). On larger data sets you would not need to split on line
boundaries. There can be multiple lines in a split, and with TextInputFormat a
map task processes just one line at an instant; if you have n map tasks, then
n lines can be getting processed at any instant of the map phase, one by each
map task. On larger data volumes, map tasks are scheduled on specific nodes
primarily based on data locality, then on the available task slots on the
data-local node, and so on. It is possible that if you have a 10-node cluster
and 10 HDFS blocks for an input file, and all the blocks happen to be present
on only 8 nodes with sufficient task slots available on all 8, then the tasks
for your job may execute on those 8 nodes alone instead of 10. So there is a
chance that CPU utilization won't be 100% balanced across the nodes of a
cluster.
               I'm not really sure how you can spawn map tasks based on line
feeds in a file. Let us wait for others to comment on this.
           Also, if you are using map reduce for parallel computation alone,
make sure you set the number of reducers to zero; with that you can save a lot
of time that would otherwise be spent on the sort and shuffle phases
(-D mapred.reduce.tasks=0).
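
A bare-bones map-only driver sketch (old org.apache.hadoop.mapred API,
untested; IdentityMapper and the argument paths are placeholders for your own
mapper and data) would look something like:

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MapOnlyDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      JobConf conf = new JobConf(getConf(), MapOnlyDriver.class);
      conf.setJobName("map-only-demo");
      conf.setInputFormat(TextInputFormat.class);
      conf.setMapperClass(IdentityMapper.class);  // replace with your own computation
      conf.setNumReduceTasks(0);                  // map-only: no sort/shuffle/reduce
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
      return 0;
    }
    public static void main(String[] args) throws Exception {
      // ToolRunner parses generic options, so "-D mapred.reduce.tasks=0" given
      // on the command line would also be honoured.
      System.exit(ToolRunner.run(new MapOnlyDriver(), args));
    }
  }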

The behaviour of map tasks looks strange to me, as sometimes if I set
jobconf.set(num map tasks) in the program it takes 2 or 8.

[Bejoy] There is no default value for the number of map tasks; it is
determined by the input splits and the input format used by your job. Even if
you set mapred.map.tasks at job level it is treated only as a hint and is not
guaranteed to be honoured. But you can definitely specify the number of reduce
tasks at job level with job.setNumReduceTasks(n) or mapred.reduce.tasks; if
not set, the default value for reduce tasks specified in the conf files is
used.


I see some files like part-00001...
Are they partitions?

[Bejoy] The part-000* files correspond to reducers. You'd have n files if
you have n reducers, as one reducer produces one output file.

Hope it helps!..

Regards
Bejoy.KS


On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services) <
Satish.Setty@hcl.com> wrote:

>  Hi Bejoy,
>
> Just finished installation and tested sample applications.
>
> Problem - As per the documentation, file splits correspond to the number of
> map tasks.  File splitting is governed by block size - 64 MB in
> hadoop-0.20.203.0. Where can I find the default settings for various
> parameters like block size and number of map/reduce tasks?
>
> Is it possible to control the file split by "line feed - \n"? I tried giving
> sample input -> jobconf -> TextInputFormat
>
> date1
> date2
> date3
> .......
> ......
> date10
>
> But when I run it I see the number of map tasks = 2 or 1.
> I require at least 10 map tasks, which is the same as the number of "line
> feeds". Each corresponds to a complex calculation to be done by a map task,
> so I can have optimal CPU utilization - 8 CPUs.
>
> The behaviour of map tasks looks strange to me, as sometimes if I set
> jobconf.set(num map tasks) in the program it takes 2 or 8.  I see some files
> like part-00001...
> Are they partitions?
>
> Thanks
>  ------------------------------
> *From:* Satish Setty (HCL Financial Services)
> *Sent:* Friday, January 06, 2012 12:29 PM
> *To:* bejoy.hadoop@gmail.com
> *Subject:* FW: hadoop
>
>
>  Thanks Bejoy. Extremely useful information. We will try and come back.
> Regarding the web application [jobtracker web UI] - does this require
> separate deployment, or does an application server container come built in
> with hadoop?
>
> Regards
>
>  ------------------------------
> *From:* Bejoy Ks [bejoy.hadoop@gmail.com]
> *Sent:* Friday, January 06, 2012 12:54 AM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Re: hadoop
>
>  Hi Satish
>         Please find some pointers in line
>
> (a) How do we know number of  map tasks spawned?  Can this be controlled?
> We notice only 4 jvms running on a single node - namenode, datanode,
> jobtracker, tasktracker. As we understand depending on number of splits
> that many map tasks are spawned - so we should see that many increase in
> jvms.
>
> [Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode are
> the default daemons in hadoop; they do not depend on your tasks. Your custom
> tasks are launched in separate jvms. You can control the maximum number of
> mappers running on each tasktracker at an instant by setting
> mapred.tasktracker.map.tasks.maximum. By default all tasks (map or reduce)
> are executed in individual jvms, and once a task is completed its jvm is
> destroyed. You are right, by default one map task is launched per input
> split.
> Just check the jobtracker web UI (
> http://nameNodeHostName:50030/jobtracker.jsp); it will give you all the
> details on the job, including the number of map tasks spawned by it. If
> you want to run multiple tasktracker and datanode instances on the same
> machine you need to ensure that there are no port conflicts.
>
> (b) Our mapper class should perform complex computations - it has plenty
> of dependent jars so how do we add all jars in class path  while running
> application? Since we require to perform parallel computations - we need
> many map tasks running in parallel with different data. All are in same
> machine with different jvms.
>
> [Bejoy] If these dependent jars are used by almost all of your applications,
> include them in the classpath of all your nodes (in your case just one
> node). Alternatively you can use the -libjars option while submitting your
> job. For more details refer to
>
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>
> (c) How does data split happen?  JobClient does not talk about data
> splits? As we understand we create format for distributed file system,
> start-all.sh and then "hadoop fs -put". Do this write data to all
> datanodes? But we are unable to see physical location? How does split
> happen from this hdfs source?
>
> [Bejoy] Input files are split into blocks during the copy into hdfs itself;
> the size of each block is determined by the hadoop configuration of your
> cluster. The name node decides which datanodes each block and its replicas
> are to be placed on, and these details are passed on to the client. The
> client copies each block to one datanode, and from that datanode the block
> is replicated to the other datanodes. The splitting of a file happens at
> the HDFS API level.
>
> thanks
>
>

Re: hadoop

Posted by Bejoy Ks <be...@gmail.com>.
Hi Satish
        Please find some pointers inline

(a) How do we know number of  map tasks spawned?  Can this be controlled?
We notice only 4 jvms running on a single node - namenode, datanode,
jobtracker, tasktracker. As we understand depending on number of splits
that many map tasks are spawned - so we should see that many increase in
jvms.

[Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode are
the default daemons in hadoop; they do not depend on your tasks. Your custom
tasks are launched in separate jvms. You can control the maximum number of
mappers running on each tasktracker at an instant by setting
mapred.tasktracker.map.tasks.maximum. By default all tasks (map or reduce) are
executed in individual jvms, and once a task is completed its jvm is
destroyed. You are right, by default one map task is launched per input split.
Just check the jobtracker web UI (
http://nameNodeHostName:50030/jobtracker.jsp); it will give you all the
details on the job, including the number of map tasks spawned by it. If you
want to run multiple tasktracker and datanode instances on the same machine
you need to ensure that there are no port conflicts.
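
For example, on your single 8-core box you could raise the per-tasktracker map
slots with something like this in $HADOOP_HOME/conf/mapred-site.xml (the value
is only an illustration; the tasktracker needs a restart to pick it up):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>   <!-- max map tasks running concurrently on this tasktracker -->
  </property>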

(b) Our mapper class should perform complex computations - it has plenty of
dependent jars so how do we add all jars in class path  while running
application? Since we require to perform parallel computations - we need
many map tasks running in parallel with different data. All are in same
machine with different jvms.

[Bejoy] If these dependent jars are used by almost all of your applications,
include them in the classpath of all your nodes (in your case just one node).
Alternatively you can use the -libjars option while submitting your job. For
more details refer to
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
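
For instance (the jar, class and path names below are just placeholders), a
submission carrying two dependency jars would look something like:

  hadoop jar my-app.jar com.example.MyDriver \
      -libjars /path/to/dep1.jar,/path/to/dep2.jar \
      /user/satish/input /user/satish/output

Note that -libjars, like the other generic options, is picked up only if the
driver goes through ToolRunner / GenericOptionsParser.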

(c) How does data split happen?  JobClient does not talk about data splits?
As we understand we create format for distributed file system, start-all.sh
and then "hadoop fs -put". Do this write data to all datanodes? But we are
unable to see physical location? How does split happen from this hdfs
source?

[Bejoy] Input files are split into blocks during the copy into hdfs itself;
the size of each block is determined by the hadoop configuration of your
cluster. The name node decides which datanodes each block and its replicas are
to be placed on, and these details are passed on to the client. The client
copies each block to one datanode, and from that datanode the block is
replicated to the other datanodes. The splitting of a file happens at the HDFS
API level.
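
To actually see where the blocks of a file ended up, something like this
should help (the paths are placeholders):

  hadoop fs -put /local/data/input.txt /user/satish/input/input.txt
  hadoop fsck /user/satish/input/input.txt -files -blocks -locations

fsck lists every block of the file and the datanode(s) holding its replicas;
on disk the block files live under the directories configured in dfs.data.dir
on those datanodes.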

(d) Can we control number of reduce tasks? Is this seperate jvm?  How are
 optimal numbers for  map and reduce tasks determined?

[Bejoy] You can control the maximum number of reduce tasks running
concurrently on a tasktracker using mapred.tasktracker.reduce.tasks.maximum.
You can control the number of reduce tasks at job level using
mapred.reduce.tasks. Unless you enable jvm reuse, every task runs in its own
jvm. The optimal number of reduce tasks for your job depends on the amount of
data that flows to your reducers and on other parameters. Make sure that your
tasks are not very short lived (a few seconds), as task initialization itself
is expensive. For more details refer to
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
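
As a rough illustration at job level (the class name and values are
placeholders):

  JobConf conf = new JobConf(MyDriver.class);
  conf.setNumReduceTasks(2);                          // same as mapred.reduce.tasks
  conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);  // -1: reuse a jvm for any
                                                      // number of tasks of the job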


(e) Any good documentation/links which speaks about namenode, datanode,
jobtracker and tasktracker.

[Bejoy] Refer to the ASF documentation on hadoop, http://wiki.apache.org/hadoop/ ,
the Yahoo Developer Network tutorial
http://developer.yahoo.com/hadoop/tutorial/module1.html ,
and for a one-stop reference get the book 'Hadoop - The Definitive Guide'
by Tom White.


Hope it helps

Regards
Bejoy.K.S