Posted to mapreduce-user@hadoop.apache.org by Bejoy Ks <be...@gmail.com> on 2012/01/12 16:38:13 UTC

Re: hadoop - increase number of map tasks.

Satish
         Can you post the values for all the parameters you get under
Cluster Summary in your jobtracker web UI?

Regards
Bejoy.K.S

2012/1/12 Satish Setty (HCL Financial Services) <Sa...@hcl.com>

>  Hi Bejoy,
>
> I have put each line item in a separate file [created 10 files] and gave the
> directory as input. Now 10 map tasks are created, but only 2 run at a time.
> Please see the start time / finish time below. Is there a setting at cluster
> level for this? The map slot count says "2" and I am not sure where to change it.
> A lot of CPU is idle - out of 8 CPUs at least 2 are idle/under-utilized, so in
> theory all 10 map tasks should be running concurrently.
>
> Thanks
>  Hadoop map task list for job_201201121044_0001 on localhost
>  (http://192.168.60.12:50030/jobdetails.jsp?jobid=job_201201121044_0001)
> ------------------------------
> Task                             Complete  Status (input split)                                           Start Time            Finish Time                          Counters
> task_201201121044_0001_m_000000   100.00%  hdfs://localhost:9000/user/soruser/hellodir1/hello10.txt:0+8  12-Jan-2012 10:46:10  12-Jan-2012 10:48:55 (2mins, 45sec)  12
> task_201201121044_0001_m_000001   100.00%  hdfs://localhost:9000/user/soruser/hellodir1/hello1.txt:0+7   12-Jan-2012 10:46:10  12-Jan-2012 10:48:52 (2mins, 42sec)  12
> task_201201121044_0001_m_000002   100.00%  hdfs://localhost:9000/user/soruser/hellodir1/hello2.txt:0+7   12-Jan-2012 10:48:52  12-Jan-2012 10:51:34 (2mins, 42sec)  12
> task_201201121044_0001_m_000003   100.00%  hdfs://localhost:9000/user/soruser/hellodir1/hello3.txt:0+7   12-Jan-2012 10:48:55  12-Jan-2012 10:51:25 (2mins, 30sec)  12
> task_201201121044_0001_m_000004     0.00%  hdfs://localhost:9000/user/soruser/hellodir1/hello4.txt:0+7   12-Jan-2012 10:51:25  -                                    12
> task_201201121044_0001_m_000005     0.00%  hdfs://localhost:9000/user/soruser/hellodir1/hello5.txt:0+7   12-Jan-2012 10:51:34  -                                    12
> task_201201121044_0001_m_000006     0.00%  -                                                              -                     -                                    0
> task_201201121044_0001_m_000007     0.00%  -                                                              -                     -                                    0
> task_201201121044_0001_m_000008     0.00%  -                                                              -                     -                                    0
> task_201201121044_0001_m_000009     0.00%  -                                                              -                     -                                    0
> ------------------------------
> This is Apache Hadoop (http://hadoop.apache.org/) release 0.20.203.0
>
>  ------------------------------
> *From:* Satish Setty (HCL Financial Services)
> *Sent:* Tuesday, January 10, 2012 8:57 AM
> *To:* Bejoy Ks
> *Cc:* mapreduce-user@hadoop.apache.org
> *Subject:* RE: hadoop
>
>
>
> Hi Bejoy,
>
>
>
>
> Thanks for the help. I changed the values to mapred.min.split.size=0 and
> mapred.max.split.size=40, but the job counters do not reflect any change.
>
> For posting, kindly let me know the correct link/mail-id - at present I am
> sending directly to your account [Bejoy Ks <bejoy.hadoop@gmail.com>], which
> has been a great help to me.
>
> Posting to the group account mapreduce-user@hadoop.apache.org
>   bounces back.
>
>
>
>  Counter                          Map     Reduce  Total
>  File Input Format Counters
>    Bytes Read                     61      0       61
>  Job Counters
>    SLOTS_MILLIS_MAPS              0       0       3,886
>    Launched map tasks             0       0       2
>    Data-local map tasks           0       0       2
>  FileSystemCounters
>    HDFS_BYTES_READ                267     0       267
>    FILE_BYTES_WRITTEN             58,134  0       58,134
>  Map-Reduce Framework
>    Map output materialized bytes  0       0       0
>    Combine output records         0       0       0
>    Map input records              9       0       9
>    Spilled Records                0       0       0
>    Map output bytes               70      0       70
>    Map input bytes                54      0       54
>    SPLIT_RAW_BYTES                206     0       206
>    Map output records             7       0       7
>    Combine input records          0       0       0
>  ------------------------------
> *From:* Bejoy Ks [bejoy.hadoop@gmail.com]
> *Sent:* Monday, January 09, 2012 11:13 PM
> *To:* Satish Setty (HCL Financial Services)
> *Cc:* mapreduce-user@hadoop.apache.org
> *Subject:* Re: hadoop
>
>  Hi Satish
>       It would be good if you don't cross post your queries. Just post it
> once on the right list.
>
>       What is your value for mapred.max.split.size? Try setting these
> values as well
> mapred.min.split.size=0 (it is the default value)
> mapred.max.split.size=40
>
> Try executing your job once you apply these changes on top of the other
> changes you already made.
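>
> If your driver implements Tool and is run through ToolRunner (so that generic
> options are parsed), one way to pass these values is on the command line; just a
> sketch, the jar and class names here are placeholders for your own job:
>
> hadoop jar myjob.jar MyDriver -D mapred.min.split.size=0 -D mapred.max.split.size=40 /input /output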
>
> Regards
> Bejoy.K.S
>
> On Mon, Jan 9, 2012 at 5:09 PM, Satish Setty (HCL Financial Services) <
> Satish.Setty@hcl.com> wrote:
>
>>  Hi Bejoy,
>>
>> Even with the settings below, the number of map tasks never goes beyond 2.
>> Is there any way to make this spawn 10 tasks? Basically it should behave like
>> a compute grid - computation in parallel.
>>
>> <property>
>>   <name>io.bytes.per.checksum</name>
>>   <value>30</value>
>>   <description>The number of bytes per checksum.  Must not be larger than
>>   io.file.buffer.size.</description>
>> </property>
>>
>> <property>
>>   <name>dfs.block.size</name>
>>    <value>30</value>
>>   <description>The default block size for new files.</description>
>> </property>
>>  <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>10</value>
>>   <description>The maximum number of map tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>>
>>  ------------------------------
>>  *From:* Satish Setty (HCL Financial Services)
>>  *Sent:* Monday, January 09, 2012 1:21 PM
>>
>> *To:* Bejoy Ks
>> *Cc:* mapreduce-user@hadoop.apache.org
>> *Subject:* RE: hadoop
>>
>>    Hi Bejoy,
>>
>> In HDFS I have set the block size to 40 bytes. The input data set is as below:
>> data1   (5*8=40 bytes)
>> data2
>> ......
>> data10
>>
>>
>> But I still see only 2 map tasks spawned; there should have been at least 10 map
>> tasks. I am not sure how this works internally. Splitting on line feeds does not
>> work [as you have explained below].
>>
>> Thanks
>>  ------------------------------
>> *From:* Satish Setty (HCL Financial Services)
>> *Sent:* Saturday, January 07, 2012 9:17 PM
>> *To:* Bejoy Ks
>> *Cc:* mapreduce-user@hadoop.apache.org
>> *Subject:* RE: hadoop
>>
>>   Thanks Bejoy - great information - I will try it out.
>>
>> For the problem below I meant a single node with a high configuration: 8 CPUs
>> and 8 GB memory. Hence I am taking an example of 10 data items separated by line
>> feeds. We want to utilize the full power of the machine, so we want at least 10
>> map tasks - each task needs to perform a highly complex mathematical simulation.
>> At present it looks like file data is the only way to specify the number of map
>> tasks, via split size (in bytes), but I would prefer some criterion like a line
>> feed or similar.
>>
>> In the example below, 'data1' corresponds to 5*8=40 bytes; if I have data1
>> .... data10, in theory I should see 10 map tasks with a split size of 40 bytes.
>>
>> How do I perform logging - where is the log (Apache logger) data written?
>> System.outs may not show up as it is a background process.
>>
>> Regards
>>
>>
>>  ------------------------------
>> *From:* Bejoy Ks [bejoy.hadoop@gmail.com]
>> *Sent:* Saturday, January 07, 2012 7:35 PM
>> *To:* Satish Setty (HCL Financial Services)
>> *Cc:* mapreduce-user@hadoop.apache.org
>> *Subject:* Re: hadoop
>>
>>  Hi Satish
>>       Please find some pointers inline
>>
>> Problem - As per the documentation, file splits correspond to the number of map
>> tasks. The file split is governed by the block size - 64 MB in hadoop-0.20.203.0.
>> Where can I find the default settings for various parameters like block size and
>> the number of map/reduce tasks?
>>
>> [Bejoy] I'd rather state it the other way round: the number of map tasks
>> triggered by an MR job is determined by the number of input splits (and the
>> input format). If you use TextInputFormat with default settings, the number of
>> input splits is equal to the number of HDFS blocks occupied by the input. The
>> size of an input split is equal to the HDFS block size by default (64 MB). If
>> you want more splits within one HDFS block, you need to set a value less
>> than 64 MB for mapred.max.split.size.
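>>
>> As a rough worked example (assuming the split computation used by the new-API
>> FileInputFormat, splitSize = max(minSize, min(maxSize, blockSize))): with
>> mapred.min.split.size=0, mapred.max.split.size=40 and a 64 MB block size, the
>> split size becomes max(0, min(40, 67108864)) = 40 bytes, so a 54-byte input
>> file would be cut into roughly two splits and hence two map tasks.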
>>
>> You can find pretty much all default configuration values from the
>> downloaded .tar at
>> hadoop-0.20.*/src/mapred/mapred-default.xml
>> hadoop-0.20.*/src/hdfs/hdfs-default.xml
>> hadoop-0.20.*/src/core/core-default.xml
>>
>> If you want to alter some of these values then you can provide the same
>> in
>> $HADOOP_HOME/conf/mapred-site.xml
>> $HADOOP_HOME/conf/hdfs-site.xml
>> $HADOOP_HOME/conf/core-site.xml
>>
>> The values provided in *-site.xml override the corresponding values in
>> *-default.xml. If a property is marked final in a site file, it cannot be
>> overridden further down the line (for example, at the job level).
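>>
>> For illustration, a property marked final in $HADOOP_HOME/conf/mapred-site.xml
>> would look like the sketch below (the value shown is only an example):
>>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>8</value>
>>   <!-- final prevents jobs from overriding this value -->
>>   <final>true</final>
>> </property>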
>>
>> I require at least 10 map tasks, which is the same as the number of "line feeds".
>> Each corresponds to a complex calculation to be done by a map task, so I can
>> have optimal CPU utilization - 8 CPUs.
>>
>> [Bejoy] Hadoop is a good choice for processing large amounts of data. It is
>> not wise to choose one mapper per record/line in a file, as creating a map
>> task is itself expensive, with JVM spawning and so on. Currently you may
>> have 10 records in your input, but I believe you are just testing Hadoop in a
>> dev environment; in production that wouldn't be the case - there could be n
>> files having m records each, and this m can be in the millions (just assuming
>> based on my experience). On larger data sets you may not need to split on line
>> boundaries. There can be multiple lines in a file, and if you use
>> TextInputFormat, just one line is processed by a map task at an instant.
>> If you have n map tasks, then n lines could be getting processed at any
>> instant during the map phase, one by each map task. On larger
>> data volumes, map tasks are spawned on specific nodes primarily based on
>> data locality, then on available task slots on the data-local node, and so on.
>> It is possible that if you have a 10-node cluster and 10 HDFS blocks
>> corresponding to an input file, and all the blocks are present
>> only on 8 nodes with sufficient task slots available on all 8,
>> then the tasks for your job may be executed on those 8 nodes alone instead of
>> 10. So there are chances that there won't be 100% balanced CPU utilization
>> across the nodes in a cluster.
>>                I'm not really sure how you can spawn map tasks based on
>> line feeds in a file. Let us wait for others to comment on this.
>>            Also, if you're using map reduce for parallel computation alone,
>> then make sure you set the number of reducers to zero; with that you can
>> save a lot of time that would otherwise be spent on the sort and shuffle
>> phases.
>> (-D mapred.reduce.tasks=0)
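>>
>> As a rough sketch, a map-only driver in the 0.20 mapred API could look like the
>> following (IdentityMapper just stands in for your own computation mapper; the
>> class name and paths are placeholders, not something from your job):
>>
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.FileInputFormat;
>> import org.apache.hadoop.mapred.FileOutputFormat;
>> import org.apache.hadoop.mapred.JobClient;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.TextInputFormat;
>> import org.apache.hadoop.mapred.TextOutputFormat;
>> import org.apache.hadoop.mapred.lib.IdentityMapper;
>>
>> public class MapOnlyDriver {
>>   public static void main(String[] args) throws Exception {
>>     JobConf conf = new JobConf(MapOnlyDriver.class);
>>     conf.setJobName("map-only-simulation");
>>     conf.setInputFormat(TextInputFormat.class);     // one line per map() call
>>     conf.setOutputFormat(TextOutputFormat.class);
>>     conf.setMapperClass(IdentityMapper.class);      // replace with your simulation mapper
>>     conf.setOutputKeyClass(LongWritable.class);     // TextInputFormat keys are byte offsets
>>     conf.setOutputValueClass(Text.class);
>>     conf.setNumReduceTasks(0);                      // map-only: no sort/shuffle/reduce phase
>>     // Note: conf.setNumMapTasks(n) would only be a hint; the actual map task
>>     // count comes from the number of input splits.
>>     FileInputFormat.setInputPaths(conf, new Path(args[0]));
>>     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>     JobClient.runJob(conf);
>>   }
>> }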
>>
>>  The behaviour of map tasks looks strange to me, as sometimes if I set the
>> number of map tasks in the program via jobconf.set it takes 2 or 8.
>>
>> [Bejoy] There is no default value for the number of map tasks; it is
>> determined by the input splits and the input format used by your job. You
>> cannot set the number of map tasks - even if you set mapred.map.tasks at the
>> job level, it is not considered. But you can definitely specify the
>> number of reduce tasks at the job level with job.setNumReduceTasks(n) or
>> mapred.reduce.tasks. If not set, it takes the default value for reduce
>> tasks specified in the conf files.
>>
>>
>> I see some files like part-00001...
>> Are they partitions?
>>
>> [Bejoy] The part-000* files correspond to reducers. You'd have n files
>> if you have n reducers, as one reducer produces one output file.
>>
>> Hope it helps!..
>>
>> Regards
>> Bejoy.KS
>>
>>
>> On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services) <
>> Satish.Setty@hcl.com> wrote:
>>
>>>  Hi Bejoy,
>>>
>>> Just finished installation and tested sample applications.
>>>
>>> Problem - As per the documentation, file splits correspond to the number of map
>>> tasks. The file split is governed by the block size - 64 MB in hadoop-0.20.203.0.
>>> Where can I find the default settings for various parameters like block size and
>>> the number of map/reduce tasks?
>>>
>>> Is it possible to control the file split by "line feed - \n"? I tried giving
>>> the sample input below -> jobconf -> TextInputFormat
>>>
>>> date1
>>> date2
>>> date3
>>> .......
>>> ......
>>> date10
>>>
>>> But when I run it, I see the number of map tasks = 2 or 1.
>>> I require at least 10 map tasks, which is the same as the number of "line feeds".
>>> Each corresponds to a complex calculation to be done by a map task, so I can
>>> have optimal CPU utilization - 8 CPUs.
>>>
>>> The behaviour of map tasks looks strange to me, as sometimes if I set the
>>> number of map tasks in the program via jobconf.set it takes 2 or 8. I also see
>>> some files like part-00001...
>>> Are they partitions?
>>>
>>> Thanks
>>>  ------------------------------
>>> *From:* Satish Setty (HCL Financial Services)
>>> *Sent:* Friday, January 06, 2012 12:29 PM
>>> *To:* bejoy.hadoop@gmail.com
>>> *Subject:* FW: hadoop
>>>
>>>
>>>    Thanks Bejoy. Extremely useful information. We will try it and come
>>> back. Regarding the web application [jobtracker web UI] - does this require
>>> deployment, or does an application server container come built in with Hadoop?
>>>
>>> Regards
>>>
>>>  ------------------------------
>>> *From:* Bejoy Ks [bejoy.hadoop@gmail.com]
>>> *Sent:* Friday, January 06, 2012 12:54 AM
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Subject:* Re: hadoop
>>>
>>>     Hi Satish
>>>         Please find some pointers in line
>>>
>>> (a) How do we know the number of map tasks spawned? Can this be
>>> controlled? We notice only 4 JVMs running on a single node - namenode,
>>> datanode, jobtracker, tasktracker. As we understand it, one map task is spawned
>>> per split, so we should see a corresponding increase in the number of JVMs.
>>>
>>> [Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode
>>> are the default Hadoop daemons; they are not dependent on your tasks. Your
>>> custom tasks are launched in separate JVMs. You can control
>>> the maximum number of mappers on each tasktracker at an instant by setting
>>> mapred.tasktracker.map.tasks.maximum. By default all tasks (map or
>>> reduce) are executed in individual JVMs, and once a task is completed its
>>> JVM is destroyed. You are right, by default one map task is launched per
>>> input split.
>>> Just check the jobtracker web UI (
>>> http://nameNodeHostName:50030/jobtracker.jsp); it gives you
>>> all details on the job, including the number of map tasks spawned by the job.
>>> If you want to run multiple tasktracker and datanode instances on the
>>> same machine, you need to ensure that there are no port conflicts.
>>>
>>> (b) Our mapper class has to perform complex computations - it has plenty
>>> of dependent jars, so how do we add all the jars to the classpath while running
>>> the application? Since we need to perform parallel computations, we need
>>> many map tasks running in parallel on different data, all on the same
>>> machine in different JVMs.
>>>
>>> [Bejoy] If these dependent jars are used by almost all your applications,
>>> include them in the classpath of all your nodes (in your case just one
>>> node). Alternatively, you can use the -libjars option while submitting your job.
>>> For more details refer
>>>
>>> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
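>>>
>>> For example (a sketch, assuming your driver implements Tool and is run through
>>> ToolRunner so that generic options like -libjars are parsed; the jar and class
>>> names are placeholders):
>>>
>>> hadoop jar myjob.jar MyDriver -libjars dep1.jar,dep2.jar /input /output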
>>>
>>> (c) How does the data split happen? JobClient does not mention data
>>> splits. As we understand it, we format the distributed file system, run
>>> start-all.sh and then "hadoop fs -put". Does this write data to all
>>> datanodes? We are unable to see the physical location. How does the split
>>> happen from this HDFS source?
>>>
>>> [Bejoy] Input files are split into blocks during the copy into HDFS itself;
>>> the size of each block is determined by the Hadoop configuration of your
>>> cluster. The namenode decides on which datanodes these blocks (including
>>> their replicas) are to be placed, and these details are passed on to the client.
>>> The client copies each block to one datanode, and from that datanode the
>>> block is replicated to the other datanodes. The splitting of a file happens at
>>> the HDFS API level.
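>>>
>>> To see where the blocks of a file physically landed, one option is the HDFS
>>> fsck tool (the path here is just the example directory from this thread):
>>>
>>> hadoop fsck /user/soruser/hellodir1 -files -blocks -locations
>>>
>>> This lists each block of the file and the datanodes holding its replicas; the
>>> raw block files themselves live under the dfs.data.dir directory on each datanode.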
>>>
>>>  thanks
>>>
>>>
>>
>>
>