Posted to common-user@hadoop.apache.org by Jun Young Kim <ju...@gmail.com> on 2011/02/22 09:57:21 UTC

How can I choose the proper block size if the input file size is dynamic?

hi, all.

I know the dfs.blocksize setting can affect Hadoop performance.

In my case, I have thousands of directories containing many input files
of different sizes (from 10 KB to 1 GB).

In this case, how can I choose dfs.blocksize to get the best performance?

11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to 
process : *15407*
11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is 
deprecated. Instead, use mapreduce.job.maps
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits:*15411*
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following 
namenodes' delegation tokens:null
11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
11/02/22 17:45:55 INFO mapreduce.Job:  map 0% reduce 0%
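
(For reference: as far as I understand, dfs.blocksize is only the cluster-wide
default, and a block size can also be chosen per file when it is written. A
minimal sketch; the path, replication factor, and sizes below are made-up
examples.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create(path, overwrite, bufferSize, replication, blockSize)
        long blockSize = 128L * 1024 * 1024;                // 128 MB for this file only
        FSDataOutputStream out = fs.create(
                new Path("/user/juneng/big-input.dat"),     // made-up path
                true, 4096, (short) 3, blockSize);
        out.writeBytes("example payload\n");
        out.close();
        fs.close();
    }
}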

thanks.

-- 
Junyoung Kim (juneng603@gmail.com)


Re: How can I choose the proper block size if the input file size is dynamic?

Posted by Jun Young Kim <ju...@gmail.com>.
Currently, I have a problem reducing the output of the mappers.

11/02/23 09:57:45 INFO input.FileInputFormat: Total input paths to 
process : 4157
11/02/23 09:57:47 WARN conf.Configuration: mapred.map.tasks is 
deprecated. Instead, use mapreduce.job.maps
11/02/23 09:57:47 INFO mapreduce.JobSubmitter: number of splits:4309

The input file sizes vary a lot, so Hadoop creates a large number of splits
to map them.
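
(A workaround I am considering: pack many small files into fewer, larger
splits with a CombineFileInputFormat-style input format. A rough sketch,
assuming a Hadoop version that ships CombineTextInputFormat; older releases
only have the abstract CombineFileInputFormat to subclass, and the split size
is just an example.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CombinedInputSetup {
    // Point an existing job at the input directory so that many small files
    // are packed into a bounded number of larger splits, instead of getting
    // at least one split per file.
    public static void configure(Job job, String inputDir) throws Exception {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly 256 MB (tune to the cluster).
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(inputDir));
    }
}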

Here is the result of my M/R job:

Kind      Total   Successful   Failed   Killed   Start Time            Finish Time
Setup        1         1          0        0     22-2-2011 22:10:07    22-2-2011 22:10:08 (1sec)
Map       4309      4309          0        0     22-2-2011 22:10:11    22-2-2011 22:18:51 (8mins, 40sec)
Reduce       5         0          4        1     22-2-2011 22:11:00    22-2-2011 22:36:51 (25mins, 50sec)
Cleanup      1         1          0        0     22-2-2011 22:36:47    22-2-2011 22:37:51 (1min, 4sec)

(Total = successful + failed + killed tasks)



In the Reduce step there are failed/killed tasks. The reason for them is this:

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
    at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
    at org.apache.hadoop.mapreduce.task.re

Yes, it comes from the shuffle phase.

I think the problem with my job is too many map tasks and only one reduce
task.

To fix this problem, should I reduce the number of map tasks?
To do that, do I need to concatenate all my input files into a
single file?
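
(A rough sketch of the knobs usually adjusted for this kind of shuffle
OutOfMemoryError: more reducers, more task heap, and a smaller in-memory
shuffle buffer. Property names vary between Hadoop versions, and the job
name and values below are made up.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        // More heap per task JVM (newer releases also accept
        // mapreduce.reduce.java.opts to target reducers only).
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        // Hold less of the fetched map output in memory during the shuffle; the
        // key is mapreduce.reduce.shuffle.input.buffer.percent on newer releases.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.50f);

        Job job = new Job(conf, "shuffle-tuned-job");   // made-up job name
        // Spread the reduce work across more than one reducer.
        job.setNumReduceTasks(8);
        return job;
    }
}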

thanks.


Junyoung Kim (juneng603@gmail.com)


On 02/22/2011 10:20 PM, Tish Heyssel wrote:
> Yeah,
>
> That's not gonna work.  You need to pre-process your input files to
> concatenate them into larger files and then set your dfs.blocksize
> accordingly.  Otherwise your jobs will be slow, slow, slow.
>
> tish
>
> On Tue, Feb 22, 2011 at 3:57 AM, Jun Young Kim<ju...@gmail.com>  wrote:
>
>> hi, all.
>>
>> I know the dfs.blocksize setting can affect Hadoop performance.
>>
>> In my case, I have thousands of directories containing many input files
>> of different sizes (from 10 KB to 1 GB).
>>
>> In this case, how can I choose dfs.blocksize to get the best performance?
>>
>> 11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to process
>> : *15407*
>> 11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is deprecated.
>> Instead, use mapreduce.job.maps
>> 11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits:*15411*
>> 11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following
>> namenodes' delegation tokens:null
>> 11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
>> 11/02/22 17:45:55 INFO mapreduce.Job:  map 0% reduce 0%
>>
>> thanks.
>>
>> --
>> Junyoung Kim (juneng603@gmail.com)
>>
>>
>

Re: How can I choose the proper block size if the input file size is dynamic?

Posted by Tish Heyssel <ti...@gmail.com>.
Yeah,

That's not gonna work.  You need to pre-process your input files to
concatenate them into larger files and then set your dfs.blocksize
accordingly.  Otherwise your jobs will be slow, slow, slow.
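
A rough sketch of that pre-processing step, with made-up paths: pack the
small files into one large SequenceFile keyed by the original file name.
(Reading each whole file into memory is fine for small files; very large
ones would need streaming instead.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path("/user/juneng/small-files");   // made-up path
        Path target = new Path("/user/juneng/packed.seq");    // made-up path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, target, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(srcDir)) {
                if (status.isDir()) continue;      // isDirectory() on newer versions
                byte[] data = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(0, data);
                } finally {
                    in.close();
                }
                // Key = original file name, value = its raw bytes.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
        fs.close();
    }
}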

tish

On Tue, Feb 22, 2011 at 3:57 AM, Jun Young Kim <ju...@gmail.com> wrote:

> hi, all.
>
> I know the dfs.blocksize setting can affect Hadoop performance.
>
> In my case, I have thousands of directories containing many input files
> of different sizes (from 10 KB to 1 GB).
>
> In this case, how can I choose dfs.blocksize to get the best performance?
>
> 11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to process
> : *15407*
> 11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is deprecated.
> Instead, use mapreduce.job.maps
> 11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits:*15411*
> 11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following
> namenodes' delegation tokens:null
> 11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
> 11/02/22 17:45:55 INFO mapreduce.Job:  map 0% reduce 0%
>
> thanks.
>
> --
> Junyoung Kim (juneng603@gmail.com)
>
>


-- 
Tish Heyssel
Peterson Burnett Technologies
Please use: tishhey@gmail.com
Alternate email: tish@speakeasy.net
pmh@pbtechnologies.com