Posted to common-user@hadoop.apache.org by Jun Young Kim <ju...@gmail.com> on 2011/02/22 09:57:21 UTC
How can I choose the proper block size if the input file size is
dynamic?
hi, all.
I know the dfs.blocksize key can affect Hadoop's performance.
In my case, I have thousands of directories containing many input
files of very different sizes (from 10 KB to 1 GB).
In this case, how can I choose dfs.blocksize to get the best performance?
11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to process : *15407*
11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits:*15411*
11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
11/02/22 17:45:55 INFO mapreduce.Job: map 0% reduce 0%
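As an aside on why 15,407 input paths can turn into 15,411 splits: with the default FileInputFormat, every file yields at least one split, and a file larger than one block is cut into roughly ceil(size / blocksize) splits. A rough sketch of that arithmetic (the file sizes below are invented for illustration, not taken from this job):

```python
import math

def count_splits(file_sizes, block_size):
    # Approximation of FileInputFormat.getSplits(): each file yields at
    # least one split, and files larger than one block are cut into
    # roughly ceil(size / block_size) splits (this ignores the ~10%
    # SPLIT_SLOP tolerance Hadoop allows on the last fragment).
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

# Hypothetical mix: 15,403 small files plus 4 files of ~100 MB each,
# against a 64 MB block -> 15,403 + 4 * 2 = 15,411 splits.
block = 64 * 1024 * 1024
sizes = [10 * 1024] * 15403 + [100 * 1024 * 1024] * 4
print(count_splits(sizes, block))  # 15411
```

So the split count is driven far more by the number of files than by dfs.blocksize when most files are smaller than one block.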
thanks.
--
Junyoung Kim (juneng603@gmail.com)
Re: How can I choose the proper block size if the input file size
is dynamic?
Posted by Jun Young Kim <ju...@gmail.com>.
Currently, I have a problem reducing the output of the mappers.
11/02/23 09:57:45 INFO input.FileInputFormat: Total input paths to process : 4157
11/02/23 09:57:47 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/02/23 09:57:47 INFO mapreduce.JobSubmitter: number of splits:4309
Input file sizes are very dynamic now; based on these files, Hadoop
creates many splits to map them.
Here is the result of my M/R job:
Kind     Total Tasks  Successful  Failed  Killed  Start Time          Finish Time
Setup    1            1           0       0       22-2-2011 22:10:07  22-2-2011 22:10:08 (1 sec)
Map      4309         4309        0       0       22-2-2011 22:10:11  22-2-2011 22:18:51 (8 mins, 40 sec)
Reduce   5            0           4       1       22-2-2011 22:11:00  22-2-2011 22:36:51 (25 mins, 50 sec)
Cleanup  1            1           0       0       22-2-2011 22:36:47  22-2-2011 22:37:51 (1 min, 4 sec)
In the Reduce step there are failed/killed tasks. The reason for them
is this:
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
    at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
    at org.apache.hadoop.mapreduce.task.re
Yes, it's from the shuffle procedure.
I think the problem with my job is too many map tasks and only one
reduce task.
To fix this problem, should I reduce the number of map tasks?
To do that, do I perhaps need to concatenate all my input files into a
single file?
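The stack trace shows the OOM happening while MergeManager reserves an in-memory buffer (a BoundedByteArrayOutputStream) for a fetched map output. A back-of-the-envelope model of that decision, where the property names and default fractions are assumptions based on the Hadoop shuffle of that era and should be checked against your own configuration:

```python
def shuffle_fetch_destination(map_output_bytes, reduce_heap_bytes,
                              input_buffer_percent=0.70,
                              single_shuffle_limit_percent=0.25):
    # Rough model: the reducer budgets heap * input_buffer_percent for
    # in-memory shuffle; a single fetched map output is buffered in
    # memory only if it fits under a fraction of that budget, otherwise
    # it is spilled straight to disk. Many large segments arriving at
    # once can still exhaust the heap, which matches the OOM above.
    memory_limit = reduce_heap_bytes * input_buffer_percent
    max_single_segment = memory_limit * single_shuffle_limit_percent
    return "memory" if map_output_bytes < max_single_segment else "disk"

# With a 1 GB reduce heap, a 1 MB map output is held in memory while a
# 200 MB output goes to disk.
print(shuffle_fetch_destination(1 * 1024**2, 1024**3))    # memory
print(shuffle_fetch_destination(200 * 1024**2, 1024**3))  # disk
```

Given that, raising the reduce JVM heap or lowering mapreduce.reduce.shuffle.input.buffer.percent (pushing more segments to disk) are the usual levers, independent of cutting the map-task count.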
thanks.
Junyoung Kim (juneng603@gmail.com)
On 02/22/2011 10:20 PM, Tish Heyssel wrote:
> Yeah,
>
> That's not gonna work. You need to pre-process your input files to
> concatenate them into larger files and then set your dfs.blocksize
> accordingly. Otherwise your jobs will be slow, slow slow.
>
> tish
Re: How can I choose the proper block size if the input file size
is dynamic?
Posted by Tish Heyssel <ti...@gmail.com>.
Yeah,
That's not gonna work. You need to pre-process your input files to
concatenate them into larger files, and then set your dfs.blocksize
accordingly. Otherwise your jobs will be slow, slow, slow.
tish
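One way to plan that pre-processing step is simple first-fit packing of the small files against the block size, so each concatenated output file fills roughly one block. This is only an illustrative sketch of the grouping logic, not anything from Hadoop itself (an alternative with no pre-processing is CombineFileInputFormat, which packs multiple small files into one split):

```python
def pack_files(file_sizes, target_size):
    # Greedy first-fit decreasing: group files so each concatenated
    # output stays at or under target_size (e.g. the HDFS block size),
    # cutting the number of map inputs. Files larger than target_size
    # end up alone in their own group.
    groups = []
    for size in sorted(file_sizes, reverse=True):
        for g in groups:
            if g["total"] + size <= target_size:
                g["total"] += size
                g["files"].append(size)
                break
        else:
            groups.append({"total": size, "files": [size]})
    return groups

# 100 files of 10 units against a target of 100 pack into 10 groups,
# i.e. 10 map inputs instead of 100.
print(len(pack_files([10] * 100, 100)))  # 10
```

Each resulting group would then be concatenated into one HDFS file (for example with SequenceFile, or hadoop fs -getmerge for plain text) before running the job.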
--
Tish Heyssel
Peterson Burnett Technologies
Please use: tishhey@gmail.com
Alternate email: tish@speakeasy.net
pmh@pbtechnologies.com