Posted to general@hadoop.apache.org by Something Something <ma...@gmail.com> on 2010/03/28 21:10:23 UTC
Improving performance of Hadoop job
When I run a Hadoop job from a command shell it takes 21 seconds, but when I
run the same job from within the Eclipse IDE it takes 1 second! I am trying to
figure out what the reason is. The output is shown below:
1) It appears the FILE_BYTES_READ, HDFS_BYTES_WRITTEN, etc. values are
totally different. It sounds like that is what is causing the performance
degradation.
2) I tried setting -Xmx1000m in the command shell, but that doesn't help.
Are there other JVM arguments that affect the values of these counters
(FILE_BYTES_READ, etc.)?
Any hints would be GREATLY appreciated. Thanks.
*Command Shell log:*
INFO: Starting top scores writing job....
10/03/28 11:19:49 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
10/03/28 11:19:50 INFO mapred.JobClient: Running job: job_201003281008_0006
10/03/28 11:19:51 INFO mapred.JobClient: map 0% reduce 0%
10/03/28 11:19:59 INFO mapred.JobClient: map 100% reduce 0%
10/03/28 11:20:04 INFO mapred.JobClient: map 100% reduce 100%
10/03/28 11:20:10 INFO mapred.JobClient: Job complete: job_201003281008_0006
10/03/28 11:20:10 INFO mapred.JobClient: Counters: 16
10/03/28 11:20:10 INFO mapred.JobClient: Job Counters
10/03/28 11:20:10 INFO mapred.JobClient: Launched reduce tasks=1
10/03/28 11:20:10 INFO mapred.JobClient: Launched map tasks=1
10/03/28 11:20:10 INFO mapred.JobClient: Data-local map tasks=1
10/03/28 11:20:10 INFO mapred.JobClient: FileSystemCounters
*10/03/28 11:20:10 INFO mapred.JobClient: FILE_BYTES_READ=1736
10/03/28 11:20:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3504
10/03/28 11:20:10 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3706
*10/03/28 11:20:10 INFO mapred.JobClient: Map-Reduce Framework
10/03/28 11:20:10 INFO mapred.JobClient: Reduce input groups=87
10/03/28 11:20:10 INFO mapred.JobClient: Combine output records=0
10/03/28 11:20:10 INFO mapred.JobClient: Map input records=100
10/03/28 11:20:10 INFO mapred.JobClient: Reduce shuffle bytes=0
10/03/28 11:20:10 INFO mapred.JobClient: Reduce output records=100
10/03/28 11:20:10 INFO mapred.JobClient: Spilled Records=200
10/03/28 11:20:10 INFO mapred.JobClient: Map output bytes=1530
10/03/28 11:20:10 INFO mapred.JobClient: Combine input records=0
10/03/28 11:20:10 INFO mapred.JobClient: Map output records=100
10/03/28 11:20:10 INFO mapred.JobClient: Reduce input records=100
*Total time for writing scores: 21 secs.*
*Eclipse console log:*
INFO: Starting top scores writing job....
10/03/28 11:29:15 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
10/03/28 11:29:15 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
10/03/28 11:29:15 WARN mapred.JobClient: No job jar file set. User classes
may not be found. See JobConf(Class) or JobConf#setJar(String).
10/03/28 11:29:15 INFO mapred.JobClient: Running job: job_local_0006
10/03/28 11:29:15 INFO mapred.MapTask: io.sort.mb = 100
10/03/28 11:29:16 INFO mapred.MapTask: data buffer = 79691776/99614720
10/03/28 11:29:16 INFO mapred.MapTask: record buffer = 262144/327680
10/03/28 11:29:16 INFO mapred.MapTask: Starting flush of map output
10/03/28 11:29:16 INFO mapred.MapTask: Finished spill 0
10/03/28 11:29:16 INFO mapred.TaskRunner: Task:attempt_local_0006_m_000000_0
is done. And is in the process of commiting
10/03/28 11:29:16 INFO mapred.LocalJobRunner:
10/03/28 11:29:16 INFO mapred.TaskRunner: Task
'attempt_local_0006_m_000000_0' done.
10/03/28 11:29:16 INFO mapred.LocalJobRunner:
10/03/28 11:29:16 INFO mapred.Merger: Merging 1 sorted segments
10/03/28 11:29:16 INFO mapred.Merger: Down to the last merge-pass, with 1
segments left of total size: 1732 bytes
10/03/28 11:29:16 INFO mapred.LocalJobRunner:
10/03/28 11:29:16 INFO mapred.TaskRunner: Task:attempt_local_0006_r_000000_0
is done. And is in the process of commiting
10/03/28 11:29:16 INFO mapred.LocalJobRunner:
10/03/28 11:29:16 INFO mapred.TaskRunner: Task attempt_local_0006_r_000000_0
is allowed to commit now
10/03/28 11:29:16 INFO output.FileOutputCommitter: Saved output of task
'attempt_local_0006_r_000000_0' to markers/results
10/03/28 11:29:16 INFO mapred.LocalJobRunner: reduce > reduce
10/03/28 11:29:16 INFO mapred.TaskRunner: Task
'attempt_local_0006_r_000000_0' done.
10/03/28 11:29:16 INFO mapred.JobClient: map 100% reduce 100%
10/03/28 11:29:16 INFO mapred.JobClient: Job complete: job_local_0006
10/03/28 11:29:16 INFO mapred.JobClient: Counters: 14
10/03/28 11:29:16 INFO mapred.JobClient: FileSystemCounters
*10/03/28 11:29:16 INFO mapred.JobClient: FILE_BYTES_READ=95848
10/03/28 11:29:16 INFO mapred.JobClient: HDFS_BYTES_READ=277286
10/03/28 11:29:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=304886
10/03/28 11:29:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=209112
*10/03/28 11:29:16 INFO mapred.JobClient: Map-Reduce Framework
10/03/28 11:29:16 INFO mapred.JobClient: Reduce input groups=87
10/03/28 11:29:16 INFO mapred.JobClient: Combine output records=0
10/03/28 11:29:16 INFO mapred.JobClient: Map input records=100
10/03/28 11:29:16 INFO mapred.JobClient: Reduce shuffle bytes=0
10/03/28 11:29:16 INFO mapred.JobClient: Reduce output records=100
10/03/28 11:29:16 INFO mapred.JobClient: Spilled Records=200
10/03/28 11:29:16 INFO mapred.JobClient: Map output bytes=1530
10/03/28 11:29:16 INFO mapred.JobClient: Combine input records=0
10/03/28 11:29:16 INFO mapred.JobClient: Map output records=100
10/03/28 11:29:16 INFO mapred.JobClient: Reduce input records=100
*Total time for writing scores: 1 secs.*
Re: Improving performance of Hadoop job
Posted by Vukasin Toroman <vu...@toroman.name>.
Hi,
just a guess: maybe it has something to do with the job being run by the "LocalJobRunner"? Does that even run on the cluster? (I don't use Eclipse myself, so I'm not sure.)
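If I remember correctly, the client falls back to the LocalJobRunner whenever mapred.job.tracker is unset or set to "local" in the configuration it picks up. So if Eclipse does not have your cluster's conf directory on its classpath, the job runs in-process even though the shell run goes to the cluster. Something like this in mapred-site.xml (host and port here are just placeholders) points the client at a real JobTracker:

```xml
<!-- mapred-site.xml: if this property is missing or "local",
     the job runs in-process via LocalJobRunner -->
<property>
  <name>mapred.job.tracker</name>
  <value>jobtrackerhost:9001</value>
</property>
```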
Greetz,
Vukasin
Re: Improving performance of Hadoop job
Posted by Abhishek Verma <ve...@gmail.com>.
The log shows that Eclipse is running the job locally:
10/03/28 11:29:16 INFO mapred.*LocalJobRunner*:
Hence the difference in the run times. Normally, Hadoop jobs have a start-up
time of at least around 10 seconds.
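To submit to the cluster from Eclipse instead, the client configuration has to point at the JobTracker and HDFS, and a job jar has to be set. Roughly like the sketch below (I am going from memory on the property names for 0.20, and TopScores is just a placeholder for your job class):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: host names and ports are placeholders for your cluster.
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://namenodehost:9000"); // real HDFS, not the local FS
conf.set("mapred.job.tracker", "jobtrackerhost:9001");   // real JobTracker, not "local"

Job job = new Job(conf, "top scores");
job.setJarByClass(TopScores.class);  // also fixes the "No job jar file set" warning
```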
-Abhishek.