Posted to common-user@hadoop.apache.org by Cubic <cu...@gmail.com> on 2009/11/26 13:02:20 UTC

Processing 10MB files in Hadoop

Hi list.

I have small files containing data that has to be processed. A file
can be small, even down to 10MB (but it can also be 100-600MB large)
and contains at least 30000 records to be processed.
Processing one record can take 30 seconds to 2 minutes. My cluster is
about 10 nodes. Each node has 16 cores.

Can anybody give an idea about how to deal with these small files? It
is not quite a common Hadoop task, I know. For example, how many map
tasks should I set in this case?

Re: Processing 10MB files in Hadoop

Posted by CubicDesign <cu...@gmail.com>.


> Sorry for deviating from the question, but I am curious to know what "core"
> here refers to?
>

http://en.wikipedia.org/wiki/Multi-core

Re: Processing 10MB files in Hadoop

Posted by Aaron Kimball <aa...@cloudera.com>.
By default you get at least one task per file; if any file is bigger than a
block, then that file is broken up into N tasks where each is one block
long. Not sure what you mean by "properly calculate" -- as long as you have
more tasks than you have cores, then you'll definitely have work for every
core to do; having more, finer-grained tasks will also let nodes that
get "small" tasks complete many of them while other cores are stuck with
the "heavier" tasks.

If you call setNumMapTasks() with a higher number of tasks than the
InputFormat creates (via the algorithm above), then it should create
additional tasks by dividing files up into smaller chunks (which may be
sub-block-sized).
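
For illustration, here is a minimal sketch of that hint on the old (0.20-era)
mapred API; the class name, paths, and the 1760 figure are placeholders drawn
from numbers mentioned later in this thread, not code from the posters:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ManyMapsJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ManyMapsJob.class);
    conf.setJobName("small-files");  // placeholder job name

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // A hint, not a hard limit: a higher hint lowers the InputFormat's goal
    // split size, so small files can be divided into sub-block-sized splits.
    conf.setNumMapTasks(1760);

    JobClient.runJob(conf);  // identity map/reduce by default in this sketch
  }
}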

As for where you should run your computation... I don't know that the "map"
and "reduce" phases are really "optimized" for computation in any particular
way. It's just a data motion thing. (At the end of the day, it's your code
doing the processing on either side of the fence, which should dominate the
execution time.) If you use an identity mapper with a pseudo-random key to
spray the data into a bunch of reduce partitions, then you'll get a bunch of
reducers each working on a hopefully-evenly-sized slice of the data. So the
map tasks will quickly read from the original source data and forward the
workload along to the reducers which do the actual heavy lifting. The cost
of this approach is that you have to pay for the time taken to transfer the
data from the mapper nodes to the reducer nodes and sort by key when it gets
there. If you're only working with 600 MB of data, this is probably
negligible. The advantages of doing your computation in the reducers are:

1) You can directly control the number of reducer tasks and set this equal
to the number of cores in your cluster.
2) You can tune your partitioning algorithm such that all reducers get
roughly equal workload assignments, if there appears to be some sort of skew
in the dataset.

The tradeoff is that you have to ship all the data to the reducers before
computation starts, which sacrifices data locality and involves an
"intermediate" data set of the same size as the input data set. If this is
in the range of hundreds of GB or north, then this can be very
time-consuming -- so it doesn't scale terribly well. Of course, by the time
you've got several hundred GB of data to work with, your current workload
imbalance issues should be moot anyway.
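
To make the "spray" pattern concrete, here is a rough sketch on the old mapred
API; the record handling (process() in particular) is a stand-in, not anything
from this thread:

import java.io.IOException;
import java.util.Iterator;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SprayJob {

  // Near-identity map: tag each record with a pseudo-random key so the
  // default HashPartitioner spreads records evenly over the reducers.
  public static class SprayMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random random = new Random();
    public void map(LongWritable offset, Text record,
                    OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new IntWritable(random.nextInt(Integer.MAX_VALUE)), record);
    }
  }

  // The expensive per-record work happens on the reduce side.
  public static class WorkReducer extends MapReduceBase
      implements Reducer<IntWritable, Text, NullWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> records,
                       OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      while (records.hasNext()) {
        Text processed = process(records.next());  // placeholder for real work
        out.collect(NullWritable.get(), processed);
      }
    }
    private Text process(Text record) { return record; }  // stand-in
  }

  // Driver fragment: roughly one reducer per core (10 nodes * 16 cores).
  public static void configure(JobConf conf) {
    conf.setMapperClass(SprayMapper.class);
    conf.setReducerClass(WorkReducer.class);
    conf.setMapOutputKeyClass(IntWritable.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(160);
  }
}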

- Aaron


On Fri, Nov 27, 2009 at 4:33 PM, CubicDesign <cu...@gmail.com> wrote:

>
>
> Aaron Kimball wrote:
>
>> (Note: this is a tasktracker setting, not a job setting. you'll need to
>> set this on every
>> node, then restart the mapreduce cluster to take effect.)
>>
>>
> Ok. And here is my mistake. I set this to 16 only on the main node, not also
> on the data nodes. Thanks a lot!!!!!!
>
>  Of course, you need to have enough RAM to make sure that all these tasks
>> can
>> run concurrently without swapping.
>>
> No problem!
>
>
>  If your individual records require around a minute each to process as you
>> claimed earlier, you're
>> nowhere near in danger of hitting that particular performance bottleneck.
>>
>>
>>
> I was thinking that if I am under the recommended value of 64MB, Hadoop
> cannot properly calculate the number of tasks.
>

Re: Processing 10MB files in Hadoop

Posted by CubicDesign <cu...@gmail.com>.

Aaron Kimball wrote:
> (Note: this is a tasktracker setting, not a job setting. you'll need to set this on every
> node, then restart the mapreduce cluster to take effect.)
>   
Ok. And here is my mistake. I set this to 16 only on the main node, not
also on the data nodes. Thanks a lot!!!!!!
> Of course, you need to have enough RAM to make sure that all these tasks can
> run concurrently without swapping.
No problem!

> If your individual records require around a minute each to process as you claimed earlier, you're
> nowhere near in danger of hitting that particular performance bottleneck.
>
>   
I was thinking that if I am under the recommended value of 64MB, Hadoop
cannot properly calculate the number of tasks.

Re: Processing 10MB files in Hadoop

Posted by Aaron Kimball <aa...@cloudera.com>.
More importantly: have you told Hadoop to use all your cores?

What is mapred.tasktracker.map.tasks.maximum set to? This defaults to 2. If
you've got 16 cores/node, you should set this to at least 15--16 so that all
your cores are being used. You may need to set this higher, like 20, to
ensure that cores aren't being starved. Measure with ganglia or top to make
sure your CPU utilization is up to where you're satisfied. (Note: this is a
tasktracker setting, not a job setting. you'll need to set this on every
node, then restart the mapreduce cluster to take effect.)
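
For reference, that setting lives in each tasktracker's mapred-site.xml
(hadoop-site.xml on pre-0.20 releases); a sketch, with 16 simply echoing the
16-cores-per-node figure from this thread:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>16</value>
</property>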

Of course, you need to have enough RAM to make sure that all these tasks can
run concurrently without swapping. Swapping will destroy your performance.
Then again, if you bought 16-way machines, presumably you didn't cheap out
in that department :)

100 tasks is not an absurd number. For large data sets (e.g., TB scale), I
have seen several tens of thousands of tasks.

In general, yes, running many tasks over small files is not a good fit for
Hadoop, but 100 is not "many small files" -- you might see some sort of
speed up by coalescing multiple files into a single task, but when you hear
problems with processing many small files, folks are frequently referring to
something like 10,000 files where each file is only a few MB, and the actual
processing per record is extremely cheap. In cases like this, task startup
times severely dominate actual computation time. If your individual records
require around a minute each to process as you claimed earlier, you're
nowhere near in danger of hitting that particular performance bottleneck.

- Aaron


On Thu, Nov 26, 2009 at 12:23 PM, CubicDesign <cu...@gmail.com> wrote:

>
>
>  Are the record processing steps bound by a local machine resource - cpu,
>> disk io or other?
>>
>>
> Some disk I/O. Not so much compared with the CPU. Basically it is CPU
> bound. This is why each machine has 16 cores.
>
>  What I often do when I have lots of small files to handle is use the
>> NlineInputFormat,
>>
> Each file contains a complete/independent set of records. I cannot mix the
> data resulting from processing two different files.
>
>
> ---------
> Ok. I think I need to re-explain my problem :)
> While running jobs on these small files, the computation time was almost 5
> times longer than expected. It looks like the job was affected by the number
> of map tasks that I have (100). I don't know what the best parameters are in
> my case (10MB files).
>
> I have zero reduce tasks.
>
>
>

Re: Processing 10MB files in Hadoop

Posted by CubicDesign <cu...@gmail.com>.

> Are the record processing steps bound by a local machine resource - cpu,
> disk io or other?
>   
Some disk I/O. Not so much compared with the CPU. Basically it is CPU
bound. This is why each machine has 16 cores.
> What I often do when I have lots of small files to handle is use the
> NlineInputFormat,
Each file contains a complete/independent set of records. I cannot mix
the data resulting from processing two different files.


---------
Ok. I think I need to re-explain my problem :)
While running jobs on these small files, the computation time was almost
5 times longer than expected. It looks like the job was affected by the
number of map tasks that I have (100). I don't know what the best
parameters are in my case (10MB files).

I have zero reduce tasks.



Re: Processing 10MB files in Hadoop

Posted by CubicDesign <cu...@gmail.com>.
30000 records in 10MB files.
Files can vary, and the number of records can also vary.




> If the data is 10MB and you have 30k records, and it takes ~2 mins to
> process each record, I'd suggest using map to distribute the data across
> several reducers then do the actual processing on reduce.
Hmmm... Good idea. Thanks. But is 'Reduce' optimized to do the heavy 
part of the computation?

Re: Processing 10MB files in Hadoop

Posted by Patrick Angeles <pa...@gmail.com>.
What does the data look like?

You mention 30k records, is that for 10MB or for 600MB, or do you have a
constant 30k records with vastly varying file sizes?

If the data is 10MB and you have 30k records, and it takes ~2 mins to
process each record, I'd suggest using map to distribute the data across
several reducers then do the actual processing on reduce.



On Fri, Nov 27, 2009 at 7:07 PM, CubicDesign <cu...@gmail.com> wrote:

> Ok. I have set the number of maps to about 1760 (11 nodes * 16 cores/node *
> 10 as recommended by the Hadoop documentation) and my job still takes several
> hours to run instead of one.
>
> Can the overhead added by Hadoop be that big? I mean I have over 30000
> small tasks (about one minute each), each one starting its own JVM.
>
>
>

Re: Processing 10MB files in Hadoop

Posted by CubicDesign <cu...@gmail.com>.
Ok. I have set the number of maps to about 1760 (11 nodes * 16
cores/node * 10 as recommended by the Hadoop documentation) and my job still
takes several hours to run instead of one.

Can the overhead added by Hadoop be that big? I mean I have over 30000
small tasks (about one minute each), each one starting its own JVM.



Re: Processing 10MB files in Hadoop

Posted by Jason Venner <ja...@gmail.com>.
Are the record processing steps bound by a local machine resource - cpu,
disk io or other?

What I often do when I have lots of small files to handle is use the
NLineInputFormat, as data locality for the input files is a much lesser
issue than short task run times in that case.
Each line of my input file would be one of the small files, and then I would
set the number of files per split to be some reasonable number.

If the individual record processing is not bound by local resources, you may
wish to try the MultithreadedMapRunner, which gives you a lot of flexibility
over the number of map executions you run in parallel without needing to
restart your cluster to change the tasks per tracker.
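
A rough sketch of wiring up both suggestions on the old mapred API; the
values (10 lines per split, 16 threads) are placeholders, not recommendations
from this thread:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class SmallFileJobSetup {
  public static void configure(JobConf conf) {
    // NLineInputFormat: each line of the job input names one small file,
    // and each split carries N lines, so one map task handles N small files.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 10);

    // MultithreadedMapRunner: run several map() calls concurrently inside a
    // single task, without touching the per-tasktracker slot count.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    conf.setInt("mapred.map.multithreadedrunner.threads", 16);
  }
}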


On Thu, Nov 26, 2009 at 8:05 AM, Jeff Zhang <zj...@gmail.com> wrote:

> Quote from the wiki doc
>
> *The number of map tasks can also be increased manually using the
> JobConf<http://wiki.apache.org/hadoop/JobConf>'s
> conf.setNumMapTasks(int num). This can be used to increase the number of
> map
> tasks, but will not set the number below that which Hadoop determines via
> splitting the input data.*
>
> So the number of map tasks is determined by the InputFormat.
> But you can manually set the number of reduce tasks to improve the
> performance, because the default number of reduce tasks is 1.
>
>
> Jeff Zhang
>
> On Thu, Nov 26, 2009 at 7:58 AM, CubicDesign <cu...@gmail.com>
> wrote:
>
> > But the documentation DO recommend to set it:
> > http://wiki.apache.org/hadoop/HowManyMapsAndReduces
> >
> >
> >
> > PS: I am using streaming
> >
> >
> >
> >
> > Jeff Zhang wrote:
> >
> >> Actually, you do not need to set the number of map task, the InputFormat
> >> will compute it for you according your input data set.
> >>
> >> Jeff Zhang
> >>
> >>
> >> On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <cu...@gmail.com>
> >> wrote:
> >>
> >>
> >>
> >>>  The number of mapper is determined by your InputFormat.
> >>>
> >>>
> >>>> In common case, if file is smaller than one block size (which is 64M
> by
> >>>> default), one mapper for this file. if file is larger than one block
> >>>> size,
> >>>> hadoop will split this large file, and the number of mapper for this
> >>>> file
> >>>> will be ceiling ( (size of file)/(size of block) )
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>> Hi
> >>>
> >>> Do you mean, I should set the number of map tasks to 1 ????
> >>> I want to process this file not in a single node but over the entire
> >>> cluster. I need a lot of processing power in order to finish the job in
> >>> hours instead of days.
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

Re: Processing 10MB files in Hadoop

Posted by Jeff Zhang <zj...@gmail.com>.
Quote from the wiki doc

*The number of map tasks can also be increased manually using the
JobConf<http://wiki.apache.org/hadoop/JobConf>'s
conf.setNumMapTasks(int num). This can be used to increase the number of map
tasks, but will not set the number below that which Hadoop determines via
splitting the input data.*

So the number of map tasks is determined by the InputFormat.
But you can manually set the number of reduce tasks to improve the
performance, because the default number of reduce tasks is 1.


Jeff Zhang

On Thu, Nov 26, 2009 at 7:58 AM, CubicDesign <cu...@gmail.com> wrote:

> But the documentation DOES recommend setting it:
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>
>
>
> PS: I am using streaming
>
>
>
>
> Jeff Zhang wrote:
>
>> Actually, you do not need to set the number of map task, the InputFormat
>> will compute it for you according your input data set.
>>
>> Jeff Zhang
>>
>>
>> On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <cu...@gmail.com>
>> wrote:
>>
>>
>>
>>>  The number of mapper is determined by your InputFormat.
>>>
>>>
>>>> In common case, if file is smaller than one block size (which is 64M by
>>>> default), one mapper for this file. if file is larger than one block
>>>> size,
>>>> hadoop will split this large file, and the number of mapper for this
>>>> file
>>>> will be ceiling ( (size of file)/(size of block) )
>>>>
>>>>
>>>>
>>>>
>>>>
>>> Hi
>>>
>>> Do you mean, I should set the number of map tasks to 1 ????
>>> I want to process this file not in a single node but over the entire
>>> cluster. I need a lot of processing power in order to finish the job in
>>> hours instead of days.
>>>
>>>
>>>
>>
>>
>>
>

Re: Processing 10MB files in Hadoop

Posted by CubicDesign <cu...@gmail.com>.
But the documentation DOES recommend setting it:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces



PS: I am using streaming
 


Jeff Zhang wrote:
> Actually, you do not need to set the number of map tasks; the InputFormat
> will compute it for you according to your input data set.
>
> Jeff Zhang
>
>
> On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <cu...@gmail.com> wrote:
>
>   
>>  The number of mapper is determined by your InputFormat.
>>     
>>> In common case, if file is smaller than one block size (which is 64M by
>>> default), one mapper for this file. if file is larger than one block size,
>>> hadoop will split this large file, and the number of mapper for this file
>>> will be ceiling ( (size of file)/(size of block) )
>>>
>>>
>>>
>>>       
>> Hi
>>
>> Do you mean, I should set the number of map tasks to 1 ????
>> I want to process this file not in a single node but over the entire
>> cluster. I need a lot of processing power in order to finish the job in
>> hours instead of days.
>>
>>     
>
>   

Re: Processing 10MB files in Hadoop

Posted by Jeff Zhang <zj...@gmail.com>.
Actually, you do not need to set the number of map tasks; the InputFormat
will compute it for you according to your input data set.

Jeff Zhang


On Thu, Nov 26, 2009 at 7:39 AM, CubicDesign <cu...@gmail.com> wrote:

>
>  The number of mapper is determined by your InputFormat.
>>
>> In common case, if file is smaller than one block size (which is 64M by
>> default), one mapper for this file. if file is larger than one block size,
>> hadoop will split this large file, and the number of mapper for this file
>> will be ceiling ( (size of file)/(size of block) )
>>
>>
>>
> Hi
>
> Do you mean, I should set the number of map tasks to 1 ????
> I want to process this file not in a single node but over the entire
> cluster. I need a lot of processing power in order to finish the job in
> hours instead of days.
>

Re: Processing 10MB files in Hadoop

Posted by CubicDesign <cu...@gmail.com>.
> The number of mappers is determined by your InputFormat.
>
> In the common case, if a file is smaller than one block (which is 64MB by
> default), that file gets one mapper. If a file is larger than one block,
> hadoop will split this large file, and the number of mappers for this file
> will be ceiling( (size of file) / (size of block) )
>
>   
Hi

Do you mean, I should set the number of map tasks to 1 ????
I want to process this file not in a single node but over the entire 
cluster. I need a lot of processing power in order to finish the job in 
hours instead of days.

Re: Processing 10MB files in Hadoop

Posted by Jeff Zhang <zj...@gmail.com>.
The number of mappers is determined by your InputFormat.

In the common case, if a file is smaller than one block (which is 64MB by
default), that file gets one mapper. If a file is larger than one block,
hadoop will split this large file, and the number of mappers for this file
will be ceiling( (size of file) / (size of block) )
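
To make that concrete: a 600MB file with the default 64MB block size would get
ceiling(600 / 64) = 10 map tasks, while a 10MB file would get exactly one.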

Jeff Zhang



On Thu, Nov 26, 2009 at 5:42 AM, Siddu <si...@gmail.com> wrote:

> On Thu, Nov 26, 2009 at 5:32 PM, Cubic <cu...@gmail.com> wrote:
>
> > Hi list.
> >
> > I have small files containing data that has to be processed. A file
> > can be small, even down to 10MB (but it can me also 100-600MB large)
> > and contains at least 30000 records to be processed.
> > Processing one record can take 30 seconds to 2 minutes. My cluster is
> > about 10 nodes. Each node has 16 cores.
> >
> Sorry for deviating from the question, but I am curious to know what "core"
> here refers to?
>
>
> > Anybody can give an idea about how to deal with these small files? It
> > is not quite a common Hadoop task; I know. For example, how many map
> > tasks should I set in this case?
> >
>
>
>
> --
> Regards,
> ~Sid~
> I have never met a man so ignorant that i couldn't learn something from him
>

Re: Processing 10MB files in Hadoop

Posted by Siddu <si...@gmail.com>.
On Thu, Nov 26, 2009 at 5:32 PM, Cubic <cu...@gmail.com> wrote:

> Hi list.
>
> I have small files containing data that has to be processed. A file
> can be small, even down to 10MB (but it can also be 100-600MB large)
> and contains at least 30000 records to be processed.
> Processing one record can take 30 seconds to 2 minutes. My cluster is
> about 10 nodes. Each node has 16 cores.
>
Sorry for deviating from the question, but I am curious to know what "core"
here refers to?


> Anybody can give an idea about how to deal with these small files? It
> is not quite a common Hadoop task; I know. For example, how many map
> tasks should I set in this case?
>



-- 
Regards,
~Sid~
I have never met a man so ignorant that i couldn't learn something from him

Re: Good idea to run NameNode and JobTracker on same machine?

Posted by Jeff Zhang <zj...@gmail.com>.
It depends on the size of your cluster. I think you can combine them
if your cluster has fewer than 10 machines.


Jeff Zhang




On Thu, Nov 26, 2009 at 6:26 AM, Raymond Jennings III <raymondjiii@yahoo.com
> wrote:

> Do people normally combine these two processes onto one machine?  Currently
> I have them on separate machines, but I am wondering whether they use that much
> CPU processing time and whether I should combine them and create another DataNode.
>
>
>
>

Re: Hadoop 0.20 map/reduce Failing for old API

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Nov 27, 2009 at 10:46 AM, Arv Mistry <ar...@kindsight.net> wrote:
> Thanks Rekha, I was missing the new library
> (hadoop-0.20.1-hdfs-core.jar) in my client.
>
> It seems to run a little further but I'm now getting a
> ClassCastException returned by the mapper. Note, this worked with the
> 0.19 load, so I'm assuming there's something additional in the
> configuration that I'm missing. Can anyone help?
>
> java.lang.ClassCastException: org.apache.hadoop.mapred.MultiFileSplit
> cannot be cast to org.apache.hadoop.mapred.FileSplit
>        at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat
> .java:54)
>        at
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Cheers Arv
>
> -----Original Message-----
> From: Rekha Joshi [mailto:rekhajos@yahoo-inc.com]
> Sent: November 26, 2009 11:45 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop 0.20 map/reduce Failing for old API
>
> The exit status of 1 usually indicates configuration issues, incorrect
> command invocation in hadoop 0.20 (incorrect params), if not JVM crash.
> In your logs there is no indication of crash, but some paths/command can
> be the cause. Can you check if your lib paths/data paths are correct?
>
> If it is a memory intensive task, you may also try values on
> mapred.child.java.opts /mapred.job.map.memory.mb.Thanks!
>
> On 11/27/09 1:28 AM, "Arv Mistry" <ar...@kindsight.net> wrote:
>
> Hi,
>
> We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be
> working fine, but the map/reduce jobs are failing with the following
> exception. Note, we have not moved to the new map/reduce API yet. In the
> client that launches the job, the only change I have made is to now load
> the three files; core-site.xml, hdfs-site.xml and mapred-site.xml rather
> than the hadoop-site.xml. Any ideas?
>
> INFO   | jvm 1    | 2009/11/26 13:47:26 | 2009-11-26 13:47:26,328 INFO
> [FileInputFormat] Total input paths to process : 711
> INFO   | jvm 1    | 2009/11/26 13:47:28 | 2009-11-26 13:47:28,033 INFO
> [JobClient] Running job: job_200911241319_0003
> INFO   | jvm 1    | 2009/11/26 13:47:29 | 2009-11-26 13:47:29,036 INFO
> [JobClient]  map 0% reduce 0%
> INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,068 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_000003_0, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:47:36 | java.io.IOException: Task
> process exit with nonzero status of 1.
> INFO   | jvm 1    | 2009/11/26 13:47:36 |       at
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> INFO   | jvm 1    | 2009/11/26 13:47:36 |
> INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,094 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_000003_0&filter=stdout
> INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,096 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_000003_0&filter=stderr
> INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,162 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_000000_0, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:47:51 | java.io.IOException: Task
> process exit with nonzero status of 1.
> INFO   | jvm 1    | 2009/11/26 13:47:51 |       at
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> INFO   | jvm 1    | 2009/11/26 13:47:51 |
> INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,166 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_000000_0&filter=stdout
> INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,167 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_000000_0&filter=stderr
> INFO   | jvm 1    | 2009/11/26 13:47:52 | 2009-11-26 13:47:52,173 INFO
> [JobClient]  map 50% reduce 0%
> INFO   | jvm 1    | 2009/11/26 13:48:03 | 2009-11-26 13:48:03,219 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_000001_0, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:48:03 | Map output lost, rescheduling:
> getMapOutput(attempt_200911241319_0003_m_000001_0,0) failed :
> INFO   | jvm 1    | 2009/11/26 13:48:03 |
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200911241319_0003/attempt_200911241319_0003_m_0
> 00001_0/output/file.out.index in any of the configured local directories
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathT
> oRead(LocalDirAllocator.java:389)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAlloca
> tor.java:138)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.
> java:2886)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:2
> 16)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandler
> Collection.java:230)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.Server.handle(Server.java:324)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConne
> ction.java:864)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:
> 409)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java
> :522)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |
> INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,235 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_000000_1, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
> process exit with nonzero status of 1.
> INFO   | jvm 1    | 2009/11/26 13:48:06 |       at
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> INFO   | jvm 1    | 2009/11/26 13:48:06 |
> INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
> process exit with nonzero status of 1.
> INFO   | jvm 1    | 2009/11/26 13:48:06 |       at
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> INFO   | jvm 1    | 2009/11/26 13:48:06 |
> INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,239 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_000000_1&filter=stdout
> INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,245 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_000000_1&filter=stderr
> INFO   | jvm 1    | 2009/11/26 13:48:13 | 2009-11-26 13:48:13,302 INFO
> [JobClient]  map 0% reduce 0%
> INFO   | jvm 1    | 2009/11/26 13:48:16 | 2009-11-26 13:48:16,315 INFO
> [JobClient]  map 50% reduce 0%
> INFO   | jvm 1    | 2009/11/26 13:48:18 | 2009-11-26 13:48:18,324 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_000000_2, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:48:18 | java.io.IOException: Task
> process exit with nonzero status of 1.
>
>
>

Based on your just adding one jar file and now seeing a
ClassCastException, your upgrade may have problems. Did you try to
upgrade in the same hadoop directory and possibly leave files from the
old install in the same directories as the new ones?

RE: Hadoop 0.20 map/reduce Failing for old API

Posted by Arv Mistry <ar...@kindsight.net>.
Thanks Rekha, I was missing the new library
(hadoop-0.20.1-hdfs-core.jar) in my client.

It seems to run a little further but I'm now getting a
ClassCastException returned by the mapper. Note, this worked with the
0.19 load, so I'm assuming there's something additional in the
configuration that I'm missing. Can anyone help?

java.lang.ClassCastException: org.apache.hadoop.mapred.MultiFileSplit
cannot be cast to org.apache.hadoop.mapred.FileSplit
	at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat
.java:54)
	at
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

Cheers Arv

-----Original Message-----
From: Rekha Joshi [mailto:rekhajos@yahoo-inc.com] 
Sent: November 26, 2009 11:45 PM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop 0.20 map/reduce Failing for old API

The exit status of 1 usually indicates configuration issues, incorrect
command invocation in hadoop 0.20 (incorrect params), if not JVM crash.
In your logs there is no indication of crash, but some paths/command can
be the cause. Can you check if your lib paths/data paths are correct?

If it is a memory intensive task, you may also try values on
mapred.child.java.opts /mapred.job.map.memory.mb.Thanks!

On 11/27/09 1:28 AM, "Arv Mistry" <ar...@kindsight.net> wrote:

Hi,

We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be
working fine, but the map/reduce jobs are failing with the following
exception. Note, we have not moved to the new map/reduce API yet. In the
client that launches the job, the only change I have made is to now load
the three files; core-site.xml, hdfs-site.xml and mapred-site.xml rather
than the hadoop-site.xml. Any ideas?

INFO   | jvm 1    | 2009/11/26 13:47:26 | 2009-11-26 13:47:26,328 INFO
[FileInputFormat] Total input paths to process : 711
INFO   | jvm 1    | 2009/11/26 13:47:28 | 2009-11-26 13:47:28,033 INFO
[JobClient] Running job: job_200911241319_0003
INFO   | jvm 1    | 2009/11/26 13:47:29 | 2009-11-26 13:47:29,036 INFO
[JobClient]  map 0% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,068 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000003_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:47:36 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:47:36 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:47:36 |
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,094 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000003_0&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,096 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000003_0&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,162 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:47:51 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:47:51 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:47:51 |
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,166 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_0&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,167 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_0&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:47:52 | 2009-11-26 13:47:52,173 INFO
[JobClient]  map 50% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:03 | 2009-11-26 13:48:03,219 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000001_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:03 | Map output lost, rescheduling:
getMapOutput(attempt_200911241319_0003_m_000001_0,0) failed :
INFO   | jvm 1    | 2009/11/26 13:48:03 |
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200911241319_0003/attempt_200911241319_0003_m_0
00001_0/output/file.out.index in any of the configured local directories
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathT
oRead(LocalDirAllocator.java:389)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAlloca
tor.java:138)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.
java:2886)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:2
16)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandler
Collection.java:230)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.Server.handle(Server.java:324)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConne
ction.java:864)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:
409)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java
:522)
INFO   | jvm 1    | 2009/11/26 13:48:03 |
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,235 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_1, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:48:06 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:48:06 |
INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:48:06 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:48:06 |
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,239 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_1&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,245 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_1&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:48:13 | 2009-11-26 13:48:13,302 INFO
[JobClient]  map 0% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:16 | 2009-11-26 13:48:16,315 INFO
[JobClient]  map 50% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:18 | 2009-11-26 13:48:18,324 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_2, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:18 | java.io.IOException: Task
process exit with nonzero status of 1.



Re: Hadoop 0.20 map/reduce Failing for old API

Posted by Rekha Joshi <re...@yahoo-inc.com>.
The exit status of 1 usually indicates configuration issues or an incorrect command invocation in hadoop 0.20 (incorrect params), if not a JVM crash.
In your logs there is no indication of a crash, but some path or command could be the cause. Can you check whether your lib paths/data paths are correct?

If it is a memory-intensive task, you may also try values for mapred.child.java.opts / mapred.job.map.memory.mb. Thanks!
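
For example, a sketch of trying those two knobs on the job configuration
(old mapred API; the values are illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class MemorySettings {
  public static void apply(JobConf conf) {
    // Heap for each child task JVM (illustrative value).
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    // Memory requested per map task, in MB (illustrative value).
    conf.setLong("mapred.job.map.memory.mb", 1536);
  }
}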

On 11/27/09 1:28 AM, "Arv Mistry" <ar...@kindsight.net> wrote:

Hi,

We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be
working fine, but the map/reduce jobs are failing with the following
exception. Note, we have not moved to the new map/reduce API yet. In the
client that launches the job, the only change I have made is to now load
the three files; core-site.xml, hdfs-site.xml and mapred-site.xml rather
than the hadoop-site.xml. Any ideas?

INFO   | jvm 1    | 2009/11/26 13:47:26 | 2009-11-26 13:47:26,328 INFO
[FileInputFormat] Total input paths to process : 711
INFO   | jvm 1    | 2009/11/26 13:47:28 | 2009-11-26 13:47:28,033 INFO
[JobClient] Running job: job_200911241319_0003
INFO   | jvm 1    | 2009/11/26 13:47:29 | 2009-11-26 13:47:29,036 INFO
[JobClient]  map 0% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,068 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000003_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:47:36 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:47:36 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:47:36 |
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,094 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000003_0&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,096 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000003_0&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,162 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:47:51 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:47:51 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:47:51 |
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,166 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_0&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,167 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_0&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:47:52 | 2009-11-26 13:47:52,173 INFO
[JobClient]  map 50% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:03 | 2009-11-26 13:48:03,219 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000001_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:03 | Map output lost, rescheduling:
getMapOutput(attempt_200911241319_0003_m_000001_0,0) failed :
INFO   | jvm 1    | 2009/11/26 13:48:03 |
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200911241319_0003/attempt_200911241319_0003_m_0
00001_0/output/file.out.index in any of the configured local directories
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathT
oRead(LocalDirAllocator.java:389)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAlloca
tor.java:138)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.
java:2886)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:2
16)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandler
Collection.java:230)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.Server.handle(Server.java:324)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConne
ction.java:864)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:
409)
INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java
:522)
INFO   | jvm 1    | 2009/11/26 13:48:03 |
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,235 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_1, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:48:06 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:48:06 |
INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:48:06 |       at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:48:06 |
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,239 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_1&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,245 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_1&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:48:13 | 2009-11-26 13:48:13,302 INFO
[JobClient]  map 0% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:16 | 2009-11-26 13:48:16,315 INFO
[JobClient]  map 50% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:18 | 2009-11-26 13:48:18,324 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_2, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:18 | java.io.IOException: Task
process exit with nonzero status of 1.



Hadoop 0.20 map/reduce Failing for old API

Posted by Arv Mistry <ar...@kindsight.net>.
Hi,

We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be
working fine, but the map/reduce jobs are failing with the following
exception. Note, we have not moved to the new map/reduce API yet. In the
client that launches the job, the only change I have made is to now load
the three files; core-site.xml, hdfs-site.xml and mapred-site.xml rather
than the hadoop-site.xml. Any ideas?

INFO   | jvm 1    | 2009/11/26 13:47:26 | 2009-11-26 13:47:26,328 INFO
[FileInputFormat] Total input paths to process : 711
INFO   | jvm 1    | 2009/11/26 13:47:28 | 2009-11-26 13:47:28,033 INFO
[JobClient] Running job: job_200911241319_0003
INFO   | jvm 1    | 2009/11/26 13:47:29 | 2009-11-26 13:47:29,036 INFO
[JobClient]  map 0% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,068 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000003_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:47:36 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:47:36 | 	at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:47:36 | 
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,094 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000003_0&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,096 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000003_0&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,162 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:47:51 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:47:51 | 	at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:47:51 | 
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,166 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_0&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,167 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_0&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:47:52 | 2009-11-26 13:47:52,173 INFO
[JobClient]  map 50% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:03 | 2009-11-26 13:48:03,219 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000001_0, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:03 | Map output lost, rescheduling:
getMapOutput(attempt_200911241319_0003_m_000001_0,0) failed :
INFO   | jvm 1    | 2009/11/26 13:48:03 |
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200911241319_0003/attempt_200911241319_0003_m_0
00001_0/output/file.out.index in any of the configured local directories
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathT
oRead(LocalDirAllocator.java:389)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAlloca
tor.java:138)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.
java:2886)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:2
16)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandler
Collection.java:230)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.Server.handle(Server.java:324)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConne
ction.java:864)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:
409)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 	at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java
:522)
INFO   | jvm 1    | 2009/11/26 13:48:03 | 
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,235 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_1, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:48:06 | 	at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:48:06 | 
INFO   | jvm 1    | 2009/11/26 13:48:06 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1    | 2009/11/26 13:48:06 | 	at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1    | 2009/11/26 13:48:06 | 
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,239 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_1&filter=stdout
INFO   | jvm 1    | 2009/11/26 13:48:06 | 2009-11-26 13:48:06,245 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_000000_1&filter=stderr
INFO   | jvm 1    | 2009/11/26 13:48:13 | 2009-11-26 13:48:13,302 INFO
[JobClient]  map 0% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:16 | 2009-11-26 13:48:16,315 INFO
[JobClient]  map 50% reduce 0%
INFO   | jvm 1    | 2009/11/26 13:48:18 | 2009-11-26 13:48:18,324 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_000000_2, Status :
FAILED
INFO   | jvm 1    | 2009/11/26 13:48:18 | java.io.IOException: Task
process exit with nonzero status of 1.


Re: Good idea to run NameNode and JobTracker on same machine?

Posted by Aaron Kimball <aa...@cloudera.com>.
The real kicker is going to be memory consumption of one or both of these
services. The NN in particular uses a large amount of RAM to store the
filesystem image. I think that those who are suggesting a breakeven point of
<= 10 nodes are lowballing. In practice, unless your machines are really
thin on the RAM (e.g., 2--4 GB), I haven't seen any cases where these
services need to be separated below the 20-node mark; I've also seen several
clusters of 40 nodes running fine with these services colocated. It depends
on how many files are in HDFS and how frequently you're submitting a lot of
concurrent jobs to MapReduce.

If you're setting up a production environment that you plan to expand,
however, as a best practice you should configure the master node to have two
hostnames (e.g., "nn" and "jt") so that you can have separate hostnames in
fs.default.name and mapred.job.tracker; when the day comes that these
services are placed on different nodes, you'll then be able to just move one
of the hostnames over and not need to reconfigure all 20--40 other nodes.
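
A sketch of what that looks like in the site files; the hostnames "nn" and
"jt" and the ports are placeholders. In core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://nn:8020</value>
</property>

and in mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>jt:8021</value>
</property>

Both hostnames can resolve to the same machine today; moving one service later
then only means repointing its hostname (DNS or hosts entry) rather than
editing the configuration on every worker node.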

- Aaron

On Thu, Nov 26, 2009 at 8:27 PM, Srigurunath Chakravarthi <
sriguru@yahoo-inc.com> wrote:

> Raymond,
> Load wise, it should be very safe to run both JT and NN on a single node
> for small clusters (< 40 Task Trackers and/or Data Nodes). They don't use
> much CPU as such.
>
>  This may even work for larger clusters depending on the type of hardware
> you have and the Hadoop job mix. We usually observe < 5% CPU load with ~80
> DNs/TTs on an 8-core Intel processor based box with 16GB RAM.
>
>  It is best that you observe CPU & mem load on the JT+NN node to take a
> call on whether to separate them. iostat, top or sar should tell you.
>
> Regards,
> Sriguru
>
> >-----Original Message-----
> >From: John Martyniak [mailto:john@beforedawnsolutions.com]
> >Sent: Friday, November 27, 2009 3:06 AM
> >To: common-user@hadoop.apache.org
> >Cc: <co...@hadoop.apache.org>
> >Subject: Re: Good idea to run NameNode and JobTracker on same machine?
> >
> >I have a cluster of 4 machines plus one machine to run nn & jt.  I
> >have heard that 5 or 6 is the magic #.  I will see when I add the next
> >batch of machines.
> >
> >And it seems to running fine.
> >
> >-Jogn
> >
> >On Nov 26, 2009, at 11:38 AM, Yongqiang He <he...@gmail.com>
> >wrote:
> >
> >> I think it is definitely not a good idea to combine these two in
> >> production
> >> environment.
> >>
> >> Thanks
> >> Yongqiang
> >> On 11/26/09 6:26 AM, "Raymond Jennings III" <ra...@yahoo.com>
> >> wrote:
> >>
> >>> Do people normally combine these two processes onto one machine?
> >>> Currently I
> >>> have them on separate machines but I am wondering they use that
> >>> much CPU
> >>> processing time and maybe I should combine them and create another
> >>> DataNode.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
>

RE: Good idea to run NameNode and JobTracker on same machine?

Posted by Srigurunath Chakravarthi <sr...@yahoo-inc.com>.
Raymond,
Load wise, it should be very safe to run both JT and NN on a single node for small clusters (< 40 Task Trackers and/or Data Nodes). They don't use much CPU as such.

 This may even work for larger clusters depending on the type of hardware you have and the Hadoop job mix. We usually observe < 5% CPU load with ~80 DNs/TTs on an 8-core Intel processor based box with 16GB RAM.

 It is best that you observe CPU & mem load on the JT+NN node to take a call on whether to separate them. iostat, top or sar should tell you.

Regards,
Sriguru

>-----Original Message-----
>From: John Martyniak [mailto:john@beforedawnsolutions.com]
>Sent: Friday, November 27, 2009 3:06 AM
>To: common-user@hadoop.apache.org
>Cc: <co...@hadoop.apache.org>
>Subject: Re: Good idea to run NameNode and JobTracker on same machine?
>
>I have a cluster of 4 machines plus one machine to run nn & jt.  I
>have heard that 5 or 6 is the magic #.  I will see when I add the next
>batch of machines.
>
>And it seems to running fine.
>
>-Jogn
>
>On Nov 26, 2009, at 11:38 AM, Yongqiang He <he...@gmail.com>
>wrote:
>
>> I think it is definitely not a good idea to combine these two in
>> production
>> environment.
>>
>> Thanks
>> Yongqiang
>> On 11/26/09 6:26 AM, "Raymond Jennings III" <ra...@yahoo.com>
>> wrote:
>>
>>> Do people normally combine these two processes onto one machine?
>>> Currently I
>>> have them on separate machines but I am wondering they use that
>>> much CPU
>>> processing time and maybe I should combine them and create another
>>> DataNode.
>>>
>>>
>>>
>>>
>>>
>>
>>

Re: Good idea to run NameNode and JobTracker on same machine?

Posted by John Martyniak <jo...@beforedawnsolutions.com>.
I have a cluster of 4 machines plus one machine to run nn & jt.  I  
have heard that 5 or 6 is the magic #.  I will see when I add the next  
batch of machines.

And it seems to be running fine.

-John

On Nov 26, 2009, at 11:38 AM, Yongqiang He <he...@gmail.com>  
wrote:

> I think it is definitely not a good idea to combine these two in  
> production
> environment.
>
> Thanks
> Yongqiang
> On 11/26/09 6:26 AM, "Raymond Jennings III" <ra...@yahoo.com>  
> wrote:
>
>> Do people normally combine these two processes onto one machine?   
>> Currently I
>> have them on separate machines but I am wondering they use that  
>> much CPU
>> processing time and maybe I should combine them and create another  
>> DataNode.
>>
>>
>>
>>
>>
>
>

Re: Good idea to run NameNode and JobTracker on same machine?

Posted by Yongqiang He <he...@gmail.com>.
I think it is definitely not a good idea to combine these two in a production
environment.

Thanks
Yongqiang
On 11/26/09 6:26 AM, "Raymond Jennings III" <ra...@yahoo.com> wrote:

> Do people normally combine these two processes onto one machine?  Currently I
> have them on separate machines, but I am wondering whether they use that much CPU
> processing time and whether I should combine them and create another DataNode.
> 
> 
>       
> 
> 



Good idea to run NameNode and JobTracker on same machine?

Posted by Raymond Jennings III <ra...@yahoo.com>.
Do people normally combine these two processes onto one machine?  Currently I have them on separate machines, but I am wondering whether they use that much CPU processing time and whether I should combine them and create another DataNode.


      

Re: Processing 10MB files in Hadoop

Posted by Yongqiang He <he...@gmail.com>.
Try CombineFileInputFormat.

Thanks
Yongqiang
On 11/26/09 4:02 AM, "Cubic" <cu...@gmail.com> wrote:

> Hi list.
> 
> I have small files containing data that has to be processed. A file
> can be small, even down to 10MB (but it can also be 100-600MB large)
> and contains at least 30000 records to be processed.
> Processing one record can take 30 seconds to 2 minutes. My cluster is
> about 10 nodes. Each node has 16 cores.
> 
> Anybody can give an idea about how to deal with these small files? It
> is not quite a common Hadoop task; I know. For example, how many map
> tasks should I set in this case?
> 
>