Posted to common-user@hadoop.apache.org by dave bayer <da...@cloudfactory.org> on 2009/09/29 02:30:36 UTC

dfs create block sticking

On a cluster running 0.19.2

We have some production jobs that perform ETL tasks and open files
in hdfs during the reduce task (with speculative execution in the reduce stage
programmatically turned off). Since upgrading the cluster from 0.19.1, we've
been seeing some odd behavior: timeouts on block/file creation, long enough
that the reduce attempt gets killed. Subsequent reduce attempts then fail
because the first, killed attempt is still recorded (by the namenode, I assume)
as the creator of the block/file, according to the exception that bubbles up.
I didn't see anything like this in JIRA, and I'm trying to grab a few jstacks
from the namenode when these errors pop up (usually correlated with a somewhat
busy cluster) in an effort to get some idea of what is going on here.
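
(For reference, we disable reduce-side speculation programmatically, roughly as
in the sketch below; it is only a sketch, and MyEtlJob is just a placeholder
class name:)

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyEtlJob.class);  // MyEtlJob is only a placeholder
    conf.setReduceSpeculativeExecution(false);   // same effect as the property below
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);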

Currently the cluster is small, with about 5 data nodes and tens of TBs, and
2x the namespace files easily fits in memory. I don't see any process eating
more than a couple percent of CPU on the namenode box (which also hosts the
secondary namenode). iostat shows 100-200 blocks read/written every other
second on this host, leaving plenty of headroom there. The cluster is
scheduled to grow in the near future, which may worsen this hang/blocking
if it's due to a bottleneck.

Before I start tracing through the code, I thought I might ask whether anyone
has seen anything like the excerpts from the jobtracker logs below. Is there
a way to guarantee that everything in flight for a given reduce attempt is
terminated (and any associated network connections sent a reset or something)
before a new attempt of that reduce task is started?

As a kind of side thought: is the task attempt name in the JobConf that is
handed to the reducer in configure(), and if so, what might the setting name
be to get at it? Or does one need to go through a more circuitous route to
obtain the TaskAttemptID associated with the attempt?
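
(Roughly what I am hoping is possible; "mapred.task.id" is only my guess at
the key name:)

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TaskAttemptID;

    public void configure(JobConf job) {
        // Guessing at the key name here: does the jobconf carry the attempt name?
        String attemptStr = job.get("mapred.task.id");
        if (attemptStr != null) {
            TaskAttemptID attemptId = TaskAttemptID.forName(attemptStr);
            System.out.println("running as " + attemptId);
        }
    }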

Back to the point at hand, from the jobtracker logs:

Failing initial reduce:
----------------------------
2009-09-27 22:24:25,056 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200909231347_0694_r_000002_0: java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/X.X.X.2:47440 remote=/X.X.X.2:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116)
        at java.io.DataInputStream.readByte(DataInputStream.java:248)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2787)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2712)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)

Failing second reduce:
-------------------------------
2009-09-27 22:53:22,048 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200909231347_0694_r_000002_3: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file >blah< for DFSClient_attempt_200909231347_0694_r_000002_3 on client X.X.X.7, because this file is already being created by DFSClient_attempt_200909231347_0694_r_000002_0 on X.X.X.2
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1085)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:998)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:301)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

        at org.apache.hadoop.ipc.Client.call(Client.java:697)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy1.create(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy1.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2594)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:188)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)


Many thanks...

dave bayer

Re: dfs create block sticking

Posted by Jason Venner <ja...@gmail.com>.
I had a problem like that with a custom record writer - see SOLR-1301.

-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

Re: dfs create block sticking

Posted by Chandraprakash Bhagtani <cp...@gmail.com>.
I faced the org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException
exception once. I was overriding FileOutputFormat in some class, and in it I
had opened a file stream. I did this because I needed only a single file as
output. It worked fine when I had only one reducer, but when I increased the
number of reducers, every reducer was trying to create/use a file with the
same name, so I got AlreadyBeingCreatedException.

Your case may be different, but I thought I'd share mine.
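
A rough illustration of the kind of change that avoids the clash (a sketch
only; I believe the old mapred FileOutputFormat has helpers along these lines,
and the method name here is just an example):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: inside a custom output format, derive a per-task file name so
    // two reducers never ask the namenode to create the same file.
    FSDataOutputStream openPerTaskStream(JobConf job) throws IOException {
        String unique = FileOutputFormat.getUniqueName(job, "part");  // e.g. "part-00002"
        Path out = FileOutputFormat.getTaskOutputPath(job, unique);
        return out.getFileSystem(job).create(out);
    }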


-- 
Thanks & Regards,
Chandra Prakash Bhagtani,

Re: dfs create block sticking

Posted by Jason Venner <ja...@gmail.com>.
How long does it take to create a file on one of your datanodes, in the dfs
block storage area, while your job is running? It could simply be that the
OS-level file creation is taking longer than the RPC timeout.
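
Something like this quick-and-dirty check, run directly on a datanode, would
tell you (the directory below is only an example; point it at your actual
dfs.data.dir):

    import java.io.File;
    import java.io.IOException;

    // Rough timing of plain OS-level file creation on a datanode disk.
    public class CreateTiming {
        public static void main(String[] args) throws IOException {
            File dir = new File("/data/dfs/data/timing-probe");  // example path only
            dir.mkdirs();
            long start = System.currentTimeMillis();
            for (int i = 0; i < 100; i++) {
                new File(dir, "probe-" + i).createNewFile();
            }
            System.out.println("100 creates took "
                + (System.currentTimeMillis() - start) + " ms");
        }
    }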


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals