Posted to common-user@hadoop.apache.org by Ben Hardy <be...@gmail.com> on 2010/01/25 20:01:49 UTC

HDFS behaving strangely

Hey folks,

We're running a 100 node cluster on Hadoop 0.18.3 using Amazon Elastic
MapReduce.

We've been uploading data to this cluster via SCP and using hadoop fs
-copyFromLocal to get it into HDFS.

Generally this works fine, but on our last run this operation failed with
nothing more than a "RuntimeError" message.

So we blew away the destination directory in HDFS and tried the
copyFromLocal again.

This time it failed because it thinks one of the files it's trying to copy
to HDFS is already there. I don't see how that's possible if we just blew
away the destination's parent directory. Subsequent attempts produce
identical results.

hadoop fsck reports a HEALTHY filesystem.

We do see a lot of messages like those below in the namenode log. Are these
normal, or perhaps related to the problem described above?

Would appreciate any advice or suggestions.

b

2010-01-25 16:34:19,762 INFO org.apache.hadoop.dfs.StateChange (IPC Server
handler 12 on 9000): BLOCK* NameSystem.addToInvalidates:
blk_-3060969094589165545 is added to invalidSet of 10.245.103.240:9200
2010-01-25 16:34:19,762 INFO org.apache.hadoop.dfs.StateChange (IPC Server
handler 12 on 9000): BLOCK* NameSystem.addToInvalidates:
blk_-3060969094589165545 is added to invalidSet of 10.242.25.206:9200
2010-01-25 16:34:19,762 INFO org.apache.hadoop.dfs.StateChange (IPC Server
handler 12 on 9000): BLOCK* NameSystem.addToInvalidates:
blk_5935615666845780861 is added to invalidSet of 10.242.15.111:9200
2010-01-25 16:34:19,762 INFO org.apache.hadoop.dfs.StateChange (IPC Server
handler 12 on 9000): BLOCK* NameSystem.addToInvalidates:
blk_5935615666845780861 is added to invalidSet of 10.244.107.18:9200
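For what it's worth, those addToInvalidates lines are routine INFO messages: when a file is deleted, the namenode queues each of its blocks for removal on every datanode holding a replica. A small, hypothetical helper to tally them per datanode (the regex simply matches the line format shown above):

```python
import re

# Matches the addToInvalidates lines shown above (format as seen in Hadoop 0.18.x logs).
INVALIDATE = re.compile(
    r"addToInvalidates:\s+(blk_-?\d+) is added to invalidSet of ([\d.]+:\d+)"
)

def blocks_per_datanode(log_lines):
    """Map each datanode address to the block IDs queued for deletion on it."""
    result = {}
    for line in log_lines:
        m = INVALIDATE.search(line)
        if m:
            block_id, datanode = m.groups()
            result.setdefault(datanode, []).append(block_id)
    return result
```

Feeding it the excerpt above would show each block queued on the two datanodes holding its replicas, which is exactly what you'd expect after deleting files with replication factor 2.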

Re: HDFS behaving strangely

Posted by Mark Kerzner <ma...@gmail.com>.
You may be facing another well-known problem in Hadoop: using many small
files.

http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
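For context, the core of the small-files problem is that every file, directory, and block is an object in the namenode's memory, so millions of tiny files exhaust it. The usual mitigation is to pack small files into a container format (SequenceFiles, Hadoop archives). Here is a toy sketch of the packing idea only, not the Hadoop SequenceFile API:

```python
def pack(files):
    """Pack a {name: bytes} mapping into (index, blob).

    Toy container format: one concatenated blob plus an index of
    (offset, length) per name, so many small files cost one object.
    """
    index, chunks, offset = {}, [], 0
    for name, data in sorted(files.items()):
        index[name] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return index, b"".join(chunks)

def unpack(index, blob, name):
    """Retrieve one packed file's contents by name."""
    off, length = index[name]
    return blob[off:off + length]
```

A real SequenceFile additionally stores keys, sync markers, and optional compression, but the index-into-one-blob idea is the same.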

On Mon, Jan 25, 2010 at 7:38 PM, Ben Hardy <be...@gmail.com> wrote:

> [quoted text trimmed]

Re: HDFS behaving strangely

Posted by Ben Hardy <be...@gmail.com>.
For me the cause of this problem turned out to be a bug in Linux 2.6.21,
which is used in the default Elastic MapReduce AMI we run on c1-mediums.

What was going on is that in one particular directory containing 15,000-odd
files, some files were appearing TWICE in the output of filesystem commands
like find and ls, even though all the filenames are unique. Really weird. So
when hadoop tried to copy these files into HDFS, it quite rightly complained
that it had seen that file before.
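If you suspect you're hitting the same bug, a quick sanity check before uploading is to count how often each name appears in the directory listing. A hypothetical helper:

```python
from collections import Counter

def duplicate_entries(listing):
    """Return names that appear more than once in a directory listing.

    `listing` is a list of filenames, e.g. the output of `ls -1` or
    `find . -maxdepth 1` split into lines. On a healthy filesystem each
    entry appears exactly once; any duplicates suggest the readdir bug
    described above.
    """
    return sorted(name for name, n in Counter(listing).items() if n > 1)
```

For example, `duplicate_entries(subprocess.run(["ls", "-1", path], capture_output=True, text=True).stdout.splitlines())` should return an empty list on a healthy directory.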

So watch out for that one, folks. It's a doozy, and while it's not a Hadoop
bug, it might still bite you.

-b

On Mon, Jan 25, 2010 at 11:08 AM, Mark Kerzner <ma...@gmail.com> wrote:

> [quoted text trimmed]

Re: HDFS behaving strangely

Posted by Mark Kerzner <ma...@gmail.com>.
I hit this error in -copyFromLocal, or a similar one, all the time. It also
occurs in 0.19 and 0.20.

One can work around it manually: copy the file to a different place in HDFS,
remove the offending file in HDFS, then rename your copy to the problem
path. This works, and afterwards I have no problem.
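Sketched as a shell sequence, that workaround looks like the following. The hadoop fs commands (with hypothetical paths) need a live cluster, so the runnable part below mirrors the same stage/remove/rename steps on the local filesystem:

```shell
# On a real cluster the sequence would be (paths hypothetical):
#   hadoop fs -copyFromLocal job.prop /tmp/job.prop.staged
#   hadoop fs -rm /user/hadoop/conf/job.prop
#   hadoop fs -mv /tmp/job.prop.staged /user/hadoop/conf/job.prop
# Local-filesystem demo of the same stage/remove/rename dance:
dir=$(mktemp -d)
printf 'old' > "$dir/job.prop"             # the "offending" destination file
printf 'new' > "$dir/job.prop.local"       # the replacement we want in place
cp "$dir/job.prop.local" "$dir/job.prop.staged"   # stage under a temp name
rm -f "$dir/job.prop"                      # remove the offending file
mv "$dir/job.prop.staged" "$dir/job.prop"  # rename into place
cat "$dir/job.prop"
```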

The funny thing is that it happens only for a few specific file names. For
example, job.prop always gives a problem, whereas job.properties does not.

If I were a good boy, I would debug it with a "job.prop" file, but of course
I just found a workaround and forgot about it.

Sincerely,
Mark

On Mon, Jan 25, 2010 at 1:01 PM, Ben Hardy <be...@gmail.com> wrote:

> [quoted text trimmed]