Posted to common-user@hadoop.apache.org by C G <pa...@yahoo.com> on 2008/02/27 21:05:57 UTC

Solving the "hang" problem in dfs -copyToLocal/-cat...

Hi All:
   
  The following write-up is offered to help out anybody else who has seen performance problems and "hangs" while using dfs -copyToLocal/-cat.
   
  One of the performance problems that has been causing big problems for us has been using the dfs commands -copyToLocal and -cat to move data from HDFS to a local file system.  We do this in order to populate a data warehouse that is HDFS-unaware.
   
  The "pattern" I've been using is:
   
  rm -f loadfile.dat
  fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
  for x in `echo ${fileList}`
  do
     bin/hadoop dfs -cat ${x} >> loadfile.dat
  done
   
  This pattern repeats several times, ultimately cat-ing 353 files into several load files.  The process is extremely slow, often taking 20-30 minutes to transfer 142M of data.  More frustrating is that the system simply "pauses" during the cat operations: there is no I/O activity, no CPU activity, and nothing written to the log files on any node.  Things just stop.  I changed the pattern to use -copyToLocal instead of -cat and got the same results.  We observe this "pause" behavior regardless of where the -copyToLocal or -cat originates - I've tried running directly on the grid, and also on the DB server, which is not part of the grid proper.  I've tried many different releases of Hadoop, including 0.16.0, and all exhibit this problem.
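   
  One thing I have not actually tried yet, but which might help anyone digging into this further: take a thread dump of the stuck dfs client while it is "paused" to see where it is blocked.  A rough sketch (the FsShell class name in the grep is from memory, so adjust it for your version):
   
  # find the hung "hadoop dfs" client JVM and make it dump its threads;
  # SIGQUIT prints the dump to the client's stdout without killing it
  pid=`ps aux | grep org.apache.hadoop.fs.FsShell | grep -v grep | awk '{print $2}'`
  kill -QUIT ${pid}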
   
  I decided to try a different approach and use the HTTP interface to the namenode to transfer the data:
   
  rm -f loadfile.dat
  fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
  for x in `echo ${fileList}`
  do
   wget -q http://mynamenodeserver:50070/data${x}
  done
   
  There is then a trivial step to merge the individual part files into one file before loading the data.
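   
  The merge itself is just cat-ing the downloaded part files back together - something like this should work, assuming the wget loop above dropped the part files into the current working directory (by default wget names each download after the last component of the URL):
   
  rm -f loadfile.dat
  for x in part-*
  do
     cat ${x} >> loadfile.dat
     rm -f ${x}
  done
   
  Alternatively, using "wget -q -O -" inside the loop and appending its stdout to loadfile.dat would avoid the separate merge pass entirely.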
   
  I ran this experiment across 10,850 files containing an aggregate total of 4.6G of data.  It ran in under 2 hours, which, while not great, is significantly better than the 18 hours the -copyToLocal/-cat approach previously took.
   
  I found it surprising that this solution works better than -copyToLocal/-cat. 
   
  Hope this helps...
  C G
   

       

Re: Solving the "hang" problem in dfs -copyToLocal/-cat...

Posted by Ted Dunning <td...@veoh.com>.
Ooops.  Should have read the rest of your posting.  Sorry about the noise.




Re: Solving the "hang" problem in dfs -copyToLocal/-cat...

Posted by Ted Dunning <td...@veoh.com>.
It is read-only.

I started a fix to add posting, but didn't finish it.


On 2/27/08 2:59 PM, "C G" <pa...@yahoo.com> wrote:

> I think HTTP access is read-only...you'll need to continue to use
> copyFromLocalFile
>    
>   C G


RE: Solving the "hang" problem in dfs -copyToLocal/-cat...

Posted by C G <pa...@yahoo.com>.
I think HTTP access is read-only...you'll need to continue to use copyFromLocalFile
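   
If you are scripting the load, the dfs shell has an equivalent command; a minimal example (the local file name and HDFS destination here are just placeholders):
   
  bin/hadoop dfs -copyFromLocal loadfile.dat /foo/loadfile.dat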
   
  C G
  

Phillip Wu <pw...@helio.com> wrote:
  Very helpful information.

Is there any way to put files into DFS remotely, like an HTTP POST?
Or do I have to keep using copyFromLocalFile?



RE: Solving the "hang" problem in dfs -copyToLocal/-cat...

Posted by Phillip Wu <pw...@helio.com>.
Very helpful information.

Is there any way to put files into DFS remotely, like an HTTP POST?
Or do I have to keep using copyFromLocalFile?


Thanks,

Phil

mobile . 626.234.7515 . yim . heliophillip
www.helio.com

RE: Solving the "hang" problem in dfs -copyToLocal/-cat...

Posted by C G <pa...@yahoo.com>.
I haven't looked at the source code to see how -cat is implemented, but I was pretty surprised at the results as well.  When I sat down to do this experiment I figured I was wasting my time... surprisingly, I was not.
   
  C G

Joydeep Sen Sarma <js...@facebook.com> wrote:
  This is amazing ..

Wouldn't dfs -cat use the same dfs client codepath that an actual
map-reduce program would? (If so, should it also start using http client
instead? (at least for the non-local case))

Or maybe it already does?


RE: Solving the "hang" problem in dfs -copyToLocal/-cat...

Posted by Joydeep Sen Sarma <js...@facebook.com>.
This is amazing ..

Wouldn't dfs -cat use the same dfs client codepath that an actual
map-reduce program would? (If so, should it also start using http client
instead? (at least for the non-local case))

Or maybe it already does?

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, February 27, 2008 12:10 PM
To: core-user@hadoop.apache.org
Subject: Re: Solving the "hang" problem in dfs -copyToLocal/-cat...


Have you tried using http to fetch the file instead?

http://<name-node-and-port>/data/<file-path>

This will get redirected to one of the datanodes to handle and should be
pretty fast.  It would be interesting to find out if this alternative path
is subject to the same hangs that you are seeing.




Re: Solving the "hang" problem in dfs -copyToLocal/-cat...

Posted by Ted Dunning <td...@veoh.com>.
Have you tried using http to fetch the file instead?

http://<name-node-and-port>/data/<file-path>

This will get redirected to one of the datanodes to handle and should be
pretty fast.  It would be interesting to find out if this alternative path
is subject to the same hangs that you are seeing.
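
For example, something along these lines (the host and path are made up; the port is the namenode web UI port, 50070 in the script at the top of the thread):

  wget -q "http://namenode.example.com:50070/data/foo/part-00000"

wget follows the HTTP redirect to the chosen datanode automatically, so nothing special is needed on the client side.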

