Posted to common-user@hadoop.apache.org by Null Ecksor <nu...@gmail.com> on 2010/04/10 20:03:02 UTC

copying file into hdfs

Hi,

I'm Mike,
I am a new user of Hadoop. Currently, I have a cluster of 8 machines and a
file of about 2 GB.
When I load it into HDFS using the command
hadoop dfs -put /a.dat /data
it ends up on all of the data nodes: dfsadmin -report shows HDFS usage of
16 GB, and it takes about 2 hours to load the file.

With 1 node, my MapReduce job on this data took 150 seconds.

When I run the same job on this 8-node cluster, it takes 220 seconds for the
same file.

Can someone please tell me how to distribute this file over the 8 nodes, so
that each of them holds roughly 300 MB of the file and the MapReduce job I
have written runs in parallel? Isn't a Hadoop cluster supposed to work in
parallel?

best.
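
A quick way to see what actually happened is to ask HDFS itself. The commands
below are only a sketch (the path /data/a.dat is an assumption based on the
-put command above): fsck reports the block size, the replication factor, and
which data nodes hold each block, while dfsadmin -report shows how much DFS
space each node is using.

# Block size, replication ("repl=" on each block line) and block placement
hadoop fsck /data/a.dat -files -blocks -locations

# Per-datanode view of configured capacity and DFS space used
hadoop dfsadmin -report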

Re: copying file into hdfs

Posted by James Seigel <ja...@tynt.com>.
Maybe copy your HDFS config here and we can see why it took up 16 GB
of space.

Cheers

Sent from my mobile. Please excuse the typos.
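
For reference, 16 GB of HDFS usage for a 2 GB file is exactly what a
replication factor of 8 would produce, so dfs.replication is the first
property worth checking, with dfs.block.size right behind it. A minimal
hdfs-site.xml sketch with commonly used values (illustrative values and
0.20-era property names, not the original poster's actual settings):

<!-- hdfs-site.xml: example values only -->
<property>
  <name>dfs.replication</name>
  <value>3</value>          <!-- a value of 8 here would turn 2 GB into 16 GB -->
</property>
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>  <!-- 128 MB, in bytes -->
</property>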

RE: copying file into hdfs

Posted by Michael Segel <mi...@hotmail.com>.

Mike,

First, you need to see what you set your block size to in Hadoop. By default it's 64 MB. With large files, you may want to bump that up to 128 MB per block.
At 128 MB per block, a 2 GB file works out to 16 blocks, so roughly 16 map tasks.
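
If changing the cluster-wide setting is not an option, the block size can
also be supplied per copy; something along these lines should work with a
0.20-era shell, since the fs commands accept the generic -D option (the
target path is only an example):

# Write the file with 128 MB blocks for this copy only
# (dfs.block.size is given in bytes: 134217728 = 128 * 1024 * 1024)
hadoop fs -D dfs.block.size=134217728 -copyFromLocal /a.dat /data/a.dat

Note that a block-size change only affects files written afterwards; a copy
already sitting in HDFS keeps the block size it was written with unless it
is re-uploaded.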

I'd use hadoop fs -copyFromLocal <local file name> <hdfs file name>.

(Ok, I'm going from memory on the hadoop command, but you can always do a hadoop help to see the command.)

Also you need to see what you set for your replication factor. Usually it's 3.
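
Changing dfs.replication in the config likewise only affects files written
afterwards, so if the existing copy went in with a higher factor it has to
be adjusted explicitly. A sketch, assuming the file sits at /data/a.dat:

# Re-replicate the existing file down to 3 copies and wait for completion
hadoop fs -setrep -w 3 /data/a.dat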

Then your 2 GB file will take up roughly 6 GB in HDFS, and its blocks should be spread fairly evenly across all of the nodes.
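
Working those numbers through, assuming 128 MB blocks and a replication
factor of 3: 2 GB x 3 replicas = 6 GB of raw HDFS usage, and 2048 MB / 128 MB
= 16 blocks, i.e. 48 block replicas in total, or about 6 per machine across
the 8 nodes.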

HTH

-Mike
