Posted to hdfs-user@hadoop.apache.org by Sean <se...@hotmail.com> on 2010/04/30 01:03:19 UTC

HDFS throughput question (write in batch vs. write single log entry)

Hi, I am using HDFS as the storage layer behind my Scribe server.

 

log-entry           (write directly)
--------->scribe -------------> HDFS

 

Right now, my Scribe server writes directly to HDFS, which means each of my write operations is a small chunk of data, and I am seeing very low throughput on a 4-data-node cluster.

So I am wondering whether HDFS is just not built for this type of write operation and is instead designed for 'bulk' writes. If that is the reason for my low throughput, I'd have my Scribe server write to local disk first and then write the finished local file to HDFS.
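For concreteness, here is a rough sketch of the batching approach I have in mind (just an illustration using the standard Hadoop FileSystem Java API; the class name, paths, and the single sample entry are placeholders, not what I actually run):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class BatchedHdfsUpload {
    public static void main(String[] args) throws IOException {
        // Step 1: buffer log entries into a local file instead of writing
        // each entry straight to HDFS.
        File local = new File("/tmp/scribe-batch.log");
        BufferedWriter out = new BufferedWriter(new FileWriter(local, true));
        out.write("one log entry\n");   // Scribe would append many entries here
        out.close();

        // Step 2: once the local file has grown large enough, push it to
        // HDFS in a single bulk write.
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path(local.getAbsolutePath()),
                             new Path("/logs/scribe/scribe-batch.log"));
        fs.close();
    }
}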

 

Any suggestions?

 

Thanks,

Sean
 		 	   		  

Re: HDFS throughput question (write in batch vs. write single log entry)

Posted by st...@yahoo.com.
HDFS is not designed for small chunks of data. The default block/chunk size is 64 MB.

So to make your proposed solution work, the local log files you batch up and upload should be pretty big as well.
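A rough sketch (untested) of checking that block size through the FileSystem API, so you know roughly how big to let each local file grow before shipping it to HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Ask the cluster for its configured default block size
// (64 MB unless overridden in hdfs-site.xml) and print it.
public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println("Default block size: " + fs.getDefaultBlockSize() + " bytes");
        fs.close();
    }
}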

I have the same issue - terabytes of 300 KB files - and plan on using HBase for storage. Should work. We'll see...

Take care,
 -stu
Sent from my Verizon Wireless BlackBerry


RE: HDFS throughput question (write in batch vs. write single log entry)

Posted by Zheng Shao <zs...@facebook.com>.
We use a similar architecture here and are not seeing any throughput problems.
Can you elaborate? What throughput are you seeing? Did you run "iostat -kx 2" and "top" on the data nodes?

Zheng