Posted to common-user@hadoop.apache.org by Scott Carey <sc...@richrelevance.com> on 2010/11/10 20:32:29 UTC

Re: Read() block mysteriously when using big BytesPerChecksum size

On Oct 7, 2010, at 2:35 AM, elton sky wrote:

> Hello experts,
> 
> I was benchmarking sequential write throughput of HDFS.
> 
> To test the effect of bytesPerChecksum (bpc) size on write performance, I am
> using different bpc sizes: 2M, 256K, 32K, 4K, 512B.
> 
> My cluster has 1 name node and 5 data nodes. They are Xen VMs, each of
> them configured with a 56MB/s duplex ethernet connection.
> 
> I create a 10G file with each bpc value. When bpc is 2M, the
> throughput drops dramatically compared with the others:
> 
> time(ms): 333008  bpc: 2M
> 
> time(ms): 234180  bpc: 256K
> 
> time(ms): 223737  bpc: 32K
> 
> time(ms): 228842  bpc: 4K
> 
> time(ms): 228238  bpc: 512
> 
> After digging into the source, I found the problem happens on the data nodes,
> in org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket():
> 
> private int readNextPacket() throws IOException {
> ...
>   while (buf.remaining() < SIZE_OF_INTEGER) {
>     if (buf.position() > 0) {
>       shiftBufData();
>     }
>     readToBuf(-1); // this line takes 30ms or more for each packet before returning
>   }
> ...
>   while (toRead > 0) { // this loop also takes around 30 ms
>     toRead -= readToBuf(toRead);
>   }
> ...
> }
> 
> private int readToBuf(int toRead) throws IOException {
> ...
>   int nRead = in.read(buf.array(), buf.limit(), toRead); // this is the line that actually causes the delay
> ...
> }
> 
> The in.read() takes around 30ms waiting for data before it returns, and
> when it returns it has read only a few KB of data. The while loop that comes
> later takes a similar time to finish, reading the rest (2MB minus the few KB
> read before).
> 
> I can't understand the reason for the pause in in.read(). Why does the data
> node need to wait? Why isn't the data available at that point?

It is probably waiting on disk or network.
>  Why does this happen when using a
> big bpc?
> 

Linux tends to asynchronously 'read-ahead' from disks when sequential access to a file is detected.  The default is to read ahead in chunks of up to 128K.  You can change this at the per-device level with "blockdev --setra" (google it).
Since Hadoop fetches data in a synchronous loop, it loses the benefit of the OS asynchronous read-ahead past 128K unless you change that setting.

I recommend a readahead value of ~2MB for today's SATA drives if you need top sequential access performance from Linux.  For 2MB, that would look something like this:

# blockdev --setra 4096 /dev/sda


> Any ideas will be appreciated!


Re: Read() block mysteriously when using big BytesPerChecksum size

Posted by elton sky <el...@gmail.com>.
Thanks for the reply, Scott.

I figured out that the reason for the slowdown when bpc goes up to 2M is the
socket buffer. The default socket buffer size (both SO_SNDBUF and SO_RCVBUF) is
128K, which means any packet smaller than that fits entirely in the buffer and
the send call returns immediately; the receiver on the other end then drains
the packet from the buffer. It's effectively asynchronous.

When a packet is bigger than the 128K buffer, it is split and transferred
synchronously, which results in low throughput. That means if I increase my
socket buffers to 2MB or more, throughput should come back up.
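
A quick standalone way to see what buffer size the kernel actually grants for a 2MB request (this is just a plain java.net sketch, not the HDFS code; on Linux the request may be clamped by net.core.wmem_max / rmem_max):

```java
import java.net.ServerSocket;
import java.net.Socket;

public class SocketBufferDemo {
    public static void main(String[] args) throws Exception {
        final int twoMB = 2 * 1024 * 1024;
        try (ServerSocket server = new ServerSocket(0);  // ephemeral local port
             Socket client = new Socket()) {
            // Request 2MB buffers before connecting; the kernel may grant less.
            client.setSendBufferSize(twoMB);
            client.setReceiveBufferSize(twoMB);
            client.connect(server.getLocalSocketAddress());
            System.out.println("SO_SNDBUF granted: " + client.getSendBufferSize());
            System.out.println("SO_RCVBUF granted: " + client.getReceiveBufferSize());
        }
    }
}
```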

I'll try the blockdev setting as well, to see if it improves anything.
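
For anyone reproducing the benchmark: in the Hadoop of this era the bpc was, if I remember the key name correctly, controlled by io.bytes.per.checksum (default 512), e.g.:

```xml
<!-- core-site.xml fragment; key name from memory, check your version's
     core-default.xml before relying on it -->
<property>
  <name>io.bytes.per.checksum</name>
  <value>2097152</value> <!-- 2MB, the slow case above -->
</property>
```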

