
Posted to common-user@hadoop.apache.org by heyongqiang <he...@software.ict.ac.cn> on 2008/06/20 11:01:50 UTC

hadoop download performance when user app adopts multiple threads

An ipc.Client object is designed to be shared across threads. Each thread can only make synchronous RPC calls, meaning it issues a call and waits for a result or an error. This is implemented with a novel technique: each thread makes a distinct call (with its own call object) and then waits on that call object, which is later notified by the connection's receiver thread. To make a call, the user thread first adds its call object to the call list, which the response receiver uses later, and then synchronizes on the connection's socket output stream while waiting to write the call out. The connection's thread runs to collect responses on behalf of all user threads.
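
The call-and-notify pattern described above can be sketched in a few lines of Java. This is only an illustration of the idea, not Hadoop's actual ipc.Client code: the class names (CallSketch, Call, Connection), the in-memory queue standing in for the socket, and the "echo:" responses are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names come from Hadoop itself.
final class CallSketch {

    /** One in-flight call; the calling thread waits on this object. */
    static final class Call {
        final int id;
        final String param;
        private String value;
        private boolean done;

        Call(int id, String param) {
            this.id = id;
            this.param = param;
        }

        /** Called by the receiver thread when the response arrives. */
        synchronized void complete(String response) {
            value = response;
            done = true;
            notify(); // wake the single caller waiting on this call
        }

        /** Called by the user thread; blocks until complete() has run. */
        synchronized String await() throws InterruptedException {
            while (!done) {
                wait();
            }
            return value;
        }
    }

    /** One shared connection with a single receiver thread. */
    static final class Connection {
        private final Map<Integer, Call> calls = new ConcurrentHashMap<>();
        private final AtomicInteger nextId = new AtomicInteger();
        // Stand-in for the socket: ids put here are "answered" by the receiver.
        private final BlockingQueue<Integer> wire = new LinkedBlockingQueue<>();
        // Stand-in for the socket output stream that writers synchronize on.
        private final Object out = new Object();

        Connection() {
            Thread receiver = new Thread(this::receiveLoop, "receiver");
            receiver.setDaemon(true);
            receiver.start();
        }

        /** Collects responses on behalf of all user threads. */
        private void receiveLoop() {
            try {
                while (true) {
                    int id = wire.take();
                    Call call = calls.remove(id);
                    if (call != null) {
                        call.complete("echo:" + call.param);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Synchronous call: register, write, then wait on the call object. */
        String call(String param) {
            Call call = new Call(nextId.getAndIncrement(), param);
            calls.put(call.id, call); // register first so the receiver can find it
            try {
                synchronized (out) { // only one thread writes at a time
                    wire.put(call.id);
                }
                return call.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Connection conn = new Connection();
        Thread[] users = new Thread[4];
        for (int i = 0; i < users.length; i++) {
            String param = "req-" + i;
            users[i] = new Thread(() -> System.out.println(param + " -> " + conn.call(param)));
            users[i].start();
        }
        for (Thread t : users) {
            t.join();
        }
    }
}
```

Each user thread blocks only on its own Call object, so one receiver thread can serve any number of callers sharing the connection.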
What I have not mentioned yet is that Client also maintains a connection table.
In every Client object, a connection culler runs in the background as a daemon thread whose sole purpose is to remove idle connections from the connection table.
It seems, however, that the culler thread does not close the socket associated with a connection; it only sets a mark and calls notify. All the cleanup is handled by the connection thread itself. This is really a wonderful design! Even though the culler thread can remove the connection from the table, the connection thread also includes removal code, because there is a chance that the connection thread will encounter an exception.
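
The division of labor between the culler and the connection thread can be sketched as follows. Again, this is a hedged illustration of the behavior as I understand it, not Hadoop's code: CullerSketch, markForClose, cull, demo, and the address "host-a:9000" are hypothetical names chosen for the example, and a boolean flag stands in for closing the socket.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from Hadoop itself.
final class CullerSketch {

    static final class Connection implements Runnable {
        final String address;
        private final Map<String, Connection> table;
        private boolean shouldClose; // the mark set by the culler
        volatile boolean closed;     // stands in for "socket closed"

        Connection(String address, Map<String, Connection> table) {
            this.address = address;
            this.table = table;
        }

        /** Culler side: set the mark and notify; never touch the socket. */
        synchronized void markForClose() {
            shouldClose = true;
            notify();
        }

        /** Connection thread: waits until marked, then cleans up itself. */
        @Override
        public void run() {
            try {
                synchronized (this) {
                    while (!shouldClose) {
                        wait();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                // Runs on normal shutdown and on exceptions alike, which is
                // why the connection thread repeats the table removal.
                table.remove(address, this);
                closed = true;
            }
        }
    }

    /** One culler step: drop the entry from the table, mark, and notify. */
    static void cull(Map<String, Connection> table, String address) {
        Connection c = table.remove(address);
        if (c != null) {
            c.markForClose();
        }
    }

    /** Runs one cull cycle and returns the connection for inspection. */
    static Connection demo() {
        Map<String, Connection> table = new ConcurrentHashMap<>();
        Connection c = new Connection("host-a:9000", table);
        table.put(c.address, c);
        Thread t = new Thread(c, "connection");
        t.start();
        cull(table, c.address);
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return c;
    }

    public static void main(String[] args) {
        Connection c = demo();
        System.out.println("closed=" + c.closed);
    }
}
```

Because the removal in the finally block is conditional on the entry still being present, it is harmless when the culler has already removed the connection and still effective when the connection thread exits on its own.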

The above is a brief summary of my understanding of Hadoop's IPC code.
Below is the result of a test that measures Hadoop's data throughput:
+--------------+------------------+
| threadCounts | avg(averageRate) |
+--------------+------------------+
|            1 |   53030539.48913 |
|            2 |  35325499.583756 |
|            3 |  24998284.969072 |
|            4 |   19824934.28125 |
|            5 |  15956391.489583 |
|            6 |  15948640.175532 |
|            7 |  14623977.375691 |
|            8 |  16098080.160131 |
|            9 |  8967970.3877005 |
|           10 |  14569087.178947 |
|           11 |  8962683.6662088 |
|           12 |  20063735.297872 |
|           13 |  13174481.053977 |
|           14 |  10137907.034188 |
|           15 |  6464513.2013889 |
|           16 |   23064338.76087 |
|           17 |   18688537.44385 |
|           18 |  18270909.854317 |
|           19 |  13086261.536538 |
|           20 |  10784059.367347 |
+--------------+------------------+

The first column is the thread count of my test application, and the second column is the average download rate. The rate seems to drop sharply as the thread count increases.
This is a very simple test application. Can anyone tell me why? Where is the bottleneck when a user application uses multiple threads?




heyongqiang
2008-06-20

Re: Re: hadoop download performance when user app adopts multiple threads

Posted by heyongqiang <he...@software.ict.ac.cn>.

Actually, this test result is a good one; I simply misread it. My mistake.
The second column is actually the average download rate per thread. The test in my post was run on one node; we also ran tests simultaneously on multiple nodes, and the performance results seem acceptable to us.
What you said is right, but this overhead (seek time and I/O consumption) does not seem easy to optimize.
Thank you for your attention.
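
To make the per-thread reading of the table concrete, here is a small Java sketch that multiplies each per-thread average by its thread count. It assumes the threads ran concurrently, so that the aggregate rate is roughly threads times the per-thread average; the class and method names are made up for the example, and the rates are copied from the table in the first message (the unit is not stated there, so none is assumed here).

```java
// Hypothetical helper; the figures are the per-thread averages from the table.
final class AggregateRate {

    // Index i holds the average rate per thread for a run with i + 1 threads.
    static final double[] PER_THREAD = {
        53030539.48913, 35325499.583756, 24998284.969072, 19824934.28125,
        15956391.489583, 15948640.175532, 14623977.375691, 16098080.160131,
        8967970.3877005, 14569087.178947, 8962683.6662088, 20063735.297872,
        13174481.053977, 10137907.034188, 6464513.2013889, 23064338.76087,
        18688537.44385, 18270909.854317, 13086261.536538, 10784059.367347
    };

    /** Estimated total rate, assuming all threads download at the same time. */
    static double aggregate(int threads) {
        return threads * PER_THREAD[threads - 1];
    }

    public static void main(String[] args) {
        for (int t = 1; t <= PER_THREAD.length; t++) {
            System.out.printf("%2d threads: per-thread %.0f, aggregate %.0f%n",
                    t, PER_THREAD[t - 1], aggregate(t));
        }
    }
}
```

Under that assumption, two threads together move about 70.65 million units per second against 53.03 million for one thread, and every multi-threaded row in the table has a higher estimated aggregate than the single-threaded run, even though the per-thread rate falls.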



Best regards,
 
Yongqiang He
2008-07-09

Email: heyongqiang@software.ict.ac.cn
Tel:   86-10-62600966(O)
 
Research Center for Grid and Service Computing,
Institute of Computing Technology, 
Chinese Academy of Sciences
P.O.Box 2704, 100080, Beijing, China 



From: Samuel Guo
Sent: 2008-07-09 09:47:32
To: core-user@hadoop.apache.org
Cc: 
Subject: Re: hadoop download performance when user app adopts multiple threads

heyongqiang wrote:
> The first column is the thread count of my test application, and the second column is the average download rate. The rate seems to drop sharply as the thread count increases.
> This is a very simple test application. Can anyone tell me why? Where is the bottleneck when a user application uses multiple threads?
>
>   

As you know, a block of an HDFS file is stored as a file in the
local filesystem of a datanode.
When different threads read different HDFS files, or different blocks of
the same HDFS file, the result may be a burst of read requests against
different local files (blocks of HDFS files) on a single datanode. Disk
seek time and I/O load then become heavy, and the response time grows
longer.
But this is only the local behavior of a single datanode. The overall
throughput of the Hadoop cluster will still be good.

So, can you supply any information about your test?
> heyongqiang
> 2008-06-20
>
>

Re: hadoop download performance when user app adopts multiple threads

Posted by Samuel Guo <gu...@gmail.com>.

heyongqiang wrote:
> The first column is the thread count of my test application, and the second column is the average download rate. The rate seems to drop sharply as the thread count increases.
> This is a very simple test application. Can anyone tell me why? Where is the bottleneck when a user application uses multiple threads?
>
>   

As you know, a block of an HDFS file is stored as a file in the
local filesystem of a datanode.
When different threads read different HDFS files, or different blocks of
the same HDFS file, the result may be a burst of read requests against
different local files (blocks of HDFS files) on a single datanode. Disk
seek time and I/O load then become heavy, and the response time grows
longer.
But this is only the local behavior of a single datanode. The overall
throughput of the Hadoop cluster will still be good.

So, can you supply any information about your test?
> heyongqiang
> 2008-06-20
>
>