
Posted to common-user@hadoop.apache.org by heyongqiang <he...@software.ict.ac.cn> on 2008/06/20 11:00:47 UTC

understanding of client connection code

An ipc.Client object is designed to be shared across threads, and each thread can only make synchronous RPC calls, which means a thread makes a call and then waits for a result or an error. This is implemented with a novel technique: each thread makes a distinct call (with its own call object), and the user thread then waits on its call object, which is later notified by the connection's receiver thread. To make a call, the user thread first adds its call object to the call list, which the response receiver uses later, and then synchronizes on the connection's socket output stream while waiting to write its call out. The connection's thread runs to collect responses on behalf of all user threads.
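
To make the call/notify hand-off concrete, here is a minimal Java sketch of the pattern. It is not Hadoop's actual ipc.Client code: SketchCall, SketchConnection, sendLock and the in-memory wire queue are made-up names, and the "server" is simulated by echoing each request straight back. It only shows the shape: one call object per request, the caller waiting on its own call, and a single receiver thread notifying the right caller.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for one outstanding RPC (not Hadoop's actual class).
class SketchCall {
    final int id;
    private String value;   // response payload, set by the receiver thread
    private boolean done;

    SketchCall(int id) { this.id = id; }

    // Receiver thread: store the result and wake the caller waiting on this call.
    synchronized void complete(String v) {
        value = v;
        done = true;
        notify();
    }

    // User thread: block on this call object until the receiver completes it.
    synchronized String await() {
        try {
            while (!done) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return value;
    }
}

// Hypothetical connection shared by many caller threads.
class SketchConnection {
    private final Map<Integer, SketchCall> calls = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();
    private final Object sendLock = new Object();  // plays the role of the output stream lock
    // Simulated wire: each request is echoed straight back as "id:payload".
    private final BlockingQueue<String> wire = new LinkedBlockingQueue<>();

    SketchConnection() {
        Thread receiver = new Thread(this::receiveLoop, "sketch-receiver");
        receiver.setDaemon(true);
        receiver.start();
    }

    String call(String request) {
        SketchCall call = new SketchCall(nextId.getAndIncrement());
        calls.put(call.id, call);          // register first so the receiver can find it
        synchronized (sendLock) {          // only one thread writes at a time
            wire.add(call.id + ":echo:" + request);
        }
        return call.await();               // wait on our own call object
    }

    // Single receiver thread: reads every response and notifies the matching call.
    private void receiveLoop() {
        try {
            while (true) {
                String response = wire.take();
                int sep = response.indexOf(':');
                SketchCall call = calls.remove(Integer.parseInt(response.substring(0, sep)));
                if (call != null) {
                    call.complete(response.substring(sep + 1));
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // exit quietly when interrupted
        }
    }
}
```

Any number of threads can call conn.call("x") on one SketchConnection; each blocks only on its own SketchCall, so a slow response for one caller does not hold up the wake-up of another.
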
What I have not mentioned yet is that Client actually maintains a connection table.
In every Client object, a connection culler runs in the background as a daemon thread whose sole purpose is to remove idle connections from the connection table.
However, it seems that this culler thread does not close the socket associated with the connection; it only sets a mark and does a notify. All the cleanup is handled by the connection thread itself. This is really a wonderful design! Even though the culler thread can cull the connection from the table, the connection thread also includes removal code. That's because there is a chance that the connection thread will encounter an exception.
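
The "mark and notify" division of labour can be sketched as follows. This is only an illustration under my own assumptions: IdleConnection, Culler, cullOnce and the shouldClose/closed flags are hypothetical names, timestamps are passed in by hand, and no real socket is involved. The point is that the culler only removes the entry and signals, while the connection's own thread does the cleanup (and also removes itself from the table, in case it gets there first).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical connection; names are illustrative, not Hadoop's.
class IdleConnection implements Runnable {
    final String key;
    private final Map<String, IdleConnection> table;
    private final long lastActivity;   // time of last use, in ms
    private boolean shouldClose;       // the culler's "mark"
    private boolean closed;            // set by the connection thread's own cleanup

    IdleConnection(String key, Map<String, IdleConnection> table, long lastActivity) {
        this.key = key;
        this.table = table;
        this.lastActivity = lastActivity;
    }

    boolean isIdle(long now, long maxIdle) {
        return now - lastActivity >= maxIdle;
    }

    // Culler side: only mark and notify; nothing is closed here.
    synchronized void markForClose() {
        shouldClose = true;
        notifyAll();
    }

    synchronized boolean isClosed() {
        return closed;
    }

    // Connection thread: wait until marked, then clean up after ourselves.
    @Override
    public void run() {
        synchronized (this) {
            try {
                while (!shouldClose) {
                    wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // treat as a failure path; still clean up
            }
        }
        cleanup();
    }

    private void cleanup() {
        table.remove(key, this);   // harmless if the culler already removed us
        synchronized (this) {
            closed = true;         // a real connection would close its socket here
        }
    }
}

class Culler {
    // One pass of the culler: drop idle connections from the table and signal them.
    static int cullOnce(Map<String, IdleConnection> table, long now, long maxIdle) {
        int culled = 0;
        for (IdleConnection c : table.values()) {
            if (c.isIdle(now, maxIdle) && table.remove(c.key, c)) {
                c.markForClose();
                culled++;
            }
        }
        return culled;
    }
}
```

In a real client each IdleConnection would run on its own thread and a daemon thread would call something like cullOnce periodically; both are left out here to keep the sketch short.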

The above is a brief summary of my understanding of Hadoop's IPC code.
Below is the result of a test of Hadoop's data throughput:
+--------------+------------------+
| threadCounts | avg(averageRate) |
+--------------+------------------+
|            1 |   53030539.48913 |
|            2 |  35325499.583756 |
|            3 |  24998284.969072 |
|            4 |   19824934.28125 |
|            5 |  15956391.489583 |
|            6 |  15948640.175532 |
|            7 |  14623977.375691 |
|            8 |  16098080.160131 |
|            9 |  8967970.3877005 |
|           10 |  14569087.178947 |
|           11 |  8962683.6662088 |
|           12 |  20063735.297872 |
|           13 |  13174481.053977 |
|           14 |  10137907.034188 |
|           15 |  6464513.2013889 |
|           16 |   23064338.76087 |
|           17 |   18688537.44385 |
|           18 |  18270909.854317 |
|           19 |  13086261.536538 |
|           20 |  10784059.367347 |
+--------------+------------------+

The first column is the number of threads in my test application, and the second column is the average download rate. It seems the rate drops sharply as the thread count increases.
This is a very simple test application. Can anyone tell me why? Where is the bottleneck when a user application uses multiple threads?




heyongqiang
2008-06-20

Re: Re: understanding of client connection code

Posted by heyongqiang <he...@software.ict.ac.cn>.

hehe

    I noticed that in DFSClient's DataStreamer thread, the run method sends data out while synchronized on the dataQueue. Is this really needed?
I mean, the remove, wait, and getFirst calls on the dataQueue variable should be synchronized on dataQueue, but is it necessary to hold the lock while sending a packet out?
I doubt it. Can any developer give me a reason for doing that?
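
To show what I mean, here is a small sketch of the alternative: hold the queue lock only while waiting for and dequeuing a packet, and do the send outside it. This is not the DFSClient code. PacketStreamer, enqueue, close and send are hypothetical names, and the "socket" is just a list. The real DataStreamer may well have reasons for holding the lock that this sketch ignores, for example keeping the queue state consistent with what has actually been written when a send fails.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical streamer; a sketch of the locking question, not DFSClient's code.
class PacketStreamer {
    private final LinkedList<byte[]> dataQueue = new LinkedList<>();
    private final List<byte[]> sent = new ArrayList<>();  // stands in for the socket; used only by run()
    private boolean closed;                                // guarded by dataQueue

    // Producer side: add a packet and wake the streamer.
    void enqueue(byte[] packet) {
        synchronized (dataQueue) {
            dataQueue.addLast(packet);
            dataQueue.notifyAll();
        }
    }

    // Ask the streamer to stop once the queue has been drained.
    void close() {
        synchronized (dataQueue) {
            closed = true;
            dataQueue.notifyAll();
        }
    }

    // Streamer loop: returns when closed and the queue is empty.
    void run() {
        while (true) {
            byte[] packet;
            synchronized (dataQueue) {
                while (dataQueue.isEmpty() && !closed) {
                    try {
                        dataQueue.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                if (dataQueue.isEmpty()) {
                    return;                        // closed and nothing left to send
                }
                packet = dataQueue.removeFirst();  // queue access stays under the lock
            }
            send(packet);                          // the slow I/O happens without the lock
        }
    }

    private void send(byte[] packet) {
        sent.add(packet);
    }

    int sentCount() {
        return sent.size();
    }
}
```

With this shape, a producer calling enqueue never has to wait for a network write to finish, only for the short dequeue step.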




heyongqiang
2008-06-23



From: hong
Sent: 2008-06-21 10:10:59
To: core-user@hadoop.apache.org
Cc:
Subject: Re: understanding of client connection code

Brother, are you from Yu Haiyan's (余海燕) team?


Re: understanding of client connection code

Posted by hong <mi...@163.com>.

Brother, are you from Yu Haiyan's (余海燕) team?


Re: Re: hadoop download performace when user app adopt multi-thread

Posted by heyongqiang <he...@software.ict.ac.cn>.

Actually, this test result is a good one; I just misunderstood it. My mistake.
The second column is actually the average download rate per thread. This test was run on one node; we also ran the test simultaneously on multiple nodes, and the performance results seem acceptable to us.
What you said is right, but this overhead (seek time and I/O consumption) does not seem easy to optimize.
Thank you for your attention.
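
For anyone reading the table from the original post, a rough way to turn the per-thread averages into an aggregate figure is to multiply by the thread count. The sketch below does just that. It assumes all threads were downloading concurrently for the whole run, which may not hold exactly, and the units are whatever the test application reported (not stated in this thread). AggregateRate and aggregate are names made up for this example.

```java
// Rough aggregate-throughput estimate from the per-thread averages in the table.
class AggregateRate {
    // avg(averageRate) for threadCounts 1..20, copied from the original post.
    static final double[] PER_THREAD = {
        53030539.48913, 35325499.583756, 24998284.969072, 19824934.28125,
        15956391.489583, 15948640.175532, 14623977.375691, 16098080.160131,
        8967970.3877005, 14569087.178947, 8962683.6662088, 20063735.297872,
        13174481.053977, 10137907.034188, 6464513.2013889, 23064338.76087,
        18688537.44385, 18270909.854317, 13086261.536538, 10784059.367347
    };

    // Estimate: thread count times per-thread average (assumes full concurrency).
    static double aggregate(int threads) {
        return threads * PER_THREAD[threads - 1];
    }

    public static void main(String[] args) {
        for (int t = 1; t <= PER_THREAD.length; t++) {
            System.out.printf("%2d threads: %.1f per thread, %.1f aggregate%n",
                    t, PER_THREAD[t - 1], aggregate(t));
        }
    }
}
```

Read this way, two threads give an estimated aggregate of about 70.7 million against 53.0 million for one thread, so the total goes up even though each thread's own rate goes down.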



Best regards,
 
Yongqiang He
2008-07-09

Email: heyongqiang@software.ict.ac.cn
Tel:   86-10-62600966(O)
 
Research Center for Grid and Service Computing,
Institute of Computing Technology, 
Chinese Academy of Sciences
P.O.Box 2704, 100080, Beijing, China 



From: Samuel Guo
Sent: 2008-07-09 09:47:32
To: core-user@hadoop.apache.org
Cc:
Subject: Re: hadoop download performace when user app adopt multi-thread


As you know, a block of a file in HDFS is stored as a file in the local
filesystem of a datanode.
When different threads read different files in HDFS, or different blocks
of the same HDFS file, the result may be a burst of read requests against
different local files (blocks of HDFS files) on a single datanode. The
disk seek time and I/O load then become heavy, and the response time
gets longer.
But this is just the local behavior of a single datanode. The overall
throughput of the Hadoop cluster will still be good.

So, can you supply any information about your test?

Re: hadoop download performace when user app adopt multi-thread

Posted by Samuel Guo <gu...@gmail.com>.


As you know, a block of a file in HDFS is stored as a file in the local
filesystem of a datanode.
When different threads read different files in HDFS, or different blocks
of the same HDFS file, the result may be a burst of read requests against
different local files (blocks of HDFS files) on a single datanode. The
disk seek time and I/O load then become heavy, and the response time
gets longer.
But this is just the local behavior of a single datanode. The overall
throughput of the Hadoop cluster will still be good.

So, can you supply any information about your test?

hadoop download performace when user app adopt multi-thread

Posted by heyongqiang <he...@software.ict.ac.cn>.
