Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/11/10 21:03:19 UTC

threading versus nio

I've recently had the opportunity to experiment with Nutch on a
200-machine cluster running Linux 2.4 kernels.  It works well, but a
larger cluster might have problems using a 2.4 kernel.

The NDFS master (namenode) has a thread per connection.  Every task,
tasktracker and datanode process keeps a connection open to the
namenode.  Linux 2.4 kernels are limited to around 800 threads per
process.  With 200 boxes this means we can run two tasks per box at a
time (2 tasks, 1 tasktracker, and 1 datanode = 4 connections per box).
With 300 boxes, we might not even be able to run one task per box, since
that would result in 900 connections (1 task, 1 tasktracker, and 1
datanode per box).
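
The arithmetic above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, 800 is the approximate limit quoted above, and any other threads the namenode needs for itself are ignored.

```java
// Back-of-the-envelope check of the thread budget described above.
// The class and method names are illustrative, not part of Nutch.
class NamenodeThreadBudget {

    // Each box holds one connection per task, plus one for its
    // tasktracker and one for its datanode.
    static int connections(int boxes, int tasksPerBox) {
        return boxes * (tasksPerBox + 2);
    }

    // Largest number of tasks per box that keeps the namenode at or
    // under threadLimit threads, assuming one thread per connection.
    static int maxTasksPerBox(int boxes, int threadLimit) {
        return Math.max(0, threadLimit / boxes - 2);
    }

    public static void main(String[] args) {
        int limit = 800; // approximate per-process limit on a 2.4 kernel
        System.out.println(connections(200, 2));        // 800
        System.out.println(maxTasksPerBox(200, limit)); // 2
        System.out.println(connections(300, 1));        // 900
        System.out.println(maxTasksPerBox(300, limit)); // 0
    }
}
```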

Last week I looked into rewriting org.apache.nutch.ipc.Server to use a
single nio-based listener thread and a fixed number of worker threads.
I got it working, but it was slower and considerably more complex.  The
complexity arises because Nutch's object i/o is built on blocking,
stream-based i/o.

If I were to work more on it, here's how I might do it:

   . associate a pipe with each connection
   . keep a queue of connections with new requests
   . keep a set of connections with requests in progress

   . listener loops, selecting connections for input
     for all connections with new input
       if the connection is in requests in progress set
         read the input and write to its pipe, potentially blocking
         (but not for long, since someone is reading it)
       else
         remove the connection from the selector
         queue the connection

   . worker threads loop
       pop a connection off the queue
       add it to the selector
       add it to the requests in progress set
       loop
         read a request from the pipe (potentially blocking)
         after request, if no more input is available in pipe
           remove it from the requests in progress set
           break
         compute the response
         write the response to the connection
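
A minimal sketch of the worker's inner loop only, not of the listener or the selector handling. It assumes requests are newline-terminated ASCII text, which is an invention for this example and not Nutch's wire format, and it answers the last request before leaving the loop, which is one reading of the outline above. `PipeWorkerSketch`, `readRequest`, `serveUntilIdle` and `demo` are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of the worker side only; the request format is an assumption.
class PipeWorkerSketch {

    // Blocking, stream-based parse: there is no length prefix, so the
    // only way to find the end of a request is to keep reading.
    static String readRequest(InputStream pipe) throws IOException {
        StringBuilder request = new StringBuilder();
        int b;
        while ((b = pipe.read()) != '\n') {
            if (b < 0) {
                throw new EOFException("pipe closed in the middle of a request");
            }
            request.append((char) b);
        }
        return request.toString();
    }

    // Serve requests from one connection's pipe until no more input is
    // buffered, then return so the worker can take another connection.
    static List<String> serveUntilIdle(InputStream pipe) throws IOException {
        List<String> responses = new ArrayList<>();
        boolean moreInput = true;
        while (moreInput) {
            String request = readRequest(pipe);
            moreInput = pipe.available() > 0;   // checked right after the request
            responses.add("echo:" + request);   // stand-in for computing the response
        }
        return responses;
    }

    // Writes burst into a fresh pipe in one go, as if a single socket
    // read had delivered it. burst must end with '\n' or the read blocks.
    static List<String> demo(String burst) {
        try {
            PipedOutputStream listenerSide = new PipedOutputStream();
            PipedInputStream workerSide = new PipedInputStream(listenerSide);
            listenerSide.write(burst.getBytes(StandardCharsets.US_ASCII));
            return serveUntilIdle(workerSide);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("request-1\nrequest-2\n")); // [echo:request-1, echo:request-2]
        System.out.println(demo("request-3\n"));            // [echo:request-3]
    }
}
```

When two requests arrive in one burst, the same worker handles both before it lets go of the connection. A client that never pauses would keep that worker busy indefinitely.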

This may not always be fair.  If a given client sends requests without
pause then there is the possibility that this client can starve other
clients.  In practice I don't think this would be a problem.  But I
don't see how to avoid it, since socket read boundaries may not
correspond to request boundaries.

I'm not sure this is worth working on any further, since 2.6 kernels
can easily handle 10,000 or more threads.

Doug

Re: threading versus nio

Posted by Stefan Groschupf <sg...@media-style.com>.
> I've looked at it, but I don't think it solves the problem.  We  
> need stream-based handling with thousands of connections.  Mina's  
> stream io handler requires a separate thread per connection:
>
Hmm, too bad. :-/
I understand that IPC is one of the most critical components of Nutch,
and it would have been too good to outsource this. :-)

Stefan 

Re: threading versus nio

Posted by Earl Cahill <ca...@yahoo.com>.
Not sure if it will help, but I think memcached uses epoll

http://www.xmailserver.org/linux-patches/epoll.txt

and handles potentially thousands of concurrent connections.

Earl

Re: threading versus nio

Posted by Johannes Zillmann <jz...@yahoo.de>.
Alright, I think I have a clue now...
Anyway, if you do decide to use nio, I can only
advise using Mina.
I think it's easy to use, and a lot of people who have
experience with nio say Mina greatly simplifies things.
And you should be able to easily use the
StreamIoHandler in the way you want, so you won't
need a thread per handler.

greetings
Johannes

Re: threading versus nio

Posted by Doug Cutting <cu...@nutch.org>.
Johannes Zillmann wrote:
> Please correct me if I'm wrong, but if I understood
> correctly there are 2 choices...
> (1) message based communication
> (2) stream based communication
> 
> In case of (2) you won't get by without one thread
> per connection.

In general, you are correct.  But Nutch's IPC is somewhere between. 
Requests and responses are discrete messages, but of potentially 
unlimited size, and not length-prefixed.  So streams and threads are 
required to parse a request.  But we don't expect individual clients to 
be hammering the server with repeated requests, rather with periodic 
requests every second or so.  When a connection is between requests it 
needs no thread.  Thus a thread gets input on a connection until a 
request is parsed, then disassociates itself from that connection. 
Things only get complicated if multiple requests from a client are 
merged into a single i/o event.  This is unlikely, but when it occurs 
that thread is forced to immediately handle another request from the 
same client, which could result in the starving of other clients.  But 
as long as most individual clients don't make requests too quickly this 
shouldn't be a problem.
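
For contrast, here is a sketch of what a length prefix would buy. It assumes a hypothetical framing, which is NOT what Nutch's IPC does: each request is a 4-byte big-endian length followed by that many bytes of UTF-8 payload. With that framing a decoder can tell from the bytes already received whether a whole request is present, so it never has to block in the middle of one. `LengthPrefixedSketch`, `tryDecode` and `frames` are made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical framing for illustration only: 4-byte length, then payload.
class LengthPrefixedSketch {

    // Returns the next complete request in buf (which must be in read
    // mode), or null if the bytes received so far do not hold one yet.
    // Never blocks, and consumes nothing when it returns null.
    static String tryDecode(ByteBuffer buf) {
        if (buf.remaining() < 4) {
            return null;                          // length not complete yet
        }
        int length = buf.getInt(buf.position());  // absolute get: peek, do not consume
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        if (buf.remaining() - 4 < length) {
            return null;                          // payload not complete yet
        }
        buf.getInt();                             // now consume the length
        byte[] payload = new byte[length];
        buf.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    // Encodes requests back to back, as a client sending without pause would.
    static ByteBuffer frames(String... requests) {
        int size = 0;
        for (String r : requests) {
            size += 4 + r.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (String r : requests) {
            byte[] payload = r.getBytes(StandardCharsets.UTF_8);
            buf.putInt(payload.length).put(payload);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = frames("ping", "status");
        System.out.println(tryDecode(buf)); // ping
        System.out.println(tryDecode(buf)); // status
        System.out.println(tryDecode(buf)); // null
    }
}
```

Without the prefix there is no such test, which is why a thread has to sit in a blocking stream read until the request has been parsed.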

But I still think I'm most likely to just use kernels and JVMs that 
efficiently handle lots of threads and avoid the complexity of async nio.

Doug

Re: threading versus nio

Posted by Johannes Zillmann <jz...@yahoo.de>.
Well,

I understood that. But the thing is...
There is the "message" or, better, "event" based
communication paradigm and there is the "thread" based
communication paradigm.
Is there an alternative to those two possibilities?
I at least don't see one in what Doug described.
Either you have one thread per pipe, and since there
should be one pipe for each connection you end up with
one thread per connection.
Or you have one thread, or several threads, for a set of
pipes, but in that case it would be just another
event based solution in disguise.
... from what I know.

wdyt?
Johannes

Re: threading versus nio

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Johannes,
right, but suppose you have 200 boxes and each box needs to open 4
different connections to the master.
Then the master has 200 * 4 connections = 800 threads = the limit of
the 2.4 kernel.
Even if you open only one connection per box, you are still limited to
running 800 boxes per master.
Since map reduce is designed to involve thousands of boxes, this
would be a limitation. (Google actually has more than 150,000 :-O
boxes, as I read in an interesting book.)

Stefan

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: threading versus nio

Posted by Johannes Zillmann <jz...@yahoo.de>.
Hello Doug,

--- Doug Cutting <cu...@nutch.org> wrote:
>  Mina's stream io 
> handler requires a separate thread per connection:

Please correct me if I'm wrong, but if I understood
correctly there are 2 choices...
(1) message based communication
(2) stream based communication

In case of (2) you won't get by without one thread
per connection.
To cite "read the input and write to its pipe" from
your initial description of further progress...
Either writing data to the pipe will also immediately
handle the data (->(1)) or a thread is waiting on the
other side of the pipe and reading/handling all incoming
data (->(2)).

best regards
Johannes

Re: threading versus nio

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> Do you know Apache Mina?
> http://directory.apache.org/subprojects/network/
> This is a nice introduction:
> http://directory.apache.org/subprojects/network/mina.pdf

I've looked at it, but I don't think it solves the problem.  We need 
stream-based handling with thousands of connections.  Mina's stream io 
handler requires a separate thread per connection:

http://directory.apache.org/subprojects/network/apidocs/org/apache/mina/handler/StreamIoHandler.html

So mina doesn't appear to solve this problem.

> I know you are trying to stay as independent as possible from
> third party libraries, but Mina looks very interesting from my
> point of view.

I'm not afraid of 3rd party libraries.  Mina did not exist when I wrote 
Nutch's IPC code.  If someone would like to re-write this using Mina, 
and it uses less code and runs as fast, then I'd be happy to use it.

Doug

Re: threading versus nio

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Doug,
do you know Apache Mina?
http://directory.apache.org/subprojects/network/
This is a nice introduction:
http://directory.apache.org/subprojects/network/mina.pdf

I know you are trying to stay as independent as possible from
third party libraries, but Mina looks very interesting from my
point of view.

Stefan