Posted to user@nutch.apache.org by Volos Stavros <st...@epfl.ch> on 2011/06/15 11:00:25 UTC

Multiple nutch processes in the same node

Hi,

I am trying to utilize a 12-core machine with 24 GB of memory when performing search queries. I observed that throughput does
not scale linearly beyond 6 cores, so I am trying to use two nutch processes instead of one. Although I map each process onto a different
set of cores, I still cannot fully utilize the cores.

I would like to ask whether it is straightforward to run two nutch processes on the same node.

When running the following commands on two separate nodes, each process utilizes 4 cores, so the distributed version of
nutch runs pretty ok.

$ taskset -c 0,2,4,6 bin/nutch server 8890
$ taskset -c 1,3,5,7 bin/nutch server 8891

When running the two commands on the same node, I observed an 8% IO-wait. IPtraf shows that the network is not saturated, so my
understanding is that I am IO-bound. Each process uses a 4 GB dataset. I would expect the datasets to be held in the
disk cache, but it seems they are not.
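
An IO-wait percentage like the one above can be cross-checked from two samples of the aggregate "cpu" line in /proc/stat on Linux. This is only a sketch under that assumption: iowait_pct is a hypothetical helper name, and the one-second sampling interval is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: given two "cpu ..." lines from /proc/stat,
# print the share of elapsed CPU time spent in iowait, in percent.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2-9: user nice system idle iowait irq softirq steal.
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6            # field 6 is the iowait counter
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
            else print "0.0"
        }'
}

# Sample the aggregate CPU line twice, one second apart (Linux only).
if [ -r /proc/stat ]; then
    first=$(grep '^cpu ' /proc/stat)
    sleep 1
    second=$(grep '^cpu ' /proc/stat)
    echo "iowait over the interval: $(iowait_pct "$first" "$second")%"
fi
```

The figure is averaged over all cores, so a low machine-wide percentage can still hide a few cores that spend much of their time waiting on the disk.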

Any thoughts on what may be causing the problem I am observing?

Thanks in advance, 
Stavros.