Posted to httpclient-users@hc.apache.org by Yang Sun <ys...@ist.psu.edu> on 2007/09/13 23:53:40 UTC
problem with Multithreaded crawler using httpclient
Hi,
I am implementing a multithreaded crawler using HttpClient. The fetching
tasks are managed by a ThreadPoolExecutor.
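For reference, here is a minimal, self-contained sketch of the kind of ThreadPoolExecutor setup I mean (the pool sizes, the bounded queue, and CallerRunsPolicy are placeholders for illustration, not my actual configuration):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a crawler task pool: a bounded queue plus CallerRunsPolicy
// means the producer loop blocks (runs the task itself) instead of
// queueing tasks without limit.
public class PoolSketch {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4,                      // core and max threads (placeholder sizes)
                30, TimeUnit.SECONDS,      // idle-thread timeout
                new ArrayBlockingQueue<Runnable>(8),          // bounded task queue
                new ThreadPoolExecutor.CallerRunsPolicy());   // back-pressure
        final AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 20; i++) {
            pool.execute(new Runnable() {
                public void run() { done.incrementAndGet(); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("completed=" + done.get());
    }
}
```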
But I have run into a weird problem: memory usage keeps increasing as each
new task starts to run. Here is the code of the fetcher:
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;

public class TestFetcher implements Runnable {
    String urlObj;
    HttpClient client;   // shared HttpClient, passed in by the scheduler

    TestFetcher(HttpClient client, String url) {
        this.client = client;
        urlObj = url;
    }

    public void run() {
        process(urlObj);
    }

    public synchronized void process(String url) {
        HttpMethod method = new GetMethod(url);
        method.setFollowRedirects(true);
        String content = null;
        int fd = 0;
        try {
            client.executeMethod(method);
            Thread.sleep(1000);
            int code = method.getStatusCode();
            if (code == 200) {
                content = method.getResponseBodyAsString();
            } else {
                fd = 10 + code / 100;
            }
        } catch (Exception e) {
            fd = 10;
        } finally {
            method.releaseConnection();   // return the connection to the manager
            method = null;
        }
    }
}
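As I understand it, getResponseBodyAsString() buffers the entire entity in memory, so a single huge page means a single huge allocation. A bounded alternative would read the response in capped chunks; here is a sketch against a plain java.io.InputStream (with HttpClient the stream would come from method.getResponseBodyAsStream() instead — readAtMost and maxBytes are names I made up for illustration):

```java
import java.io.*;

// Sketch of bounded response reading: copy at most maxBytes from a
// stream, so one oversized page cannot balloon the heap. Demonstrated
// with a ByteArrayInputStream standing in for the HTTP response body.
public class BoundedRead {
    static String readAtMost(InputStream in, int maxBytes, String charset)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        // Stop when the cap is reached or the stream ends, whichever is first.
        while (buf.size() < maxBytes
                && (n = in.read(chunk, 0,
                        Math.min(chunk.length, maxBytes - buf.size()))) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toString(charset);
    }

    public static void main(String[] args) throws Exception {
        InputStream fake = new ByteArrayInputStream("hello world".getBytes("UTF-8"));
        System.out.println(readAtMost(fake, 5, "UTF-8"));   // capped at 5 bytes
    }
}
```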
And this is how I create new tasks:
while (true) {
    taskPool.execute(new TestFetcher(httpclient,
            urlPool.getTaskQueue().take()));
    while (some condition) Thread.sleep(delay);
}
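One thing worth checking in a loop like this is whether tasks are simply piling up in the executor's work queue faster than they are processed, since each queued TestFetcher holds a reference to its URL and, transitively, whatever it allocates. The queue depth can be inspected with getQueue().size(); here is a self-contained sketch that pins the only worker thread with latches so the reported depth is deterministic:

```java
import java.util.concurrent.*;

// Sketch: with the single worker thread pinned, every further execute()
// call lands in the executor's queue, so the queue depth shows how far
// the producer loop has run ahead of the fetchers.
public class QueueDepth {
    public static void main(String[] args) throws Exception {
        final CountDownLatch started = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        pool.execute(new Runnable() {        // occupies the only worker thread
            public void run() {
                started.countDown();
                try { release.await(); } catch (InterruptedException e) { }
            }
        });
        started.await();                     // worker is now definitely busy
        for (int i = 0; i < 9; i++) {
            pool.execute(new Runnable() { public void run() { } });
        }
        System.out.println("queued=" + pool.getQueue().size());  // the 9 waiting tasks
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```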
I used to use HttpURLConnection to do the fetching, and there was no memory
problem at all. The reason I want to use HttpClient is that it can
take IP addresses instead of domain names.
Please help.
Thanks,
Yang
---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpclient-user-help@jakarta.apache.org