Posted to httpclient-users@hc.apache.org by Yang Sun <ys...@ist.psu.edu> on 2007/09/13 23:53:40 UTC

problem with Multithreaded crawler using httpclient

Hi,
I am implementing a multithreaded crawler using httpclient. The fetching 
tasks are managed by a ThreadPoolExecutor.
But I have run into a weird problem: memory usage keeps increasing as each 
new task starts to run. Here is the code of the fetcher:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;

public class TestFetcher implements Runnable{
    String urlObj;
    HttpClient client;
    TestFetcher(HttpClient client, String url) {
        this.client = client;
        urlObj = url;
    }
   
    public void run() {
        process(urlObj);
    }
    public synchronized void process(String url){
        HttpMethod method = new GetMethod(url);
        method.setFollowRedirects(true);
        String content = null;
        int fd= 0;
        try{
            client.executeMethod(method);
            Thread.sleep(1000);
            int code = method.getStatusCode();
            if(code == 200){
                content = method.getResponseBodyAsString();
            } else fd = 10+ code/100;
        } catch (Exception e) {
            fd = 10;
        } finally {
            method.releaseConnection();
            method = null;
        }
    }
}

And this is how I create a new task:

while(true){
    taskPool.execute(new TestFetcher(httpclient,
            urlPool.getTaskQueue().take()));
    while(some condition) Thread.sleep(delay);
}
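
For reference, the submission loop above can be sketched in a self-contained form using only the JDK. This is a minimal sketch under assumptions, not the actual crawler: the class name SubmitLoopSketch, the stub RecordingTask (a stand-in for TestFetcher that records the URL instead of fetching it), the pool size of 4, the bounded ArrayBlockingQueue of 8, and the CallerRunsPolicy are all hypothetical choices. The post does not say how taskPool is really constructed.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SubmitLoopSketch {

    // Hypothetical stand-in for TestFetcher: records the URL instead of fetching it.
    static class RecordingTask implements Runnable {
        private final String url;
        private final List<String> seen;

        RecordingTask(String url, List<String> seen) {
            this.url = url;
            this.seen = seen;
        }

        public void run() {
            seen.add(url);
        }
    }

    // Submits one task per queued URL, waits for the pool to finish,
    // and returns the URLs that were processed.
    static List<String> runAll(BlockingQueue<String> urlQueue) {
        List<String> seen = new CopyOnWriteArrayList<>();
        // Assumed configuration: 4 fixed threads and a work queue capped at 8
        // pending tasks. When the queue is full, CallerRunsPolicy makes the
        // submitting thread run the task itself, which slows down submission.
        ThreadPoolExecutor taskPool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(8),
                new ThreadPoolExecutor.CallerRunsPolicy());
        String url;
        while ((url = urlQueue.poll()) != null) {
            taskPool.execute(new RecordingTask(url, seen));
        }
        taskPool.shutdown();
        try {
            taskPool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and return what has been processed so far.
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        BlockingQueue<String> urls = new LinkedBlockingQueue<>();
        urls.add("http://192.0.2.1/");
        urls.add("http://192.0.2.2/");
        System.out.println(runAll(urls).size());
    }
}
```

Unlike the loop above, this sketch stops once the URL queue is empty so that it terminates. The bounded work queue is only one possible configuration and is not claimed to match the real setup.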

I used to use HttpURLConnection to do the fetching, and there was no memory 
problem at all. The reason I want to use httpclient is that it can 
take IP addresses instead of domain names.

Please help.
Thanks,

Yang

