Posted to httpclient-users@hc.apache.org by Li Li <fa...@gmail.com> on 2014/02/11 05:07:31 UTC

OOM problem

I am using HttpClient 4.3 to crawl web pages.
I start 200 threads and use a PoolingHttpClientConnectionManager with
totalMax 1000 and perHostMax 5.
I give Java 2GB of memory, and one thread throws an exception (the others
are still running; this thread is dead):

Exception in thread "Thread-156" java.lang.OutOfMemoryError: Java heap space
        at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:56)
        at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:133)
        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:221)
        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:211)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
        at com.founder.httpclientfetcher.HttpClientFetcher.httpGet(HttpClientFetcher.java:233)
        at com.founder.vcfetcher.CrawlWorker.getContent(CrawlWorker.java:198)
        at com.founder.vcfetcher.CrawlWorker.doWork(CrawlWorker.java:134)
        at com.founder.vcfetcher.CrawlWorker.run(CrawlWorker.java:231)

Does this mean my code has a memory leak?

My code:
public String httpGet(String url) throws Exception {
    if (!isValid)
        throw new RuntimeException("not valid now, you should init first");
    HttpGet httpget = new HttpGet(url);

    // Create a custom response handler
    ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

        public String handleResponse(final HttpResponse response)
                throws ClientProtocolException, IOException {
            int status = response.getStatusLine().getStatusCode();
            if (status >= 200 && status < 300) {
                HttpEntity entity = response.getEntity();
                if (entity == null)
                    return null;

                byte[] bytes = EntityUtils.toByteArray(entity);
                String charSet = CharsetDetector.getCharset(bytes);

                return new String(bytes, charSet);
            } else {
                throw new ClientProtocolException(
                        "Unexpected response status: " + status);
            }
        }

    };

    String responseBody = client.execute(httpget, responseHandler);
    return responseBody;
}

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: OOM problem

Posted by Li Li <fa...@gmail.com>.
jmap result:
Debugger attached successfully.
Server compiler detected.
JVM version is 22.1-b02

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 2147483648 (2048.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 599130112 (571.375MB)
   used     = 143482424 (136.83550262451172MB)
   free     = 455647688 (434.5394973754883MB)
   23.94845812723898% used
From Space:
   capacity = 41811968 (39.875MB)
   used     = 41757744 (39.82328796386719MB)
   free     = 54224 (0.0517120361328125MB)
   99.87031464292711% used
To Space:
   capacity = 57671680 (55.0MB)
   used     = 0 (0.0MB)
   free     = 57671680 (55.0MB)
   0.0% used
PS Old Generation
   capacity = 1009254400 (962.5MB)
   used     = 557209200 (531.3961029052734MB)
   free     = 452045200 (431.10389709472656MB)
   55.209984717431006% used
PS Perm Generation
   capacity = 34275328 (32.6875MB)
   used     = 24751016 (23.604408264160156MB)
   free     = 9524312 (9.083091735839844MB)
   72.21233885785134% used



Re: OOM problem

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Mon, 2014-02-10 at 20:57 -0800, Ken Krugler wrote:
> If you're crawling web pages, you need to have a limit to the amount of data any page returns.
> 
> Otherwise you'll eventually run into a site that returns an unbounded amount of data, which will kill your JVM.
> 
> See SimpleHttpFetcher in Bixo for an example of one way to do this type of limiting (though not optimal).
> 
> -- Ken
> 
> 
> On Feb 10, 2014, at 8:07pm, Li Li <fa...@gmail.com> wrote:
> 
> > Exception in thread "Thread-156" java.lang.OutOfMemoryError: Java heap space
> >        at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:56)
> >        at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:133)

Moreover, buffering response content in memory (either as byte array or
string) sounds like a really bad idea to me.

Oleg
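
One way to follow this advice is to spool the response body to disk instead of holding it on the heap. The following is only a minimal sketch using the JDK alone: StreamToDisk and spool are hypothetical names, not part of HttpClient, and in a response handler the InputStream would presumably come from entity.getContent(). The demo feeds it an in-memory stream so it can run without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of HttpClient.
final class StreamToDisk {

    // Copies the stream to a temporary file. Files.copy moves the data
    // through a small internal buffer, so heap use does not grow with
    // the size of the body.
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("crawl-", ".body");
        try {
            // The temp file already exists, so REPLACE_EXISTING is required.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            // Do not leave a partial file behind on failure.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body (assumption: 1 MB of zero bytes).
        byte[] fakeBody = new byte[1 << 20];
        Path file = spool(new ByteArrayInputStream(fakeBody));
        System.out.println(Files.size(file)); // prints 1048576
        Files.delete(file);
    }
}
```

This trades heap pressure for disk I/O, and it does not by itself cap how much data is written, so a size limit is still needed for a crawler.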




Re: OOM problem

Posted by Ken Krugler <kk...@transpac.com>.
If you're crawling web pages, you need to have a limit to the amount of data any page returns.

Otherwise you'll eventually run into a site that returns an unbounded amount of data, which will kill your JVM.

See SimpleHttpFetcher in Bixo for an example of one way to do this type of limiting (though not optimal).

-- Ken
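
The kind of limit described above could be sketched roughly as follows. This is not the Bixo implementation: BoundedReader, readAtMost and the 64 KB cap are made-up names and values for illustration, and only JDK classes are used. In the handler from the original post, the stream would presumably come from entity.getContent() in place of the EntityUtils.toByteArray(entity) call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of HttpClient or Bixo.
final class BoundedReader {

    // Reads at most maxBytes from the stream and returns what was read.
    // Anything beyond the cap is left unread, so the result is truncated.
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        ByteArrayOutputStream out =
                new ByteArrayOutputStream(Math.min(maxBytes, 8192));
        byte[] chunk = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never ask for more than the remaining budget.
            int n = in.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n == -1) {
                break; // end of stream before the cap was reached
            }
            out.write(chunk, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an oversized page (assumption: 100000 bytes).
        InputStream page = new ByteArrayInputStream(new byte[100000]);
        byte[] body = readAtMost(page, 64 * 1024);
        System.out.println(body.length); // prints 65536
    }
}
```

Silent truncation is only one possible policy; rejecting oversized pages outright is another. What to do with the unread remainder of the stream (for example, aborting the request so it is not downloaded) is left out of this sketch.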



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr