Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/28 19:38:29 UTC

[jira] Created: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

OpenSearchServlet: OutOfMemoryError: Java heap space
----------------------------------------------------

         Key: NUTCH-292
         URL: http://issues.apache.org/jira/browse/NUTCH-292
     Project: Nutch
        Type: Bug

  Components: web gui  
    Versions: 0.8-dev    
    Reporter: Stefan Neufeind
    Priority: Critical


java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
	org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
	org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
	org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

The URL I use is:

[...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url

It seems to be a problem specific to the data I'm working with. Moving the start from 0 to 10 or changing the query works fine.
Or maybe it doesn't have to do with sorting but it's just that I hit one "bad search-result" that has a broken summary?

!! The problem is repeatable. So if anybody has an idea where to search / what to fix, I can easily try that out !!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

Posted by "Marcel Schnippe (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413983 ] 

Marcel Schnippe commented on NUTCH-292:
---------------------------------------

The cause of the OutOfMemoryError in my case was a (large) document containing a very large set of tokens. Most of the tokens are made of overlapping substrings, as in

"all your base are belong to us" => all, all-your, your, your-base, all-your-base, your-base, base-are etc
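
To illustrate why such a document is expensive, here is a minimal sketch. It is not Nutch code: the class name OverlappingTokens and the method shingles are hypothetical, and the hyphen-joined form is only assumed from the example above. It emits every run of 1 to maxN consecutive words, so the number of tokens grows to roughly maxN times the number of words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Nutch: emits overlapping
// word n-grams similar to the "all-your-base" example above.
class OverlappingTokens {

  // Returns every run of 1..maxN consecutive words, joined with '-'.
  static List<String> shingles(String text, int maxN) {
    List<String> result = new ArrayList<>();
    String trimmed = text.trim();
    if (trimmed.isEmpty()) {
      return result;
    }
    String[] words = trimmed.split("\\s+");
    for (int start = 0; start < words.length; start++) {
      StringBuilder gram = new StringBuilder();
      for (int n = 0; n < maxN && start + n < words.length; n++) {
        if (n > 0) {
          gram.append('-');
        }
        gram.append(words[start + n]);
        result.add(gram.toString());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> tokens = shingles("all your base are belong to us", 3);
    // 7 words yield 7 + 6 + 5 = 18 tokens for maxN = 3.
    System.out.println(tokens.size() + " tokens: " + tokens);
  }
}
```

If every one of those tokens is kept in memory as an object, a large document can plausibly exhaust the heap long before the summary itself is built.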




[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413778 ] 

Stefan Neufeind commented on NUTCH-292:
---------------------------------------

That patch is for the 0.7-branch, right? In 0.8-dev you'd want to do that in BasicSummarizer.java. But to me it looks like something similar is already in place:

        // Iterate through as long as we're before the end of
        // the document and we haven't hit the max-number-of-items
        // -in-a-summary.
        //
        while ((j < endToken) && (j - startToken < sumLength)) {

But I also suspect it might have something to do with tokens. I have noticed that several search results currently contain arbitrary binary data. Those are the cases where a parser plugin has "failed" and parse-text was used as a fallback. If I'm right, this might lead to very large tokens, because a long run of characters contains no whitespace.
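
One way to check this suspicion on a given piece of parse text is to measure its longest run of characters without whitespace. The following is only a sketch of such a diagnostic. The class LongestRun and its method are hypothetical and not part of Nutch, and it assumes that whitespace is what separates tokens, which may not match the analyzer actually in use.

```java
// Hypothetical diagnostic, not part of Nutch: measures the longest run of
// non-whitespace characters in a piece of parse text.
class LongestRun {

  // Returns the length of the longest stretch without any whitespace.
  static int longestNonWhitespaceRun(String text) {
    int longest = 0;
    int current = 0;
    for (int i = 0; i < text.length(); i++) {
      if (Character.isWhitespace(text.charAt(i))) {
        current = 0;
      } else {
        current++;
        if (current > longest) {
          longest = current;
        }
      }
    }
    return longest;
  }

  public static void main(String[] args) {
    // The longest words ("your", "base") have four characters.
    System.out.println(longestNonWhitespaceRun("all your base")); // prints 4
  }
}
```

A result in the range of thousands or more for a single document would support the idea that binary data ended up in the parse text.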

@Marcel: Thank you for the fix anyway ... your help is very much appreciated.



[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

Posted by "Marcel Schnippe (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413982 ] 

Marcel Schnippe commented on NUTCH-292:
---------------------------------------

Hi Stefan, 

Thanks for trying out the patch. Yes, you were right, it was for 0.7. I should definitely switch, but I have made so many custom changes.
The proper place to apply it would be summary-basic.getTokens, as in

  private Token[] getTokens(String text) {
    ArrayList result = new ArrayList();
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    Token token = null;
-   while (true) {
+   while (result.size() < token_deep) {
      try {
        token = ts.next();
      } catch (IOException e) {
        token = null;
      }
      if (token == null) { break; }
      result.add(token);
    }
    try {
      ts.close();
    } catch (IOException e) {
      // ignore
    }
    return (Token[]) result.toArray(new Token[result.size()]);
  }

<humor>Beware of the above code. I have only proven it correct, not tested it  (D.Knuth)</humor>
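
For anyone who wants to try the idea outside Nutch, here is a self-contained sketch of the same bounded loop. It makes two assumptions: a plain Iterator<String> stands in for Lucene's TokenStream, and the parameter maxTokens plays the role of token_deep from the patch. The class name BoundedTokens is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch only: Iterator<String> stands in for Lucene's TokenStream, and
// maxTokens plays the role of token_deep in the patch above.
class BoundedTokens {

  // Collects at most maxTokens tokens, then stops reading the source.
  static List<String> collect(Iterator<String> source, int maxTokens) {
    List<String> result = new ArrayList<>();
    while (result.size() < maxTokens && source.hasNext()) {
      result.add(source.next());
    }
    return result;
  }

  public static void main(String[] args) {
    Iterator<String> words =
        Arrays.asList("all", "your", "base", "are", "belong", "to", "us").iterator();
    System.out.println(collect(words, 3)); // prints [all, your, base]
  }
}
```

With such a limit, the size of the token list depends on the limit rather than on the document. The trade-off is that query matches beyond the limit can no longer appear in the summary.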



[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] 

Stefan Groschupf commented on NUTCH-292:
----------------------------------------

+1. Can someone create a clean patch file?



[jira] Updated: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

Posted by "Marcel Schnippe (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-292?page=all ]

Marcel Schnippe updated NUTCH-292:
----------------------------------

    Attachment: summarizer.diff

The exception seems familiar to me.
Maybe the attached patch helps?

Marcel Schnippe



[jira] Resolved: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-292?page=all ]
     
Sami Siren resolved NUTCH-292:
------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed
      Assign To: Sami Siren

I just committed this, thank you!



[jira] Updated: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-292?page=all ]

Stefan Neufeind updated NUTCH-292:
----------------------------------

    Attachment: NUTCH-292-summarizer08.diff

As requested, here is the patch.

Please note that I have not tested it thoroughly myself. But the patch looks fine and makes sense :-) Oh, and it compiles cleanly ...
