Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2013/10/04 14:04:44 UTC

[jira] [Commented] (NUTCH-1640) OOM in ParseSegment Phase

    [ https://issues.apache.org/jira/browse/NUTCH-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786091#comment-13786091 ] 

Julien Nioche commented on NUTCH-1640:
--------------------------------------

Hi Mitesh

Thanks for such a clear and well-documented issue. I haven't managed to reproduce the OOM, even when creating 5M ParseUtil instances. I previously reported an issue where the caches could leak because their access is not always synchronised by the Factory classes, but since the parsing is not multi-threaded I don't think that is the cause of the problem here.

Here is the simple test I used:

{code}
  public void testInstantiations() {
    ParseUtil p = null;
    for (int i = 0; i < 5000000; i++) {
      p = new ParseUtil(conf);
      // Every 100,000 instances, force a GC and report heap usage in KB
      if (i % 100000 == 0) {
        System.gc();

        Runtime runtime = Runtime.getRuntime();
        NumberFormat format = NumberFormat.getInstance();

        StringBuilder sb = new StringBuilder();
        long maxMemory = runtime.maxMemory();
        long allocatedMemory = runtime.totalMemory();
        long freeMemory = runtime.freeMemory();

        sb.append("free memory: " + format.format(freeMemory / 1024));
        sb.append("\t allocated memory: " + format.format(allocatedMemory / 1024));
        sb.append("\t max memory: " + format.format(maxMemory / 1024));
        sb.append("\t total free memory: "
            + format.format((freeMemory + (maxMemory - allocatedMemory)) / 1024));

        System.out.println(sb.toString());
      }
    }
  }
{code}

This is different from your situation as this test is done in local mode and no content is actually parsed.

Even though I can't reproduce the issue or explain why you were hitting it, I think it would be a good idea to commit your patch anyway. I can't think of a reason why we'd want to create a new ParseUtil instance for every document we process. Reusing a single instance should yield a gain in processing time, though certainly not in the proportion you measured; in your case the difference was probably due to GC overhead.
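
For illustration only, here is a minimal self-contained sketch of the hoisting pattern the patch applies: move the expensive object out of the per-record path and reuse one instance for the task's lifetime. {{DocParser}} and {{SegmentMapper}} are hypothetical stand-ins, not Nutch's actual ParseUtil or ParseSegment classes.

{code}
// Sketch of the fix, under stated assumptions: DocParser stands in for
// ParseUtil, SegmentMapper for the mapper in ParseSegment.
class DocParser {
  DocParser() {
    // imagine costly setup here: plugin lookup, cache population, ...
  }
  String parse(String content) {
    return content.trim();
  }
}

class SegmentMapper {
  // Before the patch: new DocParser(...) inside map() for every record.
  // After: one instance, created once and reused for every record.
  private final DocParser parser = new DocParser();

  String map(String content) {
    return parser.parse(content);
  }
}

public class ReuseDemo {
  public static void main(String[] args) {
    SegmentMapper mapper = new SegmentMapper();
    for (String doc : new String[] {" a ", " b "}) {
      System.out.println(mapper.map(doc));
    }
  }
}
{code}

The construction cost is paid once per task rather than once per document, which is why the gain shows up in processing time and GC pressure rather than in correctness.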

Will commit shortly unless someone comes up with a good reason not to.

Thanks

Julien





> OOM in ParseSegment Phase
> -------------------------
>
>                 Key: NUTCH-1640
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1640
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>         Environment: RHEL 6.2 x86_64
>            Reporter: Mitesh Singh Jat
>         Attachments: NUTCH-1640.patch
>
>
> The nutch ParseSegment phase fails after 2 runs on same TaskTracker, with the following Exception:
> {noformat}
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:640)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:553)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvmRunner(JvmManager.java:317)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvm(JvmManager.java:297)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.taskKilled(JvmManager.java:289)
> 	at org.apache.hadoop.mapred.JvmManager.taskKilled(JvmManager.java:158)
> 	at org.apache.hadoop.mapred.TaskRunner.kill(TaskRunner.java:802)
> 	at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:3315)
> 	at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:3287)
> 	at org.apache.hadoop.mapred.TaskTracker.purgeTask(TaskTracker.java:2316)
> 	at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3710)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1118)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> 	at $Proxy1.fatalError(Unknown Source)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:310)
> {noformat}
> Whereas similar parsing, when done in the Nutch Fetcher phase (fetcher.parse=true, fetcher.store.content=false), does not show this issue.
> Hence, on analysing the code of Fetcher and ParseSegment, the issue seems
> related to the creation of a new ParseUtil/ParseResult for each URL in ParseSegment.java:
> {code}
>     ParseResult parseResult = null;
>     try {
>       parseResult = new ParseUtil(getConf()).parse(content); // <*****
>     } catch (Exception e) {
>       LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
>       return;
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)