Posted to dev@nutch.apache.org by "Ferdy Galema (JIRA)" <ji...@apache.org> on 2012/05/07 14:11:48 UTC

[jira] [Created] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Ferdy Galema created NUTCH-1356:
-----------------------------------

             Summary: ParseUtil use ExecutorService instead of manually thread handling.
                 Key: NUTCH-1356
                 URL: https://issues.apache.org/jira/browse/NUTCH-1356
             Project: Nutch
          Issue Type: Improvement
            Reporter: Ferdy Galema
             Fix For: nutchgora
         Attachments: NUTCH-1356.patch

Because ParseUtil manages its own parser threads, creating a new thread for every parse, some parsers become very expensive. For example, parsers that keep thread-local fields will initialize them again for every item to be parsed.

By simply introducing a caching ExecutorService, ParseUtil will be able to reuse threads, which makes parsing more efficient. See attached patch.
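
To illustrate the effect, here is a minimal Java sketch. It is not the attached patch: the class name, the fakeParse method and the StringBuilder thread-local are hypothetical stand-ins for a parser with expensive thread-local state. It counts how often that state is built when every parse gets its own thread, and when parses are submitted to Executors.newCachedThreadPool().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedParserThreads {

  // Counts how often the expensive per-thread state gets built.
  static final AtomicInteger INITIALIZATIONS = new AtomicInteger();

  // Hypothetical stand-in for a parser's thread-local field.
  static final ThreadLocal<StringBuilder> BUFFER = ThreadLocal.withInitial(() -> {
    INITIALIZATIONS.incrementAndGet();
    return new StringBuilder();
  });

  // Stand-in for a real parse; it only touches the thread-local state.
  static int fakeParse(String content) {
    StringBuilder buffer = BUFFER.get();
    buffer.setLength(0);
    buffer.append(content);
    return buffer.length();
  }

  // One new thread per parse: the thread-local is rebuilt every time.
  static int initsWithThreadPerParse(int parses) {
    INITIALIZATIONS.set(0);
    try {
      for (int i = 0; i < parses; i++) {
        Thread worker = new Thread(() -> fakeParse("some content"));
        worker.start();
        worker.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return INITIALIZATIONS.get();
  }

  // Caching pool: an idle worker thread, and its thread-local, can be reused.
  static int initsWithCachedPool(int parses) {
    INITIALIZATIONS.set(0);
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      Callable<Integer> parse = () -> fakeParse("some content");
      for (int i = 0; i < parses; i++) {
        pool.submit(parse).get(); // wait for each parse before the next one
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
    return INITIALIZATIONS.get();
  }

  public static void main(String[] args) {
    System.out.println("thread per parse: " + initsWithThreadPerParse(20));
    System.out.println("cached pool:      " + initsWithCachedPool(20));
  }
}
```

With a thread per parse the count always equals the number of parses. With the caching pool it is usually much lower, but not guaranteed to be exactly one, because the pool starts an extra thread whenever no idle worker is available at submit time.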

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1356:
--------------------------------

    Fix Version/s: 1.6

Sure, I will create a patch for 1.x too. (It does not seem that different.)
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285466#comment-13285466 ] 

Markus Jelsma commented on NUTCH-1356:
--------------------------------------

I kept checking the log and found more of the exceptions from my earlier comment, along with several similar ones:

{code}
java.util.concurrent.TimeoutException
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
	at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:162)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
{code}

More interesting is that I see additional stack traces elsewhere indicating a problem with a parse filter:

{code}
Caused by: java.lang.NullPointerException
	at org.apache.xerces.dom.ParentNode.nodeListItem(Unknown Source)
	at org.apache.xerces.dom.ParentNode.item(Unknown Source)
	at org.apache.nutch.util.NodeWalker.nextNode(NodeWalker.java:75)
	at org.apache.nutch.parse.headings.HeadingsParseFilter.getElement(HeadingsParseFilter.java:79)
	at org.apache.nutch.parse.headings.HeadingsParseFilter.filter(HeadingsParseFilter.java:48)
	at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:98)
	at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:208)
	at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
	at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
{code}
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285455#comment-13285455 ] 

Markus Jelsma commented on NUTCH-1356:
--------------------------------------

I came across an NPE when inspecting the logs. We use the patch on 1.5 and I can't remember having seen this error before. Is it related to this patch?

It is, unfortunately, not reproducible with the parserchecker. I should also note that this fetcher is having a very tough job (all others finished a while ago) so something is going on, perhaps eating up memory due to a leak somewhere (Tika?).

{code}
2012-05-30 07:03:39,560 WARN org.apache.nutch.parse.ParseUtil: Error parsing http://www.vvvroomshoopseboys.nl/e7-programma.html with org.apache.nutch.parse.tika.TikaParser@2bd2e84e
java.util.concurrent.ExecutionException: java.lang.NullPointerException
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
	at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:162)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
{code}
                

[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1356:
--------------------------------

    Attachment: NUTCH-1356-trunk-v2.patch

It was working though; I guess that is because of a transitive dependency. Anyway, it's best to declare it as a direct dependency too. Patch v2 does this. (11.0.2 --> the same as the already present jar).
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271073#comment-13271073 ] 

Hudson commented on NUTCH-1356:
-------------------------------

Integrated in Nutch-nutchgora #248 (See [https://builds.apache.org/job/Nutch-nutchgora/248/])
    NUTCH-1356 ParseUtil use ExecutorService instead of manually thread handling. (Revision 1335065)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParseUtil.java

                

[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1356:
--------------------------------

    Attachment: NUTCH-1356-trunk.patch

Patch for trunk.
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269695#comment-13269695 ] 

Ferdy Galema commented on NUTCH-1356:
-------------------------------------

Committed to nutchgora.
                

[jira] [Updated] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1356:
--------------------------------

    Attachment: NUTCH-1356.patch
    

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269589#comment-13269589 ] 

Markus Jelsma commented on NUTCH-1356:
--------------------------------------

Guava is not listed as a dependency on trunk.

{code}
<dependency org="com.google.guava" name="guava" rev="12.0" />
{code}

Great work!
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293510#comment-13293510 ] 

Ferdy Galema commented on NUTCH-1356:
-------------------------------------

Thanks.

"The parser threads you refer to, is that a known problem? Can we solve it?"
To solve it correctly, every parser should check the interrupted state at regular intervals. This is a pretty huge task considering the number of parsers. For now it is something to keep in mind. I'll create an issue for reference.
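
As a rough illustration of what checking the interrupted state at regular intervals could look like, here is a minimal Java sketch. It does not come from any actual Nutch parser: the class and method names are hypothetical, and the counting loop merely stands in for real parsing work. The loop polls the interrupt flag every 1024 iterations, so a cancel(true) on its Future actually stops the worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InterruptAwareParse {

  // Hypothetical parse loop: does `units` pieces of work and polls the
  // interrupt flag every 1024 iterations.
  static long parseUntilDone(long units) throws InterruptedException {
    long done = 0;
    while (done < units) {
      if (done % 1024 == 0 && Thread.currentThread().isInterrupted()) {
        throw new InterruptedException("parse cancelled after " + done + " units");
      }
      done++; // stand-in for one unit of real parsing work
    }
    return done;
  }

  // Starts a parse that is far too long, lets it time out, cancels it and
  // reports whether the worker thread really stopped.
  static boolean cancelStopsWorker() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch stopped = new CountDownLatch(1);
    Callable<Long> endlessParse = () -> {
      try {
        return parseUntilDone(Long.MAX_VALUE);
      } finally {
        stopped.countDown(); // signals that the worker left the parse loop
      }
    };
    Future<Long> task = pool.submit(endlessParse);
    try {
      try {
        task.get(100, TimeUnit.MILLISECONDS);
        return false; // not expected: the parse cannot finish this quickly
      } catch (TimeoutException e) {
        task.cancel(true); // sets the interrupt flag on the worker thread
      }
      return stopped.await(2, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException e) {
      return false;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("worker stopped after cancel: " + cancelStopsWorker());
  }
}
```

Polling only every so many iterations keeps the overhead low. Without the isInterrupted() check, the same loop would keep running after the timeout and the cancel.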
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269568#comment-13269568 ] 

Markus Jelsma commented on NUTCH-1356:
--------------------------------------

Nice! Do you have a 1.x patch as well?
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286545#comment-13286545 ] 

Markus Jelsma commented on NUTCH-1356:
--------------------------------------

Thanks for clearing things up.

These errors are indeed not reproducible. I collected about 12 of the URLs that threw these errors and manually fetched them again with success; the parsechecker also processes them without error.

The parser threads you refer to, is that a known problem? Can we solve it? Once in a while a stalling parser eats up all remaining memory and dies.
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295803#comment-13295803 ] 

Hudson commented on NUTCH-1356:
-------------------------------

Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/])
    NUTCH-1356 ParseUtil use ExecutorService instead of manually thread handling (Revision 1349230)

     Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349230
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseUtil.java

                

[jira] [Resolved] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1356.
----------------------------------

    Resolution: Fixed

Committed for 1.6 in rev. 1349230.
Thanks Ferdy.
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285510#comment-13285510 ] 

Ferdy Galema commented on NUTCH-1356:
-------------------------------------

I find it difficult to believe those exceptions are caused by this patch. It does not change the way exceptions/timeouts are handled; it only makes sure parser threads are reused.

It seems you are suffering from two types of (unrelated) exceptions. The first is ExecutionException. It is thrown whenever the task executed behind FutureTask.get() throws an exception that is not caught anywhere before it reaches FutureTask.get() itself. In your case this seems to be an NPE during the parse of the HTML page. It might be a bug, but then again it is strange that it is not reproducible with the ParserChecker. (Are you sure about this?)

The second is TimeoutException, thrown whenever FutureTask.get() cannot complete within the specified timeout. The tricky part is that a single URL might complete well within the timeout on its own, but under heavy concurrent load (a lot of semi-expensive parses) the parser load can stack up and cause many parses to time out. This can be the case when parsing during fetch. But it can also happen when using a separate parser job, because Parser implementations do not necessarily respond to a thread interrupt (which is fired by the task.cancel(true) call). If a parser does not check the Thread.interrupted state at regular intervals, it will just continue to run and eat up resources.

I find it very helpful to debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT <process_id>. This dumps stack traces, sometimes exposing the fact that hundreds of parser threads are still active in the background. (Of course, many of them timed out a long time ago.)
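
The difference between the two exception types can be shown with a small Java sketch. This is not the ParseUtil code: the class name, the runWithTimeout helper and the three sample tasks are hypothetical. It submits a task to an ExecutorService, waits with Future.get(timeout, unit), and reports whether the task finished, failed with its own exception (ExecutionException), or ran out of time (TimeoutException followed by cancel(true)).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseOutcomes {

  // Runs a task with a timeout and describes how it ended.
  static String runWithTimeout(ExecutorService pool, Callable<String> parse,
                               long timeoutMillis) {
    Future<String> task = pool.submit(parse);
    try {
      return "ok: " + task.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (ExecutionException e) {
      // The task itself threw; the original exception is the cause.
      return "failed: " + e.getCause().getClass().getSimpleName();
    } catch (TimeoutException e) {
      // Only requests an interrupt; a task that ignores it keeps running.
      task.cancel(true);
      return "timed out";
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      task.cancel(true);
      return "interrupted";
    }
  }

  // Three hypothetical parses: a quick one, a failing one and a slow one.
  static String[] demo() {
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      Callable<String> quick = () -> "parsed";
      Callable<String> broken = () -> { throw new NullPointerException(); };
      Callable<String> slow = () -> { Thread.sleep(5000); return "late"; };
      return new String[] {
        runWithTimeout(pool, quick, 1000),
        runWithTimeout(pool, broken, 1000),
        runWithTimeout(pool, slow, 100)
      };
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    for (String outcome : demo()) {
      System.out.println(outcome);
    }
  }
}
```

The slow task here sleeps, and Thread.sleep responds to interrupts, so cancel(true) ends it. A task that never checks the interrupt flag would keep running after its timeout.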
                

[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293548#comment-13293548 ] 

Hudson commented on NUTCH-1356:
-------------------------------

Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/])
    NUTCH-1356 ParseUtil use ExecutorService instead of manually thread handling (Revision 1349230)

     Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseUtil.java

                