Posted to dev@nutch.apache.org by "lufeng (JIRA)" <ji...@apache.org> on 2013/04/01 03:06:14 UTC

[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

    [ https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618536#comment-13618536 ] 

lufeng commented on NUTCH-1538:
-------------------------------

Hi Roland,
Yes, I mean that third-party plugins may use these fields, not only the content field.

Yes, if all generated URLs have already been crawled, reading these contents may actually take up a lot of time. But I'm also not sure what the side effects would be if we comment out this code. I see that the ParserJob#getFields method loads the parsePluginFields, htmlParsePluginFields and signaturePluginFields. That is why I said that third-party plugins may load some fields in WebPage. I'll probably run a test. Does anyone else have any comments? :)
                
> tuning of loaded fields during fetcherJob start-up
> --------------------------------------------------
>
>                 Key: NUTCH-1538
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1538
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 2.1
>         Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / gora-core 0.2.1 
> running fetch with parse=true
>            Reporter: Roland von Herget
>         Attachments: NUTCH-1538-FetcherJob-v1.patch
>
>
> The main problem is that Nutch loads nearly every row & column from the DB during start-up of a fetcherJob when fetcher.parse=true.
> A parserJob needs, e.g., the CONTENT field from the DB in order to parse.
> The fetcherJob adds all fields of the parserJob to its needed fields if running with fetcher.parse=true. [FetcherJob.getFields()]
> If the Nutch configuration saves all fetched data to the DB (fetcher.store.content=true), you'll end up loading GBs of unused content during fetcherJob start-up.
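
The field-merging behaviour described above can be illustrated with a minimal, self-contained sketch. This is not Nutch's actual code: the Field enum, the FETCHER_FIELDS set, and the parserFields/fetcherFields helpers are hypothetical stand-ins that only mimic the idea of FetcherJob.getFields() adding the parser's fields (including those requested by plugins) when fetcher.parse=true.

```java
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class FieldSelectionSketch {

    // Hypothetical stand-in for the WebPage columns; not Nutch's real field enum.
    enum Field { FETCH_TIME, STATUS, MARKERS, CONTENT, TEXT, SIGNATURE }

    // Fields the fetcher needs for itself (assumed for illustration only).
    static final Set<Field> FETCHER_FIELDS =
            EnumSet.of(Field.FETCH_TIME, Field.STATUS, Field.MARKERS);

    // The parser's own fields plus whatever each plugin asks for.
    static Set<Field> parserFields(Collection<Set<Field>> pluginFields) {
        Set<Field> fields = EnumSet.of(Field.CONTENT, Field.STATUS, Field.MARKERS);
        for (Set<Field> requested : pluginFields) {
            fields.addAll(requested);
        }
        return fields;
    }

    // With parse=true the fetcher also requests every parser field,
    // so CONTENT ends up in the set loaded at start-up.
    static Set<Field> fetcherFields(boolean parse, Collection<Set<Field>> pluginFields) {
        Set<Field> fields = EnumSet.copyOf(FETCHER_FIELDS);
        if (parse) {
            fields.addAll(parserFields(pluginFields));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Two hypothetical plugins, each requesting one extra field.
        List<Set<Field>> plugins =
                List.of(EnumSet.of(Field.TEXT), EnumSet.of(Field.SIGNATURE));
        System.out.println("parse=false: " + fetcherFields(false, plugins));
        System.out.println("parse=true:  " + fetcherFields(true, plugins));
    }
}
```

In this sketch, CONTENT and the plugin-requested fields appear in the fetcher's set only when parse is true, which is the case where stored content would be read for every row at start-up. Whether those fields can safely be dropped from the start-up query depends on what the plugins actually read, which is the open question in the comment above.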

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira