Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/03/04 20:19:12 UTC

[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

    [ https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592524#comment-13592524 ] 

Lewis John McGibbney commented on NUTCH-1538:
---------------------------------------------

Thank god you got to the bottom of this one, Roland.
I never use a parsing fetcher.
Just to clarify, are you saying that the fields which cause the slow loading are always loaded, regardless of whether a parsing fetcher is used?
If that is not the case then no patch needs to be applied. However, it is certainly something folks need to be aware of IF they choose to use a parsing fetcher and to store content.
This one had me stumped.
                
> tuning of loaded fields during fetcherJob start-up
> --------------------------------------------------
>
>                 Key: NUTCH-1538
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1538
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 2.1
>         Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / gora-core 0.2.1 
> running fetch with parse=true
>            Reporter: Roland
>
> The main problem is that Nutch loads nearly every row and column from the DB during start-up of a fetcherJob when fetcher.parse=true.
> A parserJob needs, for example, the CONTENT field from the DB in order to parse.
> When running with fetcher.parse=true, the fetcherJob adds all fields of the parserJob to its own needed fields. [FetcherJob.getFields()]
> If the Nutch configuration saves all fetched data to the DB (fetcher.store.content=true), you'll end up loading GBs of unused content during fetcherJob start-up.
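
For anyone skimming the thread, the field selection described in the report can be sketched roughly as below. This is a simplified model, not the actual FetcherJob.getFields() code: the Field enum, the two field sets and the fetcherFields() helper are all made up for illustration, and only the idea (parser fields get merged in when fetcher.parse=true, which drags CONTENT along) comes from the issue description.

```java
import java.util.EnumSet;
import java.util.Set;

// Simplified model of the behaviour described in NUTCH-1538.
// NOT Nutch's actual code: every name below is hypothetical.
class FieldSelectionSketch {

    // Hypothetical stand-in for the columns of a stored web page row.
    enum Field { STATUS, FETCH_TIME, MARKERS, CONTENT, CONTENT_TYPE }

    // Fields the fetcher itself needs at start-up (assumed for illustration).
    static final Set<Field> FETCHER_FIELDS =
            EnumSet.of(Field.STATUS, Field.FETCH_TIME, Field.MARKERS);

    // Fields a parser needs (assumed); CONTENT is the large one.
    static final Set<Field> PARSER_FIELDS =
            EnumSet.of(Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE);

    // Mirrors the reported logic: with parse=true the parser's fields
    // are added to the set of fields requested from the store.
    static Set<Field> fetcherFields(boolean parse) {
        Set<Field> fields = EnumSet.copyOf(FETCHER_FIELDS);
        if (parse) {
            fields.addAll(PARSER_FIELDS);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println("fetcher.parse=false -> " + fetcherFields(false));
        System.out.println("fetcher.parse=true  -> " + fetcherFields(true));
    }
}
```

In this sketch CONTENT is only requested when parse is true, which would match the reading that a non-parsing fetcher is unaffected. Whether the real job behaves the same way is exactly the question asked in the comment above.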

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira