Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2018/01/16 10:53:00 UTC

[jira] [Comment Edited] (NUTCH-2496) Speed up link inversion step in crawling script

    [ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326999#comment-16326999 ] 

Markus Jelsma edited comment on NUTCH-2496 at 1/16/18 10:52 AM:
----------------------------------------------------------------

Yes, it makes a lot of sense to disable filtering and normalizing everywhere except when running the injector and during the parse phase (or the fetch phase, if you parse during fetch).

Parse and inject are the points of entry: that is where new records are added to the DB, so that is where you want to filter and normalize.

You only need to filter in another phase when you have changed your normalizers and/or filters.
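
As a rough sketch of what that could look like for the link inversion step, assuming a Nutch 1.x install whose `invertlinks` command accepts `-noNormalize` and `-noFilter` (check the usage output of your version): the helper name `invertlinks_cmd`, the `NUTCH_BIN` variable, and the paths below are made up for illustration. The function only prints the command and does not run it.

```shell
#!/usr/bin/env bash
# Sketch only: build an invertlinks command with URL normalizers and
# filters switched off. NUTCH_BIN, the function name, and the paths are
# illustrative assumptions, not part of Nutch itself.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

invertlinks_cmd() {
  local linkdb="$1" segment="$2"
  # Skip normalizing and filtering here, on the assumption that URLs
  # were already normalized and filtered at inject and parse time.
  printf '%s invertlinks %s %s -noNormalize -noFilter\n' \
    "$NUTCH_BIN" "$linkdb" "$segment"
}

# Print the command for a hypothetical segment; nothing is executed.
invertlinks_cmd crawl/linkdb crawl/segments/20180116105300
```

If the normalizers or filters change later, the flags would have to be dropped for one pass so that existing records are cleaned up again.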


was (Author: markus17):
Yes it makes a lot of sense to disable it everywhere except when running the injector and during the parse phase (or fetch phase if you parse during fetch).

Parse and inject are the points of entry, that is where new records are added to the DB, that is where you want to filter.

You only need to filter in another phase when you have changed your normalizers and/or filters.

> Speed up link inversion step in crawling script
> -----------------------------------------------
>
>                 Key: NUTCH-2496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2496
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
>            Priority: Major
>
> While working on a project where I have to index a huge number of URLs, I encountered an issue with the link inversion step of the crawling script. A while ago, Ian Lopata stumbled upon the same issue, as described here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
> {quote}
> I am running the invertlinks step in my Nutch 1.6 based crawl process on a 
> single node.  I run invertlinks only because I need the Inlinks in the 
> indexer step so as to store them with the document.  I do not need the 
> anchor text and I am not scoring.  I am finding that invertlinks (and more 
> specifically the merge of the linkdb) takes a long time - about 30 minutes 
> for a crawl of around 150K documents.  I am looking for ways that I might 
> shorten this processing time.  Any suggestions? 
> {quote}
> Back then, [~wastl-nagel] suggested turning off the normalizers and filters during the inversion step, which speeds up the process considerably.
> In my case, however, I kind of depend on those, so this is no real solution.
> I opened this issue to get some feedback on how we could improve the crawl script and speed up the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)