Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 17:09:08 UTC

[jira] [Closed] (NUTCH-659) Help! No urls fetched for internal repository website

     [ https://issues.apache.org/jira/browse/NUTCH-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-659.
-------------------------------


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Help! No urls fetched for internal repository website
> -----------------------------------------------------
>
>                 Key: NUTCH-659
>                 URL: https://issues.apache.org/jira/browse/NUTCH-659
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: nutch 0.9, TOMCAT6.0.18, JAVA 1.6.0_10, CentOS 5.2
>            Reporter: Bryan
>            Priority: Critical
>
> I am new to Nutch and have set it up to search my company's internal websites. The version is nutch-2008-11-02_04-01-26.tar.
>  
> My company's internal websites include several HTTP websites.
> Another is an SVN repository website served over HTTPS, with listings in XML structure using <dir> and <file> tags.
>  
> Search on the HTTP websites works well.
> HTTPS itself works: some links on those HTTP websites point to Word files under the SVN website, and those files can be indexed.
>  
> But Nutch does not search the SVN website itself. If I point it only at the SVN website, the result is always: 0 urls fetched.
>  
> My nutch-site.xml is as follows:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>  
> # skip file:, ftp:, & mailto: urls
> -^(ftp|mailto):
>  
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*smartlabs.com.au/
>  
> Any help would be much appreciated. Thanks in advance.
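
The filter rules quoted above can be checked against sample URLs with a minimal sketch like the one below. It is not Nutch code: it assumes that rules are tried in order, that the first matching rule decides, and that a URL matching no rule is rejected. The sample URLs and the RULES_WITH_HTTPS variant are hypothetical, and the thread does not confirm that this was the cause of the problem.

```python
import re

# Rules copied from the report, in order: "-" rejects, "+" accepts.
RULES = [
    ("-", r"^(ftp|mailto):"),
    ("+", r"^http://([a-z0-9]*\.)*smartlabs.com.au/"),
]

# Hypothetical variant (not from the report) that also accepts https.
RULES_WITH_HTTPS = [
    ("-", r"^(ftp|mailto):"),
    ("+", r"^https?://([a-z0-9]*\.)*smartlabs.com.au/"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a "+" rule.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "http://www.smartlabs.com.au/index.html",
        "https://svn.smartlabs.com.au/repo/",
    ):
        print(url, accepts(url), accepts(url, RULES_WITH_HTTPS))
```

Under these assumptions the accept rule as written matches only URLs beginning with http://, so a seed URL beginning with https:// would be filtered out before fetching.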

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira