Posted to dev@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2007/09/23 17:38:17 UTC
Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?
(Copied from nutch-user, this is more a dev topic now)
> It's not an issue with readseg or readlinkdb themselves, because a
> segment fetched in the older nutch (using the exact same
> configuration) expels png links in trunk's readlinkdb. It appears
> the fetcher now only parses URLs that pass the filters into the
> segment.
I checked the diffs between my old version (mid-December 06) and
trunk's ParseOutputFormat. It appears that the parse now puts the
outlink URLs through the URLFilters. I confirmed this by removing the
.png exclusion from my URLFilters and re-running a crawl -- pngs now
appear in the readlinkdb output.
1) Was it a bug that URLs that would not pass URLFilters got into the
linkdb for analysis?
2) If so, why is there a -noFilter option for readlinkdb? The linkdb
has already been filtered whether you like it or not. -noFilter will
never have any effect.
There needs to be a way to have the linkdb reflect all URLs
(unfiltered) for further analysis. I suggest a -noFilterOutlinks
option (default off) for the fetch command, since fetch parses by
default. It would simply skip the filter call in ParseOutputFormat,
if my theory is correct.
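
A minimal sketch of the proposed behavior, for illustration only. This is not Nutch code: the class OutlinkFilterSketch, the method collectOutlinks, and the REJECT_PNG predicate are hypothetical stand-ins, and -noFilterOutlinks is the suggested option, not an existing one. It assumes the parse step applies a URL filter to each outlink before writing it to the segment, and shows how a flag could bypass that step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkFilterSketch {

    // Hypothetical stand-in for a URL filter chain: accepts everything
    // except URLs ending in .png.
    static final Predicate<String> REJECT_PNG =
        url -> !url.toLowerCase().endsWith(".png");

    // Returns the outlinks that would be kept for the segment.
    // When noFilterOutlinks is true, the filter is never consulted,
    // so every outlink is kept.
    static List<String> collectOutlinks(List<String> outlinks,
                                        Predicate<String> filter,
                                        boolean noFilterOutlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (noFilterOutlinks || filter.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://example.com/index.html",
            "http://example.com/logo.png");

        // Filtering on: the .png outlink is dropped.
        System.out.println(collectOutlinks(outlinks, REJECT_PNG, false));
        // Filtering off: both outlinks are kept.
        System.out.println(collectOutlinks(outlinks, REJECT_PNG, true));
    }
}
```

With the flag off, only the .html link survives; with it on, both links do, which is what the linkdb would need in order to reflect unfiltered URLs.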
Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?
Posted by Brian Whitman <br...@variogr.am>.
On Sep 23, 2007, at 11:38 AM, Brian Whitman wrote:
>
> 2) If so, why is there a -noFilter option for readlinkdb?
>
My mistake; that should read:
> 2) If so, why is there a -noFilter option for invertlinks?