Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2007/09/22 21:37:12 UTC

nutch trunk filtering URLs in invertlinks even if -noFilter is on?

We recently upgraded from a late-2006 nightly build of nutch to trunk,
and most things have been faster and more stable.

However, there is one catch: we have a "readlinkdb" call in our crawl
process because we want to catalog links to a binary file type (say,
.png) so that other programs of ours can download and parse those files.

We have .png in our url filters because we don't want nutch to try to  
download these files, but we do want the linkdb to note them.

In our old crawl script, we did:

bin/nutch invertlinks crawl/linkdb -dir crawl/segment -noFilter
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

which worked fine and there were many .png files in the dump.  
However, with trunk, this doesn't seem to be the case anymore. There  
are no .png files in the linkdb dump, only html (pretty much the only  
filetype we allow nutch to download.)
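
A quick way to check a dump for a given extension could look like the sketch below. It assumes the dump directory holds plain-text files named part-* (an assumption about the dump layout, not something verified here), and count_ext is a hypothetical helper name.

```shell
# Hypothetical helper: count the lines in a readlinkdb dump that mention a
# URL ending in a given extension.
# Assumption: the dump directory contains plain-text files named part-*.
count_ext() {
  dump_dir=$1
  ext=$2
  # Match ".<ext>" followed by whitespace, "?", "#", or end of line, so that
  # ".png" does not also match something like ".pngx".
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  cat "$dump_dir"/part-* 2>/dev/null \
    | grep -c -E "\.${ext}([[:space:]?#]|\$)" || true
}

# Intended use after the dump step above:
#   count_ext linkdb_dump png
```

A count of 0 for png, alongside a non-zero count for html, would match what we are seeing with trunk.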

Is this intended? Am I doing something wrong?

Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?

Posted by Brian Whitman <br...@variogr.am>.

On Sep 23, 2007, at 11:38 AM, Brian Whitman wrote:
>
> 2) If so, why is there a -noFilter option for readlinkdb?
>

My mistake; that should read:

> 2) If so, why is there a -noFilter option for invertlinks?

Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?

Posted by Brian Whitman <br...@variogr.am>.

(Copied from nutch-user; this is more of a dev topic now.)
> It's not an issue with readseg or readlinkdb themselves, because a
> segment fetched with the older nutch (using the exact same
> configuration) yields png links in trunk's readlinkdb. It appears
> the fetcher now writes only outlinks that pass the filters into the
> segment.


I diffed ParseOutputFormat between my old version (mid-December 06)
and trunk. It appears that the parse step now puts the outlink URLs
through the URLFilters. I confirmed this by taking .png out of my
URLFilters and re-running a crawl: pngs now appear in the readlinkdb
output.

1) Was it a bug that URLs that would not pass URLFilters got into the  
linkdb for analysis?

2) If so, why is there a -noFilter option for readlinkdb? The linkdb  
has already been filtered whether you like it or not. -noFilter will  
never have any effect.

There needs to be a way to have the linkdb reflect all URLs  
(unfiltered) for further analysis. I suggest a -noFilterOutlinks  
(default off) in the fetch command (as the default behavior of fetch  
is to parse.) This would simply not call the filter in  
ParseOutputFormat, if my theory is correct.
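
To make the proposal concrete, here is a toy model in shell. It is not Nutch code: record_outlinks and EXCLUDE_RE are hypothetical stand-ins for the parse step and the URL filters, and -noFilterOutlinks is the flag suggested above, which does not exist yet.

```shell
# Toy model only, not Nutch code.
# EXCLUDE_RE stands in for the configured URL filters (here: reject .png).
EXCLUDE_RE='\.png$'

# Reads outlink URLs on stdin, one per line, and prints the ones that would
# be recorded in the segment.
record_outlinks() {
  if [ "${1:-}" = "-noFilterOutlinks" ]; then
    # Proposed behavior: skip the filter and record every outlink.
    cat
  else
    # Behavior trunk appears to have now: drop outlinks the filters reject.
    # grep -v exits non-zero when it prints nothing, hence "|| true".
    grep -v -E "$EXCLUDE_RE" || true
  fi
}
```

With the flag off, a .png outlink never reaches the segment, so nothing downstream (invertlinks, readlinkdb) can bring it back. With the flag on, it is kept for later analysis.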

Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?

Posted by Brian Whitman <br...@variogr.am>.

On Sep 22, 2007, at 3:37 PM, Brian Whitman wrote:
>
> which worked fine and there were many .png files in the dump.  
> However, with trunk, this doesn't seem to be the case anymore.  
> There are no .png files in the linkdb dump, only html (pretty much  
> the only filetype we allow nutch to download.)
>

More info on this... I noticed that the two readlinkdb outputs with
-noFilter on and off were identical (diff returned nothing).
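
The comparison can be sketched roughly as follows. same_dump is a hypothetical helper, and the linkdb and dump directory names in the comments are placeholders; only the nutch flags already shown in this thread are used.

```shell
# Hypothetical helper: report whether two dump directories have identical
# contents. diff -r compares directories recursively; -q only reports
# whether files differ instead of printing the differences.
same_dump() {
  if diff -r -q "$1" "$2" >/dev/null; then
    echo identical
  else
    echo different
  fi
}

# Intended use (not run here; directory names are placeholders):
#   bin/nutch invertlinks crawl/linkdb_nofilter -dir crawl/segment -noFilter
#   bin/nutch invertlinks crawl/linkdb_filter -dir crawl/segment
#   bin/nutch readlinkdb crawl/linkdb_nofilter -dump dump_nofilter
#   bin/nutch readlinkdb crawl/linkdb_filter -dump dump_filter
#   same_dump dump_nofilter dump_filter
```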

I dumped the segment with readseg, and every URL and outlink: line is
for something that would pass my url filters.

It's not an issue with readseg or readlinkdb themselves, because a
segment fetched with the older nutch (using the exact same
configuration) yields png links in trunk's readlinkdb. It appears the
fetcher now writes only outlinks that pass the filters into the segment.

I assume this behavior is incorrect because otherwise why would  
readlinkdb need a -noFilter? Also, this makes it hard to do what I'm
trying to do: have nutch index text while other tools grab the
binary files.

Any ideas?