Posted to user@nutch.apache.org by reinhard schwab <re...@aon.at> on 2009/07/17 18:43:18 UTC

dump all outlinks

Is any tool available to dump all outlinks, including the ones removed by the URL filters?
(I know the tools to dump crawldb, linkdb and segments.)
Or do I have to implement such a tool, and if so, how?
I want to see the full list so that I can adapt and manage the URL filters.
Should I parse the contents with the URL filters disabled?

reinhard

Re: dump all outlinks

Posted by reinhard schwab <re...@aon.at>.
I have done something like that.
The problem is that the parsers filter the outlinks according to the
configuration of the URL filters.
I had to delete the *parse* directories, reset the URL filters so that
no outlink is filtered out, and then reparse the segments.
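For anyone following along, the cleanup step can be sketched roughly like this. It assumes the usual segment layout, where the parse output lives in crawl_parse, parse_data and parse_text. The function name remove_parse_dirs is made up for illustration, and how you relax the URL filters depends on which filter file your configuration uses, so that part is only a comment.

```shell
# Remove the parse output of one segment so that it can be reparsed.
# Only the *parse* directories are deleted; the fetched content is kept.
# Assumption: parse output lives in crawl_parse, parse_data and parse_text.
remove_parse_dirs() {
    segment=$1
    for d in crawl_parse parse_data parse_text; do
        # ${segment:?} aborts if no segment path was given
        rm -rf "${segment:?segment path required}/$d"
    done
}

# Hypothetical usage, after relaxing the URL filters in your configuration:
# remove_parse_dirs "$segment"
# $NUTCH_HOME/bin/nutch parse "$segment"
```

Try it on a copy of the segment first, since the old parse output is gone after this.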

kevin chen wrote:
> You can dump segment info to a directory, let's say "tmps",
> $NUTCH_HOME/bin/nutch readseg -dump $segment tmps -nocontent
>
> Then, go to the directory, you should see a file "dump"
> grep outlink: dump | cut -f5 -d" " > outlinks
>
> On Fri, 2009-07-17 at 18:43 +0200, reinhard schwab wrote:
>   
>> is any tool available to dump all outlinks (filtered outlinks included)?
>> (i know the tools to dump crawldb, linkdb and segments)
>> or do i have to implement such a tool and if, how?
>> i want to know them to adapt/manage the url filters.
>> parse the contents with urlfilters disabled?
>>
>> reinhard

Re: dump all outlinks

Posted by kevin chen <ke...@bdsing.com>.
You can dump the segment info to a directory, let's say "tmps":
$NUTCH_HOME/bin/nutch readseg -dump $segment tmps -nocontent

Then go to that directory, where you should see a file named "dump":
grep outlink: dump | cut -f5 -d" " > outlinks
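
If you want each outlink only once, a small variation on the pipeline above might look like the sketch below. It assumes the outlink lines in the dump have the form "  outlink: toUrl: <url> anchor: <text>", and the function name extract_outlinks is just for illustration. Matching on fields with awk avoids depending on the exact number of leading spaces that cut -f5 relies on.

```shell
# Print the unique outlink URLs found in a readseg dump file, sorted.
# Assumption: outlink lines look like
#   "  outlink: toUrl: <url> anchor: <text>"
extract_outlinks() {
    awk '$1 == "outlink:" && $2 == "toUrl:" { print $3 }' "$1" | sort -u
}

# Hypothetical usage, after running readseg -dump into "tmps":
# extract_outlinks tmps/dump > outlinks
```

Note that this still only shows the outlinks that survived the URL filters at parse time.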

On Fri, 2009-07-17 at 18:43 +0200, reinhard schwab wrote:
> is any tool available to dump all outlinks (filtered outlinks included)?
> (i know the tools to dump crawldb, linkdb and segments)
> or do i have to implement such a tool and if, how?
> i want to know them to adapt/manage the url filters.
> parse the contents with urlfilters disabled?
> 
> reinhard