Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2013/12/14 18:50:07 UTC

webgraph in limited domain

Dear nutchers,

I have a large set of domains and many URLs that I want to process. I
only want to crawl pages from those domains, but I am interested in all
outlinks, regardless of whether they point inside those domains or not.

I am using the property db.ignore.external.links=true, and I want to
create a webgraphdb. Currently, I am getting an empty webgraphdb.
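
For reference, this is roughly how that property would be set as an override in conf/nutch-site.xml. The file location and the surrounding configuration element are assumptions about a typical Nutch 1.x setup; only the property name and value come from my configuration.

```xml
<configuration>
  <!-- Assumed to live in conf/nutch-site.xml; drops outlinks that leave the source page's host -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```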

In org/apache/nutch/parse/ParseOutputFormat.java, outlinks to other
domains are already filtered out during the parse phase and never make
it into the parse data. I had hoped this would happen at a later stage.
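
To illustrate what I understand to be happening at parse time, here is a simplified sketch in plain Java. It is not the Nutch source: the names OutlinkFilterSketch, hostOf and keepOutlinks are made up for this example, and the host comparison is only my reading of the behaviour. It shows why nothing external is left for a later step (such as the webgraph) to see once the flag is on.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkFilterSketch {

    // Hypothetical helper: lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Assumed behaviour: with ignoreExternalLinks set, an outlink is kept only
    // if its host equals the host of the page it was found on.
    static List<String> keepOutlinks(String fromUrl, List<String> outlinks,
                                     boolean ignoreExternalLinks) {
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (ignoreExternalLinks) {
                String toHost = hostOf(toUrl);
                if (toHost == null || !toHost.equals(fromHost)) {
                    continue; // external link is dropped before anything is written
                }
            }
            kept.add(toUrl);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.org/b",
                "http://other.example.com/c");
        // Flag on: only the same-host link survives.
        System.out.println(keepOutlinks("http://example.org/a", outlinks, true));
        // Flag off: both links survive.
        System.out.println(keepOutlinks("http://example.org/a", outlinks, false));
    }
}
```

With the flag on, the external link is gone before any data is written, which matches the empty webgraphdb I am seeing.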

Is there any (hackish) way to keep the external outlinks for the
webgraph while still crawling only my domains?

Any suggestions are very welcome.

Martin