Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2013/12/14 18:50:07 UTC
webgraph in limited domain
Dear nutchers,
I have a large set of domains and many URLs which I want to process. I
only want to crawl pages from those domains, but I am interested in all
outlinks, regardless of whether they point inside or outside those domains.
I am using the property db.ignore.external.links=true, and I want to create
a webgraphdb. Currently, I am getting an empty webgraphdb.
In org/apache/nutch/parse/ParseOutputFormat.java, anchors pointing outside
the source domain are already filtered out during the parse phase and never
make their way into the parse data. I had hoped this would happen at a
later stage.
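To illustrate what I mean, here is a simplified sketch of the filtering as I understand it. It is not the actual Nutch code: the class name, the method name and the example URLs are made up for illustration, and I am assuming the check is a plain host comparison between the source page and each outlink.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExternalLinkFilterSketch {

    // Hypothetical helper (not part of Nutch): returns the lowercased host
    // of a URL, or null if the URL cannot be parsed or has no host.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Assumed behaviour: when ignoreExternalLinks is true, an outlink is
    // kept only if its host equals the host of the page it was found on.
    public static List<String> filterOutlinks(String fromUrl,
                                              List<String> outlinks,
                                              boolean ignoreExternalLinks) {
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (ignoreExternalLinks) {
                String toHost = hostOf(toUrl);
                if (toHost == null || !toHost.equals(fromHost)) {
                    continue; // external link: dropped before it is stored
                }
            }
            kept.add(toUrl);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.org/b.html",        // same host as the source page
                "http://other.example.com/c.html"); // different host
        // With the flag set, only the same-host link survives.
        System.out.println(filterOutlinks("http://example.org/a.html", outlinks, true));
    }
}
```

With the flag set to true, the external link never reaches the stored outlinks, so a later step that builds the webgraph has no chance to see it.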
Is there any (hackish) way to keep the external outlinks for the webgraph
while still crawling only my own domains?
Any suggestions are very welcome.
Martin