Posted to user@nutch.apache.org by Simone Frenzel <ps...@googlemail.com> on 2011/08/08 13:34:38 UTC
Subcollection
Hi,
my Nutch crawl job and the indexing with Solr work fine, except for the
subcollection. I configured subcollections.xml:
<subcollections>
<subcollection>
<name>wiki</name>
<id>wiki</id>
<whitelist>/plugins/mediawiki/wiki/</whitelist>
<blacklist />
</subcollection>
</subcollections>
and added the plugin in nutch-site.xml:
<configuration>
<property>
<name>http.agent.name</name>
<value>mediawiki</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|subcollection|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
When I look at the index with Luke, there is no subcollection field.
Does anybody have experience with this problem, or an idea that may help?
Thanks and greetings
psimone
Re: Subcollection
Posted by psimone <ps...@psimone.de>.
Hi again,
I solved the problem. A subcollection must lie under the start URL.
The crawler only goes deeper into the URL tree and does not move to a URL on the same
level.
Start URL http://xyz.org/hans/
Subcollection xyz.org/sepp/wiki
won't work, even if hans links to sepp.
Start URL http://xyz.org/
Subcollection xyz.org/sepp/wiki
works.
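
For illustration, the working case might look roughly like this in subcollections.xml. This is only a sketch: the name and id "wiki" are carried over from my original config, the whitelist value is the example path from above, and it assumes that a whitelist entry is compared against the page URL as a plain substring. The seed list would then contain http://xyz.org/ so that the crawl can actually reach pages under /sepp/wiki.

```xml
<!-- Sketch only: subcollections.xml for the working case described above.
     Assumption: each whitelist entry is matched as a substring of the page URL. -->
<subcollections>
  <subcollection>
    <name>wiki</name>
    <id>wiki</id>
    <!-- example path from this thread, reachable from the start URL http://xyz.org/ -->
    <whitelist>xyz.org/sepp/wiki</whitelist>
    <blacklist />
  </subcollection>
</subcollections>
```

With the start URL http://xyz.org/hans/ the same file would stay without effect in my setup, because no page under /sepp/wiki was fetched.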