Posted to user@nutch.apache.org by Simone Frenzel <ps...@googlemail.com> on 2011/08/08 13:34:38 UTC

Subcollection

Hi,
my Nutch crawl job and the indexing with Solr work fine, except for the
subcollection. I configured subcollections.xml:
<subcollections>
    <subcollection>
        <name>wiki</name>
        <id>wiki</id>
        <whitelist>/plugins/mediawiki/wiki/</whitelist>
        <blacklist />
    </subcollection>
</subcollections>

and added the plugin in nutch-site.xml:
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>mediawiki</value>
    </property>


    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|subcollection|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

When I look at the index with Luke, there is no subcollection field.

Does anybody have experience with this problem, or an idea that might help?
Thanks and greetings,

psimone
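
For readers following along, the intent of the subcollections.xml above can be pictured with a small Python sketch. It assumes that whitelist and blacklist entries are matched as plain substrings of the page URL, which is not confirmed anywhere in this thread. The function name `matching_subcollections` and the example.org URLs are hypothetical and are not part of Nutch.

```python
# Hypothetical model of the subcollections.xml shown above; this is not Nutch code.
SUBCOLLECTIONS = [
    {"name": "wiki", "whitelist": ["/plugins/mediawiki/wiki/"], "blacklist": []},
]


def matching_subcollections(url, subcollections=SUBCOLLECTIONS):
    """Return the names of subcollections that would tag this URL.

    Assumption: an entry matches when it occurs anywhere in the URL,
    and a blacklist match overrides a whitelist match.
    """
    names = []
    for sub in subcollections:
        if any(entry in url for entry in sub["blacklist"]):
            continue  # blacklisted URLs are never tagged
        if any(entry in url for entry in sub["whitelist"]):
            names.append(sub["name"])
    return names


print(matching_subcollections("http://example.org/plugins/mediawiki/wiki/Main_Page"))
print(matching_subcollections("http://example.org/about.html"))
```

Under that assumption, only pages whose URL contains /plugins/mediawiki/wiki/ would receive the subcollection value "wiki"; any other page would be indexed without the field.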

Re: Subcollection

Posted by psimone <ps...@psimone.de>.
Hi again,
I solved the problem. A subcollection must lie below the start URL.
The crawler only goes deeper in the URL tree and does not move to a URL on
the same level.

Start URL:     http://xyz.org/hans/
Subcollection: xyz.org/sepp/wiki
won't work, even if hans links to sepp.

Start URL:     http://xyz.org/
Subcollection: xyz.org/sepp/wiki
works.
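
The two cases above can be checked with a minimal Python sketch. It only models the behaviour described in this message, namely that the crawl stays below the start URL; it is not how Nutch itself decides which links to follow, which depends on the crawl's URL filter configuration. The helper name `is_under_start_url` and the page name `Main_Page` are hypothetical.

```python
def is_under_start_url(start_url, candidate_url):
    """Assumed scope rule: a page is only fetched if its URL starts with the start URL."""
    return candidate_url.startswith(start_url)


wiki_page = "http://xyz.org/sepp/wiki/Main_Page"  # hypothetical page in the subcollection

# Start URL on a sibling path: the wiki page is never fetched, so it is never tagged.
print(is_under_start_url("http://xyz.org/hans/", wiki_page))

# Start URL at the site root: the wiki page is in scope and can be tagged.
print(is_under_start_url("http://xyz.org/", wiki_page))
```

A page that is never fetched cannot reach the indexer, so no subcollection field can be written for it, whatever the whitelist says.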
