Posted to user@nutch.apache.org by Jigal van Hemert | alterNET internet BV <ji...@alternet.nl> on 2016/04/06 11:38:18 UTC

Configuration of very specific requirements

Hi,

Probably not too complex for those who are used to fiddling with the
configuration, but I could use some pointers on how to achieve the following.

One site is indexed by Nutch. Now it should be limited to the pages that
are linked in the seed URL (no further crawling necessary). Furthermore all
pages must be revisited daily (and new pages must be indexed daily too).

Another wish is to exclude pages with certain content on them. Currently we
do this by a delete query after Nutch finishes. We can keep it this way,
but I wondered if there was a smarter option.

Thanks in advance for pointing me in the right direction.

-- 


Kind regards,


Jigal van Hemert | Developer



Langesteijn 124
3342LG Hendrik-Ido-Ambacht

T. +31 (0)78 635 1200
F. +31 (0)848 34 9697
KvK. 23 09 28 65

jigal@alternet.nl
www.alternet.nl



Re: Configuration of very specific requirements

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Jigal,

>> <property>
>>   <name>scoring.depth.max</name>
>>   <value>2</value>
> Will try that.

Please note that 2 is the right value here; we discussed this off-list
and Julien confirmed that 2 is what your use case needs:
 depth 1  :  fetch seeds only
 depth 2  :  seeds + pages reachable by one link/hop from the seeds
The description does not make this explicit and gives no example.
Feel free to open a Jira issue to improve the description.
Whether you start list indexes or counts from 0 or 1 is a frequent
source of misunderstandings among programmers.
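
For reference, a minimal sketch of the two nutch-site.xml entries involved
(the plugin.includes value below is only an example; take the list you
already use and append scoring-depth to it):

<property>
  <name>plugin.includes</name>
  <!-- example value only: start from your current plugin.includes
       (see conf/nutch-default.xml) and add |scoring-depth -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>scoring.depth.max</name>
  <value>2</value>
</property>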


> Is my assumption correct that if
>
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
>
> is used that only db.fetch.interval.default is used? All the other properties
> are then ignored?

All db.fetch.schedule.adaptive.* properties are ignored then.
db.fetch.interval.max is used to determine when 404 pages
are retried - removed pages may appear again after some time.
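
For a daily recrawl with the default schedule, the relevant nutch-site.xml
entries would look roughly like this (the interval.max value is only an
example; pick whatever retry horizon you want for gone pages):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <!-- 86400 seconds = 1 day -->
  <value>86400</value>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <!-- example only: retry gone/404 pages after at most 30 days -->
  <value>2592000</value>
</property>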

> It sounds really stupid, but the maker of that site does not output a 404
> header, but puts an HTML formatted message on the page like "Code 303
> Description you are not allowed to access this item"
>
> Currently my cron job just calls the solr update handler and sends a delete
> query that searches for content matching "Code 303 Description" (all HTML
> and whitespace are stripped anyway in the solr index) in the stream body.
>
> Writing a plug in to filter this out is indeed cleaner, but the work
> involved is too much compared to what is gained. The workaround does its
> job. If there was a plugin that does this already that would be nice.
>

I once hit exactly the same problem with such "nice" customized 404 pages.
My solution was also to handle it at the index level:
if the layout of the 404 pages changes you can react quickly,
and if the index is not too big it is clean again after a couple of minutes,
while it definitely takes longer to reconfigure the crawler and recrawl
the content (or reparse and reindex).

Cheers,
Sebastian


On 04/06/2016 04:14 PM, Jigal van Hemert | alterNET internet BV wrote:
> Hi Julien and Sebastian,
> 
> Thank you for your replies!
> 
> (both replies had a lot of similarities, so I'll answer them both)
> 
> On 6 April 2016 at 14:16, Sebastian Nagel <wa...@googlemail.com>
> wrote:
> 
>>> One site is indexed by Nutch. Now it should be limited to the pages that
>>> are linked in the seed URL (no further crawling necessary).
>> Have a look at the plugin "scoring-depth" and add to your nutch-site.xml
>> (cf. conf/nutch-default.xml):
>>
>>
>> <!-- scoring-depth properties
>>  Add 'scoring-depth' to the list of active plugins
>>  in the parameter 'plugin.includes' in order to use it.
>>  -->
>>
>> <property>
>>   <name>scoring.depth.max</name>
>>   <value>2</value>
>>   <description>Max depth value from seed allowed by default.
>>   Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
>>   as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
>>   to track the distance from the seed it was found from.
>>   The depth is used to prioritise URLs in the generation step so that
>>   shallower pages are fetched first.
>>   </description>
>> </property>
>>
> 
> Will try that.
> 
> 
>>
>>> Furthermore all
>>> pages must be revisited daily (and new pages must be indexed daily too).
>>
>> See property "db.fetch.interval.default",
>> also take the time to check other
>>   db.fetch.interval.*
>>   db.fetch.schedule.*
>> properties.
>>
> 
> Is my assumption correct that if
> 
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
>   <description>The implementation of fetch schedule. DefaultFetchSchedule
> simply
>   adds the original fetchInterval to the last fetch time, regardless of
>   page changes.</description>
> </property>
> 
> is used that only db.fetch.interval.default is used? All the other properties
> are then ignored?
> 
> 
>>> Another wish is to exclude pages with certain content on them. Currently
>> we
>>> do this by a delete query after Nutch finishes. We can keep it this way,
>>> but I wondered if there was a smarter option.
>>
>> How is such content identified?
>>
> 
> It sounds really stupid, but the maker of that site does not output a 404
> header, but puts an HTML formatted message on the page like "Code 303
> Description you are not allowed to access this item"
> 
> Currently my cron job just calls the solr update handler and sends a delete
> query that searches for content matching "Code 303 Description" (all HTML
> and whitespace are stripped anyway in the solr index) in the stream body.
> 
> Writing a plug in to filter this out is indeed cleaner, but the work
> involved is too much compared to what is gained. The workaround does its
> job. If there was a plugin that does this already that would be nice.
> 


Re: Configuration of very specific requirements

Posted by Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>.
Hi Julien and Sebastian,

Thank you for your replies!

(both replies had a lot of similarities, so I'll answer them both)

On 6 April 2016 at 14:16, Sebastian Nagel <wa...@googlemail.com>
wrote:

> > One site is indexed by Nutch. Now it should be limited to the pages that
> > are linked in the seed URL (no further crawling necessary).
> Have a look at the plugin "scoring-depth" and add to your nutch-site.xml
> (cf. conf/nutch-default.xml):
>
>
> <!-- scoring-depth properties
>  Add 'scoring-depth' to the list of active plugins
>  in the parameter 'plugin.includes' in order to use it.
>  -->
>
> <property>
>   <name>scoring.depth.max</name>
>   <value>2</value>
>   <description>Max depth value from seed allowed by default.
>   Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
>   as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
>   to track the distance from the seed it was found from.
>   The depth is used to prioritise URLs in the generation step so that
>   shallower pages are fetched first.
>   </description>
> </property>
>

Will try that.


>
> > Furthermore all
> > pages must be revisited daily (and new pages must be indexed daily too).
>
> See property "db.fetch.interval.default",
> also take the time to check other
>   db.fetch.interval.*
>   db.fetch.schedule.*
> properties.
>

Is my assumption correct that if

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule
simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property>

is used, only db.fetch.interval.default is used? All the other properties
are then ignored?


> > Another wish is to exclude pages with certain content on them. Currently
> we
> > do this by a delete query after Nutch finishes. We can keep it this way,
> > but I wondered if there was a smarter option.
>
> How is such content identified?
>

It sounds really stupid, but the maker of that site does not output a 404
status code, but puts an HTML-formatted message on the page like "Code 303
Description you are not allowed to access this item".

Currently my cron job just calls the Solr update handler and sends a delete
query that searches the stream body for content matching "Code 303
Description" (all HTML and whitespace are stripped in the Solr index anyway).
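
(For the record, the payload is just a plain delete-by-query POSTed to the
Solr update handler, roughly like the sketch below; the core name in the
URL and the "content" field are specific to our schema and will differ per
setup.)

<!-- POSTed to http://<solr-host>:8983/solr/<core>/update?commit=true
     with Content-Type: text/xml -->
<delete>
  <query>content:"Code 303 Description"</query>
</delete>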

Writing a plugin to filter this out is indeed cleaner, but the work
involved is too much compared to what is gained. The workaround does its
job. If there were already a plugin that does this, that would be nice.


Re: Configuration of very specific requirements

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Jigal,

> One site is indexed by Nutch. Now it should be limited to the pages that
> are linked in the seed URL (no further crawling necessary).
Have a look at the plugin "scoring-depth" and add to your nutch-site.xml
(cf. conf/nutch-default.xml):


<!-- scoring-depth properties
 Add 'scoring-depth' to the list of active plugins
 in the parameter 'plugin.includes' in order to use it.
 -->

<property>
  <name>scoring.depth.max</name>
  <value>2</value>
  <description>Max depth value from seed allowed by default.
  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  to track the distance from the seed it was found from.
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>

> Furthermore all
> pages must be revisited daily (and new pages must be indexed daily too).

See property "db.fetch.interval.default",
also take the time to check other
  db.fetch.interval.*
  db.fetch.schedule.*
properties.

> Another wish is to exclude pages with certain content on them. Currently we
> do this by a delete query after Nutch finishes. We can keep it this way,
> but I wondered if there was a smarter option.

How is such content identified?

Cheers,
Sebastian

On 04/06/2016 11:38 AM, Jigal van Hemert | alterNET internet BV wrote:
> Hi,
> 
> Probably not too complex for those who are used to fiddling with the
> configuration, but I could use some pointers on how to achieve the following.
> 
> One site is indexed by Nutch. Now it should be limited to the pages that
> are linked in the seed URL (no further crawling necessary). Furthermore all
> pages must be revisited daily (and new pages must be indexed daily too).
> 
> Another wish is to exclude pages with certain content on them. Currently we
> do this by a delete query after Nutch finishes. We can keep it this way,
> but I wondered if there was a smarter option.
> 
> Thanks in advance for pointing me in the right direction.
> 


Re: Configuration of very specific requirements

Posted by Julien Nioche <li...@gmail.com>.
Hi Jigal,

You can do this by activating the scoring-depth plugin and setting
scoring.depth.max to 1 in nutch-site.xml.
For the scheduling, simply set

<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
</property>

in nutch-site.xml

Filtering URLs from being indexed based on their content could be done by
writing a custom IndexingFilter and having it return null for the
NutchDocument, e.g. based on an arbitrary metadata key set by a custom
ParseFilter.
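
A rough sketch of what such a filter could look like against the Nutch 1.x
IndexingFilter interface (the class name and the "marked-as-error" metadata
key are made up for illustration; the key would have to be set beforehand by
your own ParseFilter, and the plugin still needs the usual plugin.xml/build
wiring plus an entry in plugin.includes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class ErrorPageIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // "marked-as-error" is a hypothetical key that a custom ParseFilter
    // would set when it recognises the fake error page in the parsed content.
    if (parse.getData().getParseMeta().get("marked-as-error") != null) {
      return null; // returning null drops the document from indexing
    }
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}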

Hope it helps

Julien


2016-04-06 10:38 GMT+01:00 Jigal van Hemert | alterNET internet BV <
jigal@alternet.nl>:

> Hi,
>
> Probably not too complex for those who are used to fiddling with the
> configuration, but I could use some pointers on how to achieve the
> following.
>
> One site is indexed by Nutch. Now it should be limited to the pages that
> are linked in the seed URL (no further crawling necessary). Furthermore all
> pages must be revisited daily (and new pages must be indexed daily too).
>
> Another wish is to exclude pages with certain content on them. Currently we
> do this by a delete query after Nutch finishes. We can keep it this way,
> but I wondered if there was a smarter option.
>
> Thanks in advance for pointing me in the right direction.
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>