Posted to user@nutch.apache.org by Xiao Li <sh...@gmail.com> on 2012/05/04 21:13:30 UTC

Avoid crawling nonsense calendar webpage

Hi Nutch people,

I am using Nutch to index a website. I have noticed that Nutch has
crawled some junk webpages, such as
http://**************/category/events/2015-11. This page lists the
events for November 2015, which is complete nonsense for me. I want to
know whether it is possible for Nutch to intelligently skip such
webpages. It may be argued that I could use a regex to avoid this.
However, since the naming patterns of calendar webpages are not the
same all the time, there is no way to write one perfect regex for all
of them. I know Heritrix (the Internet Archive's crawler) is able to
avoid crawling nonsense calendar webpages. Has anyone solved this
issue?
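
To illustrate, a rule like the following in conf/regex-urlfilter.txt
would catch this particular pattern (a sketch for this one URL scheme
only; other sites name their calendar pages differently):

    # reject URLs whose path ends in a year-month segment such as /2015-11
    -/(19|20)[0-9]{2}-[0-9]{2}/?$

But maintaining such rules for every site does not scale.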

Regards
Xiao

Re: Avoid crawling nonsense calendar webpage

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

This is a tough problem indeed. We partially mitigate it by using
several regular expressions, LinkRank scores, and a domain-limiting
generator for regular crawls, plus a second shallow crawl that only
follows links from the home page.

A custom URLFilter as Ferdy explains is a good idea indeed. However,
URLFilters operate on single URLs only, which is as difficult as
creating regular expressions. If we could process all outlinks of a
given page at the same time, it would be easier to compare them,
calculate similarity, and discard those we consider unwanted calendar
pages.
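
As a rough sketch of that idea (not an existing Nutch extension point,
just the heuristic in plain Java), one could bucket a page's outlinks
by their "shape" and drop buckets that explode the way generated
calendar navigation does:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Group outlinks by a shape key (digit runs collapsed to "N") and
    // discard any group with many near-identical URLs, which is typical
    // of generated calendar navigation.
    public class OutlinkShapeHeuristic {

      static final int MAX_SIMILAR = 5; // arbitrary threshold for this sketch

      public static List<String> filterOutlinks(List<String> outlinks) {
        Map<String, List<String>> byShape = new HashMap<String, List<String>>();
        for (String url : outlinks) {
          String shape = url.replaceAll("[0-9]+", "N");
          List<String> bucket = byShape.get(shape);
          if (bucket == null) {
            bucket = new ArrayList<String>();
            byShape.put(shape, bucket);
          }
          bucket.add(url);
        }
        List<String> kept = new ArrayList<String>();
        for (List<String> group : byShape.values()) {
          if (group.size() <= MAX_SIMILAR) { // small groups look hand-made
            kept.addAll(group);
          }
        }
        return kept;
      }
    }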

Can you explain how Heritrix does it? Perhaps we can learn from it.

Cheers,
Markus


Re: Avoid crawling nonsense calendar webpage

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

Fetching unwanted pages, such as the dynamically generated pages in
this case, is a general problem. Currently I'm not aware of any pending
improvements in this area, but feel free to contribute if you have a
solution. Probably the best way to solve such a problem is by
implementing a custom URLFilter. This filter could have some heuristics
that are able to detect dynamically generated URLs.
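
A minimal sketch of such a filter, assuming the Nutch URLFilter
extension point (filter() returns the URL to keep it, or null to
discard it; the class name and the one-year window are made up for
illustration):

    import java.util.Calendar;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Hypothetical heuristic: reject URLs ending in a year-month
    // segment whose year is implausibly far from the current year.
    public class CalendarURLFilter implements URLFilter {

      private static final Pattern YEAR_MONTH =
          Pattern.compile("/((?:19|20)[0-9]{2})-(?:0?[1-9]|1[0-2])/?$");

      private Configuration conf;

      public String filter(String urlString) {
        Matcher m = YEAR_MONTH.matcher(urlString);
        if (m.find()) {
          int year = Integer.parseInt(m.group(1));
          int now = Calendar.getInstance().get(Calendar.YEAR);
          if (Math.abs(year - now) > 1) {
            return null; // looks like a generated calendar page; skip it
          }
        }
        return urlString; // accept everything else
      }

      public void setConf(Configuration conf) { this.conf = conf; }

      public Configuration getConf() { return conf; }
    }

As with any URLFilter, it would still need its own plugin.xml and an
entry in the plugin.includes property to be picked up.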

Ferdy.
