Posted to user@nutch.apache.org by Xiao Li <sh...@gmail.com> on 2012/05/04 21:13:30 UTC
Avoid crawling nonsense calendar webpage
Hi Nutch people,
I am using Nutch to index a website. I have noticed that Nutch crawls
some junk webpages, such as
http://**************/category/events/2015-11. This page lists events
for November 2015, which is useless for my purposes. I would like to
know whether Nutch can intelligently skip such pages. One could argue
that a regex would avoid this; however, since the naming patterns of
calendar pages vary, there is no way to write a perfect regex for
this. I know that Heritrix (the Internet Archive's crawler) has the
ability to avoid crawling such nonsense calendar pages. Has anyone
solved this issue?
Regards
Xiao
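As a partial stopgap for the specific pattern above, date-like path segments can be rejected in Nutch's conf/regex-urlfilter.txt (rules are applied top to bottom; a leading `-` rejects a URL matching the pattern, `+` accepts it). These patterns are illustrative only and, as noted, no fixed list covers every calendar layout:

```
# Reject date-like path segments, e.g. /category/events/2015-11
-/[0-9]{4}-[0-9]{1,2}(/|$)
-/[0-9]{4}/[0-9]{1,2}(/|$)
# Accept anything else
+.
```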
Re: Avoid crawling nonsense calendar webpage
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
This is a tough problem indeed. We partially mitigate it by using
several regular expressions, LinkRank scores and a domain-limiting
generator for regular crawls, plus a second shallow crawl that only
follows links from the home page.
A custom URLFilter, as Ferdy explains, is a good idea indeed. However,
URLFilters operate on single URLs only, which is as difficult as
writing regular expressions. If we could process all outlinks of a
given page at the same time, it would be easier to compare them,
compute their similarity and, if needed, discard the ones we consider
unwanted calendars.
Can you explain how Heritrix does it? Perhaps we can learn from it.
Cheers,
Markus
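A minimal sketch of the page-level idea Markus describes, assuming a hypothetical post-parse step rather than an actual Nutch plugin: mask digit runs in each outlink to get a URL template, then drop any template group that expands into suspiciously many near-identical URLs (the threshold here is an assumption that would need tuning):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutlinkCalendarFilter {
    // Assumption: how many URLs may share one digit-masked template
    // before the whole group is treated as auto-generated.
    static final int MAX_PER_TEMPLATE = 5;

    // Mask every run of digits so /events/2015-11 and /events/2015-12
    // collapse to the same template /events/N-N.
    static String template(String url) {
        return url.replaceAll("\\d+", "N");
    }

    // Keep only outlinks whose template group stays under the threshold.
    static List<String> filterOutlinks(List<String> outlinks) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String url : outlinks) {
            groups.computeIfAbsent(template(url), k -> new ArrayList<>()).add(url);
        }
        List<String> kept = new ArrayList<>();
        for (List<String> group : groups.values()) {
            if (group.size() <= MAX_PER_TEMPLATE) {
                kept.addAll(group);
            }
        }
        return kept;
    }
}
```

For a page whose outlinks include twelve /category/events/2015-MM links plus a few ordinary pages, the calendar group collapses to one template and is discarded as a whole, while the ordinary pages survive.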
On Sat, 5 May 2012 12:44:27 +0200, Ferdy Galema
<fe...@kalooga.com> wrote:
> Hi,
>
> Fetching unwanted pages, such as (in this case) dynamically generated
> pages, is a general problem. Currently I'm not aware of any pending
> improvements in this area, but feel free to contribute if you have a
> solution. Probably the best way to solve such a problem is by
> implementing a custom URLFilter. This filter could use heuristics to
> detect dynamically generated URLs.
>
> Ferdy.
>
> On Fri, May 4, 2012 at 9:13 PM, Xiao Li <sh...@gmail.com>
> wrote:
>
>> Hi Nutch people,
>>
>> I am using Nutch to index a website. I have noticed that Nutch crawls
>> some junk webpages, such as
>> http://**************/category/events/2015-11. This page lists events
>> for November 2015, which is useless for my purposes. I would like to
>> know whether Nutch can intelligently skip such pages. One could argue
>> that a regex would avoid this; however, since the naming patterns of
>> calendar pages vary, there is no way to write a perfect regex for
>> this. I know that Heritrix (the Internet Archive's crawler) has the
>> ability to avoid crawling such nonsense calendar pages. Has anyone
>> solved this issue?
>>
>> Regards
>> Xiao
>>
Re: Avoid crawling nonsense calendar webpage
Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,
Fetching unwanted pages, such as (in this case) dynamically generated
pages, is a general problem. Currently I'm not aware of any pending
improvements in this area, but feel free to contribute if you have a
solution. Probably the best way to solve such a problem is by
implementing a custom URLFilter. This filter could use heuristics to
detect dynamically generated URLs.
Ferdy.
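A self-contained sketch of such a heuristic, under the assumption that a handful of date-shaped patterns cover the site in question (the pattern list is illustrative and would need per-site tuning). The static method mirrors the contract of Nutch's URLFilter.filter(String), which returns the URL to keep it or null to reject it; in a real plugin this logic would live inside a class implementing that interface:

```java
import java.util.regex.Pattern;

public class CalendarUrlHeuristic {
    // Heuristic patterns for date-like URL parts; an assumption, not
    // an exhaustive list of calendar URL shapes.
    private static final Pattern[] DATE_PATTERNS = {
        Pattern.compile("/\\d{4}-\\d{1,2}(/|$)"),             // /2015-11
        Pattern.compile("/\\d{4}/\\d{1,2}(/\\d{1,2})?(/|$)"), // /2015/11/04
        Pattern.compile("[?&](year|month|day|date)=\\d+",
                        Pattern.CASE_INSENSITIVE)              // ?month=11
    };

    // Same contract as Nutch's URLFilter.filter(String):
    // return the URL to keep it, or null to drop it.
    public static String filter(String url) {
        for (Pattern p : DATE_PATTERNS) {
            if (p.matcher(url).find()) {
                return null;
            }
        }
        return url;
    }
}
```

This rejects the /category/events/2015-11 example from the original question while passing ordinary article URLs through unchanged; like any single-URL filter, though, it cannot exploit the page-level similarity between sibling calendar links.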
On Fri, May 4, 2012 at 9:13 PM, Xiao Li <sh...@gmail.com> wrote:
> Hi Nutch people,
>
> I am using Nutch to index a website. I have noticed that Nutch crawls
> some junk webpages, such as
> http://**************/category/events/2015-11. This page lists events
> for November 2015, which is useless for my purposes. I would like to
> know whether Nutch can intelligently skip such pages. One could argue
> that a regex would avoid this; however, since the naming patterns of
> calendar pages vary, there is no way to write a perfect regex for
> this. I know that Heritrix (the Internet Archive's crawler) has the
> ability to avoid crawling such nonsense calendar pages. Has anyone
> solved this issue?
>
> Regards
> Xiao
>