Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/09/12 16:33:53 UTC

Relative outlinks without base

Hi,

Would it not be a good idea to patch DomContentUtils with an option not to 
consider relative outlinks without a base URL? This example [1] will currently 
quickly take over the crawl db and produce countless unique URLs that cannot 
be filtered out with the regex that detects repeating URI segments.

There are many websites on the internet that suffer from this problem.

A patch would protect against this common crawler trap, but not against 
incorrect absolute URLs - ones that are supposed to be absolute but, for 
example, have an incorrect protocol scheme.

[1]: http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
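For readers unfamiliar with the failure mode: the proposed option would skip relative hrefs whenever the page declares no usable base URL, so they can never be resolved into ever-growing paths. A minimal sketch of that idea using plain java.net.URI — this is not the actual DomContentUtils code, and the class and method names are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

public class OutlinkResolver {

    // Returns resolved absolute outlinks. Relative hrefs are dropped when
    // baseUrl is null, i.e. the page declared no usable base URL - this is
    // the behaviour the proposed option would enable.
    static List<String> resolveOutlinks(String baseUrl, List<String> hrefs) {
        List<String> out = new ArrayList<>();
        for (String href : hrefs) {
            try {
                URI u = new URI(href);
                if (u.isAbsolute()) {
                    out.add(u.toString());            // keep absolute links as-is
                } else if (baseUrl != null) {
                    out.add(new URI(baseUrl).resolve(u).toString());
                }
                // else: relative link without a base -> skipped entirely
            } catch (URISyntaxException e) {
                // malformed href, ignore it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("item/1/", "http://example.com/a");
        // With no base URL, only the absolute link survives.
        System.out.println(resolveOutlinks(null, hrefs));
        // With a base, the relative link is resolved against it.
        System.out.println(resolveOutlinks("http://example.com/x/", hrefs));
    }
}
```

The real parser would have to distinguish "no <base href> in the document" from "base defaulted to the page URL"; that policy decision is exactly what the option would control.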

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: need help

Posted by Markus Jelsma <ma...@openindex.io>.
How to ask questions: http://catb.org/~esr/faqs/smart-questions.html

And please do not hijack other people's threads.


On Wednesday 14 September 2011 15:36:40 Marlen wrote:
> Hello everyone! I'm using Nutch 1.2, but I'm having a problem with the
> indexing: it's not indexing the .pdf files, the ones that are protected.
> Could anyone tell me what's going on, please?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Relative outlinks without base

Posted by Markus Jelsma <ma...@openindex.io>.

On Tuesday 13 September 2011 13:12:41 Alexander Aristov wrote:
> Yes, you can stop them, but how do you know whether a URL is good or not?
> You can use a URL filter to discard unwanted URLs.

We see that many sites with relative URLs and no base href produce erroneous 
links. As in the example there is a pattern, but sometimes the pattern is 
hard to find.

If we can stop relative URLs without a base href for now, we can at least 
continue crawling. Right now we need to manually check samples of the crawled 
URLs (millions of them) for crawler traps, which we (for now) add to the regex 
filter.
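To illustrate why the stock repeating-segment regex misses this trap: the rule shipped in Nutch's default regex-urlfilter.txt (quoted here from memory — check your own template) only fires when the repeats sit exactly one segment apart, while the Holland Opera loop interleaves several segments between repeats. A sketch comparing it with a broader, hypothetical variant that counts a segment occurring three or more times anywhere in the URL:

```java
import java.util.regex.Pattern;

public class TrapFilter {
    // Rule in the spirit of the default regex-urlfilter.txt: a segment that
    // repeats with exactly one other segment in between ("-" prefix omitted).
    static final Pattern DEFAULT_RULE =
        Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/.*");

    // Hypothetical broader variant: any slash-delimited segment occurring
    // three or more times anywhere in the URL, whatever sits between.
    static final Pattern ANY_REPEAT =
        Pattern.compile(".*(/[^/]+/).*\\1.*\\1.*");

    static boolean defaultRuleMatches(String url) {
        return DEFAULT_RULE.matcher(url).matches();
    }

    static boolean anyRepeatMatches(String url) {
        return ANY_REPEAT.matcher(url).matches();
    }

    public static void main(String[] args) {
        // The trap URL after one round of bogus relative-link resolution:
        String trap = "http://www.hollandopera.nl/voorstellingen/archief/"
            + "voorstellingen/item/1/voorstellingen/archief/voorstellingen/item/1/";
        System.out.println(defaultRuleMatches(trap)); // false: repeats too far apart
        System.out.println(anyRepeatMatches(trap));   // true: "voorstellingen" occurs 4x
    }
}
```

The broader rule is not a free lunch: it would also reject legitimate URLs such as /news/2011/news/2012/news/, so it trades trap protection for false positives — which may be why it is not the default.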

We do need to look for a better solution, but we cannot work on a solution and 
manually test the crawldb at the same time. One related problem is detecting 
calendars / agendas.
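Calendar and agenda traps (pages that paginate into the infinite past or future) can sometimes be caught with a crude year-range heuristic. The following is a hypothetical sketch, not anything shipped with Nutch; the method name and thresholds are invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CalendarGuard {
    // Match any standalone 4-digit number that looks like a year (1000-2999).
    static final Pattern YEAR =
        Pattern.compile("(?<!\\d)([12][0-9]{3})(?!\\d)");

    // Reject URLs containing a year outside the plausible range - a common
    // symptom of calendar pages that link ever further into the future.
    static boolean plausible(String url, int minYear, int maxYear) {
        Matcher m = YEAR.matcher(url);
        while (m.find()) {
            int y = Integer.parseInt(m.group(1));
            if (y < minYear || y > maxYear) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(plausible("http://host/agenda/2150/01/", 1995, 2013)); // false
        System.out.println(plausible("http://host/news/2011/09/13/", 1995, 2013)); // true
    }
}
```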

> 
> Dedup works to remove old/obsolete content, and you cannot check it without
> downloading it.
> 
> Best Regards
> Alexander Aristov
> 
> > On 13 September 2011 14:57, Markus Jelsma <ma...@openindex.io> wrote:
> > Yes, we use several deduplication mechanisms and they work fine. The
> > problem
> > is wasting a lot of CPU cycles for nothing. Why not stop those unwanted
> > URL's
> > from entering the CrawlDB in the first place instead of getting rid of
> > them afterwards?
> > 
> > Growth of the CrawlDB is something very significant, especially with
> > thousands
> > of long URL's.
> > 
> > On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> > > Hi Markus,
> > > 
> > > Please correct me if I'm wrong, but isn't there a document signature
> > > check to detect if the page contains same content with some other
> > > already parsed and indexed.
> > > 
> > > Dinçer
> > > 
> > > 2011/9/12 Markus Jelsma <ma...@openindex.io>
> > > 
> > > > Hi,
> > > > 
> > > > Would it not be a good idea to patch DomContentUtils with an option
> > > > not to consider relative outlinks without a base url? This example
> > > > [1] will currently quickly take over the crawl db and produce
> > > > countless unique URL's that cannot be filtered out with the regex
> > > > that detects repeating URI segments.
> > > > 
> > > > There are many websites on the internet that suffer from this
> > > > problem.
> > > > 
> > > > A patch would protect this common crawler trap but not against
> > > > incorrect absolute URL's - one that is supposed to be absolute but
> > > > for example has an incorrect protocol scheme.
> > > > 
> > > > [1]:
> > > > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > > > 
> > > > Cheers,
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

need help

Posted by Marlen <zm...@facinf.uho.edu.cu>.
Hello everyone! I'm using Nutch 1.2, but I'm having a problem with the 
indexing: it's not indexing the .pdf files, the ones that are protected. 
Could anyone tell me what's going on, please?

Re: Relative outlinks without base

Posted by Alexander Aristov <al...@gmail.com>.
Yes, you can stop them, but how do you know whether a URL is good or not?
You can use a URL filter to discard unwanted URLs.

Dedup works to remove old/obsolete content, and you cannot check it without
downloading it.

Best Regards
Alexander Aristov


On 13 September 2011 14:57, Markus Jelsma <ma...@openindex.io> wrote:

> Yes, we use several deduplication mechanisms and they work fine. The
> problem
> is wasting a lot of CPU cycles for nothing. Why not stop those unwanted
> URL's
> from entering the CrawlDB in the first place instead of getting rid of them
> afterwards?
>
> Growth of the CrawlDB is something very significant, especially with
> thousands
> of long URL's.
>
> On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> > Hi Markus,
> >
> > Please correct me if I'm wrong, but isn't there a document signature
> > check to detect if the page contains same content with some other
> > already parsed and indexed.
> >
> > Dinçer
> >
> > 2011/9/12 Markus Jelsma <ma...@openindex.io>
> >
> > > Hi,
> > >
> > > Would it not be a good idea to patch DomContentUtils with an option not
> > > to consider relative outlinks without a base url? This example [1] will
> > > currently
> > > quickly take over the crawl db and produce countless unique URL's that
> > > cannot
> > > be filtered out with the regex that detects repeating URI segments.
> > >
> > > There are many websites on the internet that suffer from this problem.
> > >
> > > A patch would protect this common crawler trap but not against
> > > incorrect absolute URL's - one that is supposed to be absolute but
> > > for example has an incorrect protocol scheme.
> > >
> > > [1]:
> > > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > >
> > > Cheers,
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

Re: Relative outlinks without base

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, we use several deduplication mechanisms and they work fine. The problem 
is wasting a lot of CPU cycles for nothing. Why not stop those unwanted URLs 
from entering the CrawlDB in the first place instead of getting rid of them 
afterwards?

Growth of the CrawlDB is very significant, especially with thousands of long 
URLs.
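For context, a content signature is by construction computed from the fetched bytes, which is exactly why dedup cannot save the fetch and parse cost. A toy sketch of the principle — Nutch's MD5Signature works on the same idea, but this is not its code, and the class and method names are invented:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class SignatureDedup {
    // Hash of the page content; only computable AFTER the page is fetched,
    // so every trap URL still costs a full fetch + parse before it is dropped.
    static String signature(String content) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    // Keeps only the first URL seen for each content signature; returns true
    // when the content was already seen under another URL.
    static boolean isDuplicate(Map<String, String> seen, String url, String content) {
        return seen.putIfAbsent(signature(content), url) != null;
    }

    public static void main(String[] args) {
        Map<String, String> seen = new HashMap<>();
        System.out.println(isDuplicate(seen, "http://h/page", "<html>body</html>"));
        System.out.println(isDuplicate(seen, "http://h/page/x/page", "<html>body</html>"));
    }
}
```

This illustrates the trade-off in the thread: dedup catches the duplicates reliably, but a URL filter rejects them before any bandwidth or CrawlDB space is spent.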

On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> Hi Markus,
> 
> Please correct me if I'm wrong, but isn't there a document signature check
> to detect whether a page contains the same content as some other page that
> was already parsed and indexed?
> 
> Dinçer
> 
> 2011/9/12 Markus Jelsma <ma...@openindex.io>
> 
> > Hi,
> > 
> > Would it not be a good idea to patch DomContentUtils with an option not
> > to consider relative outlinks without a base url? This example [1] will
> > currently
> > quickly take over the crawl db and produce countless unique URL's that
> > cannot
> > be filtered out with the regex that detects repeating URI segments.
> > 
> > There are many websites on the internet that suffer from this problem.
> > 
> > A patch would protect this common crawler trap but not against incorrect
> > absolute URL's - one that is supposed to be absolute but for example has
> > an incorrect protocol scheme.
> > 
> > [1]:
> > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > 
> > Cheers,
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Relative outlinks without base

Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi Markus,

Please correct me if I'm wrong, but isn't there a document signature check
to detect whether a page contains the same content as some other page that
was already parsed and indexed?

Dinçer

2011/9/12 Markus Jelsma <ma...@openindex.io>

> Hi,
>
> Would it not be a good idea to patch DomContentUtils with an option not to
> consider relative outlinks without a base url? This example [1] will
> currently
> quickly take over the crawl db and produce countless unique URL's that
> cannot
> be filtered out with the regex that detects repeating URI segments.
>
> There are many websites on the internet that suffer from this problem.
>
> A patch would protect this common crawler trap but not against incorrect
> absolute URL's - one that is supposed to be absolute but for example has an
> incorrect protocol scheme.
>
> [1]:
> http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
>
> Cheers,
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>