Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/09/12 16:33:53 UTC
Relative outlinks without base
Hi,
Would it not be a good idea to patch DomContentUtils with an option not to
consider relative outlinks without a base URL? This example [1] will quickly
take over the CrawlDB and produce countless unique URLs that cannot be
filtered out with the regex that detects repeating URI segments.

Many websites on the internet suffer from this problem.

A patch would protect against this common crawler trap, but not against
incorrect absolute URLs - ones that are supposed to be absolute but, for
example, have an incorrect protocol scheme.
[1]: http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
Cheers,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
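The failure mode is easy to reproduce with a quick sketch: when a page emits a relative href that mirrors its own path and declares no base, each resolution round appends the segments again (the URLs below are illustrative, not taken from the site in [1]):

```python
from urllib.parse import urljoin

# Illustrative page URL and relative href; page declares no <base href>.
url = "http://example.com/voorstellingen/item/1/"
rel = "voorstellingen/item/1/"

# Each crawl round resolves the same relative link against the URL
# discovered in the previous round, yielding a new, ever-longer URL.
for _ in range(3):
    url = urljoin(url, rel)
    print(url)
```

Every round yields a URL one repetition longer, so each generation is "unique" to the CrawlDB even though they all fetch the same page.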
Re: need help
Posted by Markus Jelsma <ma...@openindex.io>.
How to ask questions: http://catb.org/~esr/faqs/smart-questions.html
And please do not hijack other people's threads.
On Wednesday 14 September 2011 15:36:40 Marlen wrote:
> hello everyone! I'm using Nutch 1.2 but I'm having a problem with
> indexing: it's not indexing .pdf files, the ones that are protected.
> Could anyone tell me what's going on? Please!
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Relative outlinks without base
Posted by Markus Jelsma <ma...@openindex.io>.
On Tuesday 13 September 2011 13:12:41 Alexander Aristov wrote:
> Yes, you can stop them, but how do you know whether a URL is good or not?
> You can use a URL filter to discard unwanted URLs.
We see that many sites with relative URLs and no base href produce erroneous
links. As in the example there is usually a pattern, but sometimes the pattern
is hard to find.
If we can stop relative URLs without a base href for now, we can at least
continue crawling. Right now we have to manually check samples of the crawled
URLs (millions of them) for crawler traps, which we (for now) add to the regex
filter. We do need a better solution, but we cannot work on one and manually
test the CrawlDB at the same time. One related problem is detecting
calendars/agendas.
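For reference, a repeating-segment rule in the spirit of the one shipped in conf/regex-urlfilter.txt can be tried in isolation (the pattern below is an approximation, not necessarily the exact shipped rule):

```python
import re

# Approximation of a repeating-segment rule: reject URLs in which some
# path segment occurs three times with one other segment in between.
trap = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

bad = "http://example.com/voorstellingen/item/voorstellingen/item/voorstellingen/item/"
good = "http://example.com/voorstellingen/archief/item/1/"

print(bool(trap.match(bad)))   # True: segments repeat, looks like a trap
print(bool(trap.match(good)))  # False: no repetition yet
```

The catch, as described above, is that such a rule only fires after a few repetitions, so the first generations of trap URLs still enter the CrawlDB.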
>
> Dedup works to remove old/obsolete content, and you cannot check content
> without downloading it.
>
> Best Regards
> Alexander Aristov
>
> > On 13 September 2011 14:57, Markus Jelsma <ma...@openindex.io> wrote:
> > Yes, we use several deduplication mechanisms and they work fine. The
> > problem is wasting a lot of CPU cycles for nothing. Why not stop those
> > unwanted URLs from entering the CrawlDB in the first place instead of
> > getting rid of them afterwards?
> >
> > Growth of the CrawlDB is very significant, especially with thousands of
> > long URLs.
> >
> > On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> > > Hi Markus,
> > >
> > > Please correct me if I'm wrong, but isn't there a document signature
> > > check to detect if the page contains the same content as some other
> > > page already parsed and indexed?
> > >
> > > Dinçer
> > >
> > > 2011/9/12 Markus Jelsma <ma...@openindex.io>
> > >
> > > > Hi,
> > > >
> > > > Would it not be a good idea to patch DomContentUtils with an option
> > > > not to consider relative outlinks without a base URL? This example
> > > > [1] will quickly take over the CrawlDB and produce countless unique
> > > > URLs that cannot be filtered out with the regex that detects
> > > > repeating URI segments.
> > > >
> > > > Many websites on the internet suffer from this problem.
> > > >
> > > > A patch would protect against this common crawler trap, but not
> > > > against incorrect absolute URLs - ones that are supposed to be
> > > > absolute but, for example, have an incorrect protocol scheme.
> > > >
> > > > [1]:
> > > > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > > >
> > > > Cheers,
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
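The option proposed at the top of this thread could look roughly like the following (a hypothetical sketch in Python, not actual DomContentUtils code; the function and parameter names are invented):

```python
from urllib.parse import urljoin, urlparse

def extract_outlinks(page_url, hrefs, base_href=None,
                     ignore_relative_without_base=True):
    """Resolve outlinks, optionally dropping relative hrefs when the
    page declares no <base href> (the behaviour proposed above)."""
    links = []
    for href in hrefs:
        if not urlparse(href).scheme:  # no scheme => relative link
            if base_href is None and ignore_relative_without_base:
                continue  # would resolve against the page's own path: skip
            links.append(urljoin(base_href or page_url, href))
        else:
            links.append(href)
    return links

print(extract_outlinks("http://example.com/a/b/",
                       ["c/d/", "http://other.example/x"]))
# ['http://other.example/x']
```

As the thread points out, this only guards against the relative-link trap: a malformed absolute URL (say, with a bad scheme such as htp://) still carries a scheme and passes straight through.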
need help
Posted by Marlen <zm...@facinf.uho.edu.cu>.
hello everyone! I'm using Nutch 1.2 but I'm having a problem with
indexing: it's not indexing .pdf files, the ones that are protected.
Could anyone tell me what's going on? Please!
Re: Relative outlinks without base
Posted by Alexander Aristov <al...@gmail.com>.
Yes, you can stop them, but how do you know whether a URL is good or not?
You can use a URL filter to discard unwanted URLs.
Dedup works to remove old/obsolete content, and you cannot check content
without downloading it.
Best Regards
Alexander Aristov
On 13 September 2011 14:57, Markus Jelsma <ma...@openindex.io> wrote:
> Yes, we use several deduplication mechanisms and they work fine. The
> problem is wasting a lot of CPU cycles for nothing. Why not stop those
> unwanted URLs from entering the CrawlDB in the first place instead of
> getting rid of them afterwards?
>
> Growth of the CrawlDB is very significant, especially with thousands of
> long URLs.
>
> On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> > Hi Markus,
> >
> > Please correct me if I'm wrong, but isn't there a document signature
> > check to detect if the page contains the same content as some other
> > page already parsed and indexed?
> >
> > Dinçer
> >
> > 2011/9/12 Markus Jelsma <ma...@openindex.io>
> >
> > > Hi,
> > >
> > > Would it not be a good idea to patch DomContentUtils with an option
> > > not to consider relative outlinks without a base URL? This example
> > > [1] will quickly take over the CrawlDB and produce countless unique
> > > URLs that cannot be filtered out with the regex that detects
> > > repeating URI segments.
> > >
> > > Many websites on the internet suffer from this problem.
> > >
> > > A patch would protect against this common crawler trap, but not
> > > against incorrect absolute URLs - ones that are supposed to be
> > > absolute but, for example, have an incorrect protocol scheme.
> > >
> > > [1]:
> > > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > >
> > > Cheers,
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
Re: Relative outlinks without base
Posted by Markus Jelsma <ma...@openindex.io>.
Yes, we use several deduplication mechanisms and they work fine. The problem
is wasting a lot of CPU cycles for nothing. Why not stop those unwanted URLs
from entering the CrawlDB in the first place instead of getting rid of them
afterwards?

Growth of the CrawlDB is very significant, especially with thousands of long
URLs.
On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> Hi Markus,
>
> Please correct me if I'm wrong, but isn't there a document signature check
> to detect if the page contains the same content as some other page already
> parsed and indexed?
>
> Dinçer
>
> 2011/9/12 Markus Jelsma <ma...@openindex.io>
>
> > Hi,
> >
> > Would it not be a good idea to patch DomContentUtils with an option
> > not to consider relative outlinks without a base URL? This example
> > [1] will quickly take over the CrawlDB and produce countless unique
> > URLs that cannot be filtered out with the regex that detects
> > repeating URI segments.
> >
> > Many websites on the internet suffer from this problem.
> >
> > A patch would protect against this common crawler trap, but not
> > against incorrect absolute URLs - ones that are supposed to be
> > absolute but, for example, have an incorrect protocol scheme.
> >
> > [1]:
> > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> >
> > Cheers,
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Relative outlinks without base
Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi Markus,
Please correct me if I'm wrong, but isn't there a document signature check
to detect if the page contains the same content as some other page already
parsed and indexed?
Dinçer
2011/9/12 Markus Jelsma <ma...@openindex.io>
> Hi,
>
> Would it not be a good idea to patch DomContentUtils with an option not
> to consider relative outlinks without a base URL? This example [1] will
> quickly take over the CrawlDB and produce countless unique URLs that
> cannot be filtered out with the regex that detects repeating URI
> segments.
>
> Many websites on the internet suffer from this problem.
>
> A patch would protect against this common crawler trap, but not against
> incorrect absolute URLs - ones that are supposed to be absolute but, for
> example, have an incorrect protocol scheme.
>
> [1]:
> http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
>
> Cheers,
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
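For completeness, the signature check Dinçer refers to can be sketched as follows (a minimal illustration of content-hash dedup, not Nutch's actual Signature implementation):

```python
import hashlib

def signature(text):
    """Hash of whitespace-normalized, lowercased page text."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Two distinct trap URLs serving identical content collapse to one
# signature, so the second copy is recognized as a duplicate.
pages = {
    "http://example.com/item/1/": "Programme archive 2011",
    "http://example.com/item/1/item/1/": "Programme  archive 2011",
}
seen = set()
for url, text in pages.items():
    sig = signature(text)
    status = "duplicate" if sig in seen else "new"
    seen.add(sig)
    print(url, status)
```

This answers only half the objection raised in the thread: the duplicate is detected, but only after the page has already been fetched and parsed.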