Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/09/22 12:11:15 UTC

Funky duplicate url's

Hi,

 

This is not about deduplication, but about preventing certain URLs from ending up in the CrawlDB. I'm crawling a news site for testing purposes; it has the usual categories etc. News item pages feature a gray text block that contains some URLs as well. See http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.

 

The problem is that the parser somehow manages to concatenate the href with the inner anchor text (which happens to be a URL, as you can see). As a result, subsequent fetches are completely messed up, and I'm fetching almost nothing but duplicates:

 

fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece

 

This is not desired behavior, as you'd expect. The question is where and how to fix it. Is it a problem with the parser? Or is it fixable using some freaky URL filter for this one?

 

 

Cheers,

Re: Funky duplicate url's, getting much worse!

Posted by Markus Jelsma <ma...@buyways.nl>.

The following regex 

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

prevents URL's such as

http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/

from ending up in the CrawlDB. The problem with the blikopnieuws URLs is that
they don't contain exactly repeating parts. They do have patterns like
http://HOST/path/item/ID_1/item/ID_2, but that's quite a common scheme on the
internet. Adding a regex that filters these occurrences would silently discard
many other valid URLs.

http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
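
To see why the blikopnieuws URLs slip through, here is a small sketch that applies the same pattern outside Nutch. It assumes the filter applies the pattern with "find" semantics; the name is_rejected and the example.com URL are made up for illustration.

```python
import re

# Pattern copied from the rule above, without the leading "-" that marks an
# exclusion in regex-urlfilter.txt.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_rejected(url: str) -> bool:
    """Return True if the loop rule matches somewhere in the URL."""
    return LOOP_RULE.search(url) is not None


# Hypothetical URL in which the segment "/a" recurs at every second position.
looping = "http://example.com/x/a/b/a/b/a/b/page"

# The blikopnieuws URL quoted above: "bericht" recurs, but not in the
# /A/x/A/y/A/ shape the pattern requires.
blik = ("http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/"
        "archief/bericht/119033/bericht/119047/economie")

print(is_rejected(looping))  # True
print(is_rejected(blik))     # False
```

The pattern only fires when one segment comes back at every second position with a slash after its third occurrence, which is why the blikopnieuws URL above passes.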

Thanks for your comments. It looks like I'm stuck with this, at least for now
=)


On Wednesday 29 September 2010 14:58:10 Julien Nioche wrote:
> What I did for similarpages.com was to write a custom URL filter that
> detected repetition of path elements and discarded a URL if it had a path
> occurring more than N times. I don't know what regex AJ suggested but the
> approach above was generic and also quite fast.
> 
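
A rough sketch of the kind of check described above. This is not the similarpages.com code (which was not posted) and it is not written against the Nutch URLFilter API; the function name and the default threshold are assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def has_repeated_segment(url: str, max_repeats: int = 3) -> bool:
    """True if any path segment occurs more than max_repeats times."""
    # Split the path on "/" and ignore empty pieces (leading/trailing slash).
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    return max(Counter(segments).values()) > max_repeats


blik = ("http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/"
        "archief/bericht/119033/bericht/119047/economie")

print(has_repeated_segment(blik, max_repeats=2))  # True, "bericht" occurs 3 times
print(has_repeated_segment(blik, max_repeats=3))  # False
print(has_repeated_segment("http://www.blikopnieuws.nl/nieuwsblok"))  # False
```

The threshold N is a trade-off: set it too low and legitimate URLs with a repeated element are discarded, set it too high and the useless URLs get through for a few more rounds.
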
> We also had other things, like filtering out ridiculously long URLs (not
>  only do they tend to be rubbish but they cause the normalisation to take a
>  lot of CPU) or dynamically generated host names, by splitting on, say, dashes
>  and removing the URL if the hostname had more than N tokens.
> 
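
The two extra checks mentioned above could look roughly like this. Both limits and the function name are assumptions for illustration only; no actual values were given in the thread.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 512   # assumed limit, tune for your crawl
MAX_HOST_TOKENS = 4    # assumed limit, tune for your crawl


def looks_like_rubbish(url: str) -> bool:
    """True if the URL is overly long or its hostname has too many dash tokens."""
    if len(url) > MAX_URL_LENGTH:
        return True
    host = urlsplit(url).hostname or ""
    # Split the hostname on dashes and count the resulting tokens.
    return len(host.split("-")) > MAX_HOST_TOKENS


print(looks_like_rubbish("http://www.trouw.nl/opinie/"))                     # False
print(looks_like_rubbish("http://a-b-c-d-e-f.example.com/"))                 # True
print(looks_like_rubbish("http://example.com/" + "x/" * 400))                # True
```
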
> These are all small tricks but they help controlling the content of the
> crawldb and not waste time trying to fetch rubbish or scanning an
> unnecessarily large number of entries during the generation or update.
> 
> Detecting adult pages is also quite important for large scale crawls as
> these tend to quickly take over the whole crawldb and they generally yield
> an awful lot of outlinks.
> 
> HTH
> 
> Julien
> 
> > Thanks!
> >
> > We're back with the base URL issue. The stuff I `found` in the
> > TestOutlinkExtractor was my own doing. No patch here. Using the
> > ParserChecker, it was clear that the problem came up because the http://
> > URL scheme was not present in some hrefs. The problem is also present when
> > using an ordinary browser, and it can be solved by using the regex AJ
> > supplied.
> >
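
A small illustration of how an href without the http:// scheme is resolved as a relative path, and why the URL grows on every round. It uses Python's standard resolver rather than Nutch, and the exact href value and the trailing slash on the page URL are assumptions, chosen so that the output resembles the fetch log earlier in the thread.

```python
from urllib.parse import urljoin

# Page URL as it appears as a prefix in the fetch log (trailing slash assumed).
PAGE = "http://www.trouw.nl/opinie/columnisten/article2018983.ece/"
# Hypothetical href that lacks the "http://" scheme.
HREF = "www.trouw.nl/nieuws/economie/"


def follow(page: str, href: str, rounds: int) -> list:
    """Resolve the same scheme-less href against each newly found URL."""
    found = []
    current = page
    for _ in range(rounds):
        # Without a scheme, the href is treated as a relative path.
        current = urljoin(current, href)
        found.append(current)
    return found


for url in follow(PAGE, HREF, 3):
    print(url)
```

Each round appends the href once more, because the server keeps answering for the longer URL and the page it returns contains the same link again.
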
> > The problem with the blikopnieuws site (relative URLs without a base URL)
> > remains, though. Check this link: http://www.blikopnieuws.nl/nieuwsblok
> > On the right side you'll see a latest news block with (in the browser)
> > proper URLs. Check the source and you'll see relative URLs. It also, of
> > course, stops working in the browser when you have a trailing slash.
> >
> > Now use the parser checker:
> > bin/nutch org.apache.nutch.parse.ParserChecker http://www.blikopnieuws.nl/nieuwsblok
> >
> > And you'll see that Nutch uses http://www.blikopnieuws.nl/nieuwsblok/ as
> > the base URL for relative URLs, just as the browser does. Everything works
> > as expected because of the relative URLs.
> >
> > The problem is that the website itself is not consistent. It mostly
> > features the URL in the footer without a trailing slash, but from some
> > unknown page I got the same URL with the trailing slash. From there on,
> > everything starts to go wrong.
> >
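
The effect of the trailing slash can be shown with standard relative-URL resolution. This is a sketch using Python's urljoin, not Nutch itself, and the relative hrefs are hypothetical since the site's actual markup is not quoted in the thread.

```python
from urllib.parse import urljoin

HREF = "bericht/119039"  # hypothetical relative href, not taken from the site

# Without a trailing slash, the last path segment of the base is replaced.
no_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok", HREF)
# With a trailing slash, the href is appended below the base path.
with_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok/", HREF)

print(no_slash)    # http://www.blikopnieuws.nl/bericht/119039
print(with_slash)  # http://www.blikopnieuws.nl/nieuwsblok/bericht/119039

# If the resulting page is in turn reached with a trailing slash, another
# relative link (again hypothetical) stacks on top of it.
deeper = urljoin(with_slash + "/", "archief")
print(deeper)      # http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief
```
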
> > To conclude, I got fooled! But how can we prevent this from happening in
> > the future? I could use URL filtering, but that would mean the index
> > already contains garbage, because I cannot filter what I don't know.
> >
> > Cheers,
> >
> > On Wednesday 29 September 2010 11:25:55 Julien Nioche wrote:
> > > Don't know how to run a single test, but if you do ant test you should
> > > be able to find the logs for each individual class in ./build/test, with
> > > a separate log for TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt;
> > > that will be easier than going through a single huge file.
> > >
> > > J.
> > >
> > >
> > > On 29 September 2010 10:11, Markus Jelsma <ma...@buyways.nl> wrote:
> > > Yes, but I need a little more testing. Does anyone know how I can test
> > > only that class? I currently use ant -v test -l logfile and need to dig
> > > through the log file; it also takes too long because of the other tests.
> > >
> > > On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > > > Hi guys,
> > > >
> > > > IIRC the OutlinkExtractor is the same in parse-tika and parse-html.
> > > > Could you please open a JIRA and attach a patch for the
> > > > TestOutlinkExtractor so that we can reproduce the problem?
> > > >
> > > > Thanks
> > > >
> > > > Julien
> > > >
> > > > > Hello Mathijs,
> > > > >
> > > > >
> > > > >
> > > > > I inspected the code base and found that the problem is most likely
> > > > > in the parse-tika code where the text is being extracted and the
> > > > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > > > expression that can output a lot of garbage. I've added a test to
> > > > > the TestOutlinkExtractor where it's clear that at least one URL
> > > > > does not pass, but it does not point me in the right direction for
> > > > > solving the relative path problem.
> > > > >
> > > > > Unless someone knows, I'll try to find out how the OutlinkExtractor
> > > > > works with the current base URL, because just a plain relative URL
> > > > > in the test will obviously fail.
> > > > >
> > > > >
> > > > >
> > > > > Thanks for the pointer =)
> > > > >
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > -----Original message-----
> > > > > From: Mathijs Homminga <ma...@knowlogy.nl>
> > > > > Sent: Tue 28-09-2010 21:01
> > > > > To: user@nutch.apache.org;
> > > > > Subject: Re: Funky duplicate url's, getting much worse!
> > > > >
> > > > > Hi Marcus,
> > > > >
> > > > > I remember Nutch had some trouble honoring the page's BASE
> > > > > tag when resolving relative outlinks.
> > > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > > provided, so that might not be it.
> > > > >
> > > > > Mathijs
> > > > >
> > > > > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > > > Anyone? Where is a proper solution for this issue? As expected,
> > > > > > the regex won't catch all imaginable kinds of funky URLs that
> > > > > > somehow ended up in the CrawlDB. Before the weekend, I added another
> > > > > > news site to the tests I conduct and let it run continuously.
> > > > > > Unfortunately, the generator now comes up with all kinds of
> > > > > > completely useless URLs. They do exist, but that's just the web
> > > > > > application ignoring most parts of the URLs.
> > > > >
> > > > > > This is the URL that should be considered the proper URL:
> > > > > >
> > > > > > http://www.blikopnieuws.nl/nieuwsblok
> > > > > >
> > > > > > Here are two URLs that are completely useless:
> > > > > >
> > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > > > >
> > > > > > It is very hard to use deduplication on these, simply because the
> > > > > > content actually changes too much as time progresses (the latest
> > > > > > news block, for example). It is therefore a necessity to keep these
> > > > > > URLs from ending up in the CrawlDB, so as not to waste disk space,
> > > > > > CrawlDB update time and a huge load of bandwidth. In my current
> > > > > > fetch I'm probably going to waste at least a few GB.
> > > > > >
> > > > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > > > handle relative URLs. It is, of course, quite ugly for a site to do
> > > > > > this, but the parser must not fool itself and come up with URLs that
> > > > > > really aren't there. Combined with the issue I began the thread
> > > > > > with, I believe the following two problems are present. The parser
> > > > > > returns imaginary (false) URLs because of:
> > > > > >
> > > > > > 1. relative hrefs;
> > > > > > 2. URLs in anchors (that is, the XML element's body) next to the
> > > > > >    href attribute.
> > > > > >
> > > > > > Please help in finding the source of the problem (Tika? Nutch?) and
> > > > > > how to proceed in having it fixed, so other users won't waste
> > > > > > bandwidth, disk space and CPU cycles =)
> > > > > >
> > > > > > Oh, here's a snippet of the fetch job that's currently running.
> > > > > > Also, notice the news item with the 119039 ID; it's the same as
> > > > > > above, although that copy/paste was 15 minutes ago. Most item IDs
> > > > > > you see below continue to return in the current log output.
> > > > >
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > > > >
> > > > > > -----Original message-----
> > > > > > From: Markus Jelsma <ma...@buyways.nl>
> > > > > > Sent: Wed 22-09-2010 20:47
> > > > > > To: user@nutch.apache.org;
> > > > > > Subject: RE: Re: Funky duplicate url's
> > > > > >
> > > > > > Thanks! I've already implemented a similar (but not as generic)
> > > > > > regex to catch these URLs. But it is, of course, not a proper
> > > > > > solution to solve a parsing problem with subsequent regexes. Please
> > > > > > correct me if I'm wrong, but I'm quite sure those URLs are not to be
> > > > > > found in the HTML sources. It had better be fixed where the problem
> > > > > > seems to be.
> > > > > >
> > > > > > I'll test your regex, but I'd still like to know where the exact
> > > > > > problem lies and hopefully fix it or help fix it.
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > -----Original message-----
> > > > > > From: AJ Chen <aj...@web2express.org>
> > > > > > Sent: Wed 22-09-2010 20:29
> > > > > > To: user@nutch.apache.org;
> > > > > > Subject: Re: Funky duplicate url's
> > > > > >
> > > > > > The conf/regex-urlfilter.txt file has an exclusion rule that
> > > > > > should skip these viral URLs.
> > > > > >
> > > > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > > > >
> > > > > > -aj
> > > > > >
> > > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <markus.jelsma@buyways.nl> wrote:
> > > > > >> Well, using a regex to catch these troublemakers isn't going to be
> > > > > >> useful. Although I caught the first faulty URLs, there can be many
> > > > > >> more and it's unpredictable; here's just a random pick from the list
> > > > > >> of errors:
> > > > > >>
> > > > > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > > > > >>
> > > > >
> > > > > >> Here's another very disturbing url it's trying to fetch:
> >
> > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/200
> >
> > > > >5/
> > > > > 02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida
> >
> > > > >_li
> > > > > censes_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovony
> >
> > > > >x/h
> > > > > ttp/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.ther
> >
> > > > >egi
> >
> > ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/200
> >
> > > > >5/0
> > > > > 2/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_
> >
> > > > >lic
> > > > > enses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx
> >
> > > > >/ht
> > > > > tp/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.there
> >
> > > > >gis
> >
> > ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005
> >
> > > > >/02
> > > > > /04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_l
> >
> > > > >ice
> > > > > nses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> >
> > > > >htt
> > > > > p/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.thereg
> >
> > > > >ist
> >
> > er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/
> >
> > > > >02/
> > > > > 04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_li
> >
> > > > >cen
> > > > > ses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> >
> > > > >ttp
> > > > > /
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> >
> > > > >ste
> >
> > r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> >
> > > > >2/0
> > > > > 4/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_lic
> >
> > > > >ens
> > > > > es_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> >
> > > > >tp/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> >
> > > > >ter
> > > > > .com/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02
> >
> > > > >/04
> > > > > /elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_lice
> >
> > > > >nse
> > > > > s_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> >
> > > > >p/w
> >
> > ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> >
> > > > >er.
> > > > > com/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/
> >
> > > > >04/
> > > > > elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licen
> >
> > > > >ses
> > > > > _ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> >
> > > > >/ww
> >
> > w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> >
> > > > >r.c
> > > > > om/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/0
> >
> > > > >4/e
> > > > > lpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licens
> >
> > > > >es_
> > > > > ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > > > >www
> > > > > .
> >
> > theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> >
> > > > >.co
> > > > > m/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04
> >
> > > > >/el
> > > > > pida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_license
> >
> > > > >s_o
> > > > >
> > > > >
> > > > >vonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> >
> > > > >w.
> >
> > theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> >
> > > > >com
> > > > > /2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/
> >
> > > > >elp
> > > > > ida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses
> >
> > > > >_ov
> > > > >
> > > > >onyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> >
> > > > >.t
> >
> > heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> >
> > > > >om/
> > > > > 2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/e
> >
> > > > >lpi
> > > > > da_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_
> >
> > > > >ovo
> > > > >
> > > > >nyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> >
> > > > >th
> >
> > eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> >
> > > > >m/2
> > > > > 005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/el
> >
> > > > >pid
> > > > > a_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_o
> >
> > > > >von
> > > > >
> > > > >yx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> >
> > > > >he
> >
> > register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> >
> > > > >/20
> > > > > 05/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elp
> >
> > > > >ida
> > > > > _licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ov
> >
> > > > >ony x/
> > > > >
> > > > > >> It seems these bad URLs are somehow found by the parser and get
> > > > > >> fetched the next time, and the next time, making the URL grow longer
> > > > > >> and longer with each fetch, parse and updateDB cycle.
> > > > > >>
> > > > > >> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > > > > >>
> > > > >
> > > > > >> This doesn't look good at all. Anyone got a suggestion or some
> > > > > >> pointer?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> -----Original message-----
> > > > > >> From: Markus Jelsma <ma...@buyways.nl>
> > > > > >> Sent: Wed 22-09-2010 12:12
> > > > > >> To: user@nutch.apache.org;
> > > > > >> Subject: Funky duplicate url's
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> This is not about deduplication, but about preventing certain URLs
> > > > > >> from ending up in the CrawlDB. I'm crawling a news site for testing
> > > > > >> purposes; it has the usual categories etc. News item pages feature a
> > > > > >> gray text block that contains some URLs as well. See
> > > > > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > > > > >> example.
> > > > > >>
> > > > > >> The problem is that the parser somehow manages to concatenate the
> > > > > >> href with the inner anchor text (which happens to be a URL, as you
> > > > > >> can see). As a result, subsequent fetches are completely messed up,
> > > > > >> and I'm fetching almost nothing but duplicates:
> > > > > >>
> > > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > > > >
> > > > > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> >
> > > > >op
> > > > > inie/weblogs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/
> >
> > > > >www
> > > > > .
> >
> > trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opin
> >
> > > > >ie/
> > > > > weblogs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> >
> > > > >ouw
> > > > > .nl/nieuws/economie/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/we
> >
> > > > >blo
> > > > >
> > > > >
> > > > >gs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl
> >
> > > > >/o
> > > > >
> > > > >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > > > >
> > > > > >> This is not desired behavior, as you'd expect. The question is,
> > > > > >> where to fix and how to fix it? Is it a problem with the parser?
> >
> > Or
> >
> > > > > >> is it fixable using some freaky url filter for this one?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> Cheers,
> > > > > >
> > > > > > --
> > > > > > AJ Chen, PhD
> > > > > > Chair, Semantic Web SIG, sdforum.org
> > > > > > http://web2express.org
> > > > > > twitter @web2express
> > > > > > Palo Alto, CA, USA
> > >
> > > Markus Jelsma - Technisch Architect - Buyways BV
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Funky duplicate url's, getting much worse!

Posted by Julien Nioche <li...@gmail.com>.

What I did for similarpages.com was to write a custom URL filter that
detected repetition of path elements and discarded a URL if any path
element occurred more than N times. I don't know what regex AJ suggested,
but the approach above was generic and also quite fast.
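
A minimal sketch of the repeated-path-element check described above. This is
not the similarpages.com code; the class name, method name and the maxRepeats
parameter are made up for illustration, and wiring it into a Nutch URLFilter
plugin is left out.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
final class RepeatedSegmentCheck {

    // Returns true if any single path element occurs more than maxRepeats
    // times in the URL's path.
    static boolean hasRepeatedSegment(String url, int maxRepeats) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            // Unparseable URL: this check makes no decision about it.
            return false;
        }
        if (path == null) {
            return false; // opaque URI such as mailto:
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            int seen = counts.merge(segment, 1, Integer::sum);
            if (seen > maxRepeats) {
                return true;
            }
        }
        return false;
    }
}
```

With maxRepeats set to 2, a URL such as
http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto is flagged
because "rss" occurs three times, while http://www.blikopnieuws.nl/nieuwsblok
passes. Counting elements regardless of position also catches repeats that
are not adjacent.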

We also had other things like filtering out ridiculously long URLs (not only
do they tend to be rubbish, but they also cause the normalisation to take a
lot of CPU) and dynamically generated host names: we split the hostname on,
say, dashes and removed the URL if the hostname had more than N tokens.
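
The two extra checks could look roughly like this. Again only a sketch under
assumptions: the class and method names and both thresholds are invented
here, and the example hostname in the note below is hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch.
final class UrlSanityCheck {

    // True if the URL string is longer than maxLength characters.
    static boolean isTooLong(String url, int maxLength) {
        return url.length() > maxLength;
    }

    // True if the hostname, split on dashes, has more than maxTokens tokens.
    static boolean hasTooManyHostTokens(String url, int maxTokens) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: no decision made here
        }
        if (host == null) {
            return false; // no hostname to inspect
        }
        return host.split("-").length > maxTokens;
    }
}
```

For example, hasTooManyHostTokens("http://cheap-pills-online-buy-now.example.com/", 3)
is true (five dash-separated tokens), while the same call for
http://www.trouw.nl/ is false. Checking the length first is cheap and avoids
parsing very long strings at all.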

These are all small tricks but they help control the content of the
crawldb and avoid wasting time fetching rubbish or scanning an
unnecessarily large number of entries during the generation or update.

Detecting adult pages is also quite important for large scale crawls as
these tend to quickly take over the whole crawldb and they generally yield
an awful lot of outlinks.

HTH

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Funky duplicate url's, getting much worse!

Posted by Markus Jelsma <ma...@buyways.nl>.

Thanks!

We're back with the base URL issue. The stuff I `found` in the
TestOutlinkExtractor was my own doing. No patch here. Using the ParserChecker
it was clear that the problem came up because the http:// URL scheme was not
present in some hrefs. The problem is also present when using an ordinary
browser and it can be solved by using the regex AJ supplied.
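
To illustrate the missing-scheme effect, here is a small sketch using plain
java.net.URI rather than Nutch code. The page URL in the note below is
hypothetical. The pattern is AJ's exclusion rule as quoted earlier in this
thread, without the leading "-" marker, and applying it with find() is an
assumption about how the filter evaluates rules.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
final class SchemelessHrefDemo {

    // AJ's rule: a slash-delimited segment recurring at every other position.
    static final Pattern REPEATED =
            Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Resolves an href against the page URL the way a browser would.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // True if the rule above would exclude the URL.
    static boolean wouldBeExcluded(String url) {
        return REPEATED.matcher(url).find();
    }
}
```

Resolving the href "www.microsoft.com/office/" against a page at
http://www.nrc.nl/krant/ gives
http://www.nrc.nl/krant/www.microsoft.com/office/, because without http://
the href is just a relative path. Resolving the same href against that result
appends it again, and so on for every fetch cycle. In this sketch the rule
does not match after two rounds but does after the third, so the first
generations of such URLs still reach the CrawlDB.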

The problem with the blikopnieuws site (relative URL's without base URL) 
remains, though. Check this link http://www.blikopnieuws.nl/nieuwsblok
On the right side you'll see a latest news block with (in the browser) proper 
URL's. Check the source and you'll see relative URL's. It, of course, also 
stops working in the browser when you have a trailing slash.

Now use the parser checker:
bin/nutch org.apache.nutch.parse.ParserChecker http://www.blikopnieuws.nl/nieuwsblok

And you'll see that Nutch uses http://www.blikopnieuws.nl/nieuwsblok/ as base 
URL for relative URL's, just as the browser does. Everything works as expected 
because of the relative URL's.

The problem is, the website itself is not consistent. It mostly features the 
URL in the footer without a trailing slash, but from some unknown page I got 
the same URL with the trailing slash. From there on, everything starts to go 
wrong.
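
The effect of that one trailing slash can be sketched with java.net.URI. This
is standard relative-reference resolution, not Nutch internals, and the
relative href "bericht/119039/" is an assumed example of what the news block
might contain.

```java
import java.net.URI;

// Hypothetical demo class, not part of Nutch.
final class TrailingSlashDemo {

    // Resolves a relative href against the URL of the page it was found on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }
}
```

resolve("http://www.blikopnieuws.nl/nieuwsblok", "bericht/119039/") yields
http://www.blikopnieuws.nl/bericht/119039/, whereas the same href resolved
against http://www.blikopnieuws.nl/nieuwsblok/ yields
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/. If the web application
answers that longer URL with the same page, every relative link on it is
resolved one level deeper again, which matches the growing URL's in the fetch
log.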

To conclude, i got fooled! But how can we in the future prevent this from 
happening? I could use url filtering but that would mean the index already 
contains garbage because i cannot filter what i don't know.

Cheers,

On Wednesday 29 September 2010 11:25:55 Julien Nioche wrote:
> Don't know how to run a single test but if you do ant test you should be
>  able to find the logs for each individual class in ./build/test with a
>  separate log for TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt that
>  will be easier than going through a single huge file
> 
> J.
> 
> 
> On 29 September 2010 10:11, Markus Jelsma <ma...@buyways.nl> wrote:
> Yes but i need a little more testing. Anyone knows how i can only test that
> class? I currently use ant -v test -l logfile and need to dig through the
>  log file, also, it takes too long because of other tests.
> 
> On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > Hi guys,
> >
> > IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
> > you please open a JIRA and attach a patch for the TestOutlinkExtractor so
> > that we can reproduce the problem?
> >
> > Thanks
> >
> > Julien
> >
> > > Hello Mathijs,
> > >
> > >
> > >
> > > I inspected the code base and found that the problem is most likely in
> > > the parse-tika code where the text is being extracted and the
> > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > expression that can output a lot of garbage. I've added a test to the
> > > TestOutlinkExtractor where it's clear that at least one URL does not
> > > pass but it does not point me in the right direction for solving the
> > > relative path problem.
> > >
> > >
> > >
> > > Unless someone knows, i'll try to find out how the OutlinkExtractor
> > > works with the current base URL because just a plain relative URL in
> > > the test will obviously fail.
> > >
> > >
> > >
> > > Thanks for the pointer =)
> > >
> > >
> > >
> > > Cheers,
> > >
> > > -----Original message-----
> > > From: Mathijs Homminga <ma...@knowlogy.nl>
> > > Sent: Tue 28-09-2010 21:01
> > > To: user@nutch.apache.org;
> > > Subject: Re: Funky duplicate url's, getting much worse!
> > >
> > > Hi Marcus,
> > >
> > > I remember Nutch had some troubles with honoring the page's BASE tag
> > > when resolving relative outlinks.
> > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > provide so that might not be it.
> > >
> > > Mathijs
> > >
> > > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > > Anyone? Where is a proper solution for this issue? As expected, the
> > > > > regex won't catch all imaginable kinds of funky URL's that somehow
> > > > > ended up in the CrawlDB. Before the weekend, i added another news
> > > > > site to the tests i conduct and let it run continuously.
> > > > > Unfortunately, the generator now comes up with all kinds of
> > > > > completely useless URL's, although they do exist but that's just the
> > > > > web application ignoring most parts of the URL's.
> > > > >
> > > > > This is the URL that should be considered as proper URL:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok
> > > > >
> > > > > Here are two URL's that are completely useless:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > > > >
> > > > > It is very hard to use deduplication on these simply because the
> > > > > content actually changes too much as time progresses - the latest
> > > > > news block for example. It is therefore a necessity to keep these
> > > > > URL's from ending up in the CrawlDB, so as not to waste disk space,
> > > > > update time of the CrawlDB and a huge load of bandwidth - i'm in my
> > > > > current fetch probably going to waste at least a few GB's.
> > > > >
> > > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > > handle relative URL's. It is, of course, quite ugly for a site to do
> > > > > this but the parser must not fool itself and come up with URL's that
> > > > > really aren't there. Combined with the issue i began the thread with
> > > > > i believe the following two problems are present - the parser
> > > > > returns imaginary (false) URL's because of:
> > > > >
> > > > > 1. relative href's;
> > > > >
> > > > > 2. URL's in anchors (that is the XML element's body) next to the href
> > > > > attribute.
> > > > >
> > > > > Please help in finding the source of the problem (Tika? Nutch?) and
> > > > > how to proceed in having it fixed so other users won't waste
> > > > > bandwidth, disk space and CPU cycles =)
> > > > >
> > > > > Oh, here's a snippet of the fetch job that's currently running, also,
> > > > > notice the news item with the 119039 ID, it's the same as above
> > > > > although that copy/paste was 15 minutes ago. Most item ID's you see
> > > > > below continue to return in the current log output.
> > >
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > >
> > > > -----Original message-----
> > > > From: Markus Jelsma <ma...@buyways.nl>
> > > > Sent: Wed 22-09-2010 20:47
> > > > To: user@nutch.apache.org;
> > > > Subject: RE: Re: Funky duplicate url's
> > > >
> > > > > Thanks! I've already implemented a similar (but not as generic) regex
> > > > > to catch these url's. But it is, of course, not a proper solution to
> > > > > solve a parsing problem with subsequent regex's. Please, correct me if
> > > > > i'm wrong, but i'm quite sure those url's are not to be found in the
> > > > > HTML sources. It'd better be fixed where the problem seems to be.
> > > > >
> > > > > I'll test your regex but i'd still like to know where the exact
> > > > > problem lies and hopefully fix or help fixing it.
> > >
> > > > Thanks
> > > >
> > > > -----Original message-----
> > > > From: AJ Chen <aj...@web2express.org>
> > > > Sent: Wed 22-09-2010 20:29
> > > > To: user@nutch.apache.org;
> > > > Subject: Re: Funky duplicate url's
> > > >
> > > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > > skip these viral urls.
> > > >
> > > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > >
> > > > -aj
> > > >
> > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma
> > > > > <markus.jelsma@buyways.nl> wrote:
> > > > >> Well, using a regex to catch these troublemakers isn't going to be
> > > > >> useful. Although i caught the first faulty url's, there can be many
> > > > >> more and it's unpredictable; here's just a random pick from the list
> > > > >> of errors:
> > >
> > > > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > >
> > > >> Here's another very disturbing url it's trying to fetch:
> > >
> > > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/200
> > >5/
> > > 02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida
> > >_li
> > > censes_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovony
> > >x/h
> > > ttp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.ther
> > >egi
> > > ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/200
> > >5/0
> > > 2/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_
> > >lic
> > > enses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx
> > >/ht
> > > tp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.there
> > >gis
> > > ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005
> > >/02
> > > /04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_l
> > >ice
> > > nses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> > >htt
> > > p/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.thereg
> > >ist
> > > er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/
> > >02/
> > > 04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_li
> > >cen
> > > ses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> > >ttp
> > > /www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> > >ste
> > > r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> > >2/0
> > > 4/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lic
> > >ens
> > > es_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> > >tp/
> > > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> > >ter
> > > .com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
> > >/04
> > > /elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lice
> > >nse
> > > s_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> > >p/w
> > > ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> > >er.
> > > com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/
> > >04/
> > > elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licen
> > >ses
> > > _ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> > >/ww
> > > w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> > >r.c
> > > om/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/0
> > >4/e
> > > lpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licens
> > >es_
> > > ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> > >www
> > > .theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> > >.co
> > > m/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04
> > >/el
> > > pida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_license
> > >s_o
> > >
> > >
> > >vonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> > >w.
> > > theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> > >com
> > > /2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/
> > >elp
> > > ida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses
> > >_ov
> > >
> > >onyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> > >.t
> > > heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> > >om/
> > > 2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/e
> > >lpi
> > > da_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_
> > >ovo
> > >
> > >nyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> > >th
> > > eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> > >m/2
> > > 005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/el
> > >pid
> > > a_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_o
> > >von
> > >
> > >yx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> > >he
> > > register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> > >/20
> > > 05/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elp
> > >ida
> > > _licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ov
> > >ony x/
> > >
> > > > >> It seems these bad url's are somehow found by the parser and get
> > > > >> fetched the next time, and the next time, making the url grow longer
> > > > >> and longer for each fetch and parse and updateDB cycle.
> > >
> > > http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1
> > >99
> > > 9/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/
> > >www
> > > .microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> > >/ww
> > > w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> > >e/w
> > > ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> > >ce/
> > > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> > >ice
> > > /www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> > >fic
> > > e/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> > >ffi
> > > ce/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/
> > >off
> > > ice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> > >/of
> > >
> > >fice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> > >/o
> > >
> > >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > >
> > > >> This doesn't look good at all. Anyone got a suggestion or some
> > > >> pointer?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> -----Original message-----
> > > >> From: Markus Jelsma <ma...@buyways.nl>
> > > >> Sent: Wed 22-09-2010 12:12
> > > >> To: user@nutch.apache.org;
> > > >> Subject: Funky duplicate url's
> > > >>
> > > >> Hi,
> > > >>
> > > >>
> > > >>
> > > > >> This is not about deduplication, but about preventing certain url's
> > > > >> from ending up in the CrawlDB. I'm crawling a news site for testing
> > > > >> purposes, it has the usual categories etc. News item pages feature a
> > > > >> gray text block that's got some url's as well. See
> > > > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > > > >> example.
> > > > >>
> > > > >> The problem is, the parser somehow manages to concatenate the href
> > > > >> with the inner anchor text (which happens to be an url as you can
> > > > >> see). So, subsequent fetches are completely messed up, i'm almost
> > > > >> only fetching duplicates:
> > > >>
> > > >>
> > > >>
> > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > >
> > > >> This is not desired behavior, as you'd expect. The question is,
> > > >> where to fix and how to fix it? Is it a problem with the parser? Or
> > > >> is it fixable using some freaky url filter for this one?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Cheers,
> > > >
> > > > --
> > > > AJ Chen, PhD
> > > > Chair, Semantic Web SIG, sdforum.org
> > > > http://web2express.org
> > > > twitter @web2express
> > > > Palo Alto, CA, USA
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Funky duplicate url's, getting much worse!

Posted by Julien Nioche <li...@gmail.com>.

Don't know how to run a single test but if you do *ant test* you should be
able to find the logs for each individual class in ./build/test with a
separate log for *TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt*
that will be easier than going through a single huge file

J.


On 29 September 2010 10:11, Markus Jelsma <ma...@buyways.nl> wrote:

> Yes but i need a little more testing. Does anyone know how i can test only
> that class? I currently use ant -v test -l logfile and need to dig through
> the log file; also, it takes too long because of the other tests.
>
>
> On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > Hi guys,
> >
> > IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
> > you please open a JIRA and attach a patch for the TestOutlinkExtractor so
> > that we can reproduce the problem?
> >
> > Thanks
> >
> > Julien
> >
> > > Hello Mathijs,
> > >
> > >
> > >
> > > > I inspected the code base and found that the problem is most likely in
> > > > the parse-tika code where the text is being extracted and the
> > > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > > expression that can output a lot of garbage. I've added a test to the
> > > > TestOutlinkExtractor where it's clear that at least one URL does not
> > > > pass but it does not point me in the right direction for solving the
> > > > relative path problem.
> > > >
> > > >
> > > >
> > > > Unless someone knows, i'll try to find out how the OutlinkExtractor
> > > > works with the current base URL because just a plain relative URL in
> > > > the test will obviously fail.
> > >
> > >
> > >
> > > Thanks for the pointer =)
> > >
> > >
> > >
> > > Cheers,
> > >
> > > -----Original message-----
> > > From: Mathijs Homminga <ma...@knowlogy.nl>
> > > Sent: Tue 28-09-2010 21:01
> > > To: user@nutch.apache.org;
> > > Subject: Re: Funky duplicate url's, getting much worse!
> > >
> > > Hi Marcus,
> > >
> > > > I remember Nutch had some troubles with honoring the page's BASE tag
> > > > when resolving relative outlinks.
> > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > provide so that might not be it.
> > >
> > > Mathijs
> > >
> > > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > > Anyone? Where is a proper solution for this issue? As expected, the
> > > > > regex won't catch all imaginable kinds of funky URL's that somehow
> > > > > ended up in the CrawlDB. Before the weekend, i added another news
> > > > > site to the tests i conduct and let it run continuously.
> > > > > Unfortunately, the generator now comes up with all kinds of
> > > > > completely useless URL's, although they do exist but that's just the
> > > > > web application ignoring most parts of the URL's.
> > > > >
> > > > > This is the URL that should be considered as proper URL:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok
> > > > >
> > > > > Here are two URL's that are completely useless:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > > > >
> > > > > It is very hard to use deduplication on these simply because the
> > > > > content actually changes too much as time progresses - the latest
> > > > > news block for example. It is therefore a necessity to keep these
> > > > > URL's from ending up in the CrawlDB, so as not to waste disk space,
> > > > > update time of the CrawlDB and a huge load of bandwidth - i'm in my
> > > > > current fetch probably going to waste at least a few GB's.
> > > > >
> > > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > > handle relative URL's. It is, of course, quite ugly for a site to do
> > > > > this but the parser must not fool itself and come up with URL's that
> > > > > really aren't there. Combined with the issue i began the thread with
> > > > > i believe the following two problems are present - the parser
> > > > > returns imaginary (false) URL's because of:
> > > > >
> > > > > 1. relative href's;
> > > > >
> > > > > 2. URL's in anchors (that is the XML element's body) next to the href
> > > > > attribute.
> > > > >
> > > > > Please help in finding the source of the problem (Tika? Nutch?) and
> > > > > how to proceed in having it fixed so other users won't waste
> > > > > bandwidth, disk space and CPU cycles =)
> > > > >
> > > > > Oh, here's a snippet of the fetch job that's currently running, also,
> > > > > notice the news item with the 119039 ID, it's the same as above
> > > > > although that copy/paste was 15 minutes ago. Most item ID's you see
> > > > > below continue to return in the current log output.
> > >
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > >
> > > > -----Original message-----
> > > > From: Markus Jelsma <ma...@buyways.nl>
> > > > Sent: Wed 22-09-2010 20:47
> > > > To: user@nutch.apache.org;
> > > > Subject: RE: Re: Funky duplicate url's
> > > >
> > > > Thanks! I've already implemented a similar (but not as generic) regex
> > > > to
> > >
> > > catch these url's. But it is, of course, not a proper solution to solve
> a
> > > parsing problem with subsequent regex's. Please, correct me if i'm
> wrong,
> > > but i'm quite sure those url's are not to be found in the HTML sources.
> > > I'd better to be fixed where the problem seems to be.
> > >
> > > > I'll test your regex but i'd still like to know where the exact
> problem
> > >
> > > lies and hopefully fix or help fixing it.
> > >
> > > > Thanks
> > > >
> > > > -----Original message-----
> > > > From: AJ Chen <aj...@web2express.org>
> > > > Sent: Wed 22-09-2010 20:29
> > > > To: user@nutch.apache.org;
> > > > Subject: Re: Funky duplicate url's
> > > >
> > > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > > skip these viral urls.
> > > >
> > > > # skip URLs with slash-delimited segment that repeats 3+ times, to
> > > > break loops
> > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > >
> > > > -aj
> > > >
> > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma
> > > > <markus.jelsma@buyways.nl
> > > >
> > > >wrote:
> > > >> Well, using a regex to catch these troublemakers isn't going to be
> > >
> > > useful.
> > >
> > > >> Although i caught the first faulty url's, there can be many more and
> > >
> > > it's
> > >
> > > >> unpredictable; here's just a random pick from the list of errors:
> > >
> > >
> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is
> > >/Key-Sectors/Data-Centers-in-Iceland/
> www.invest.is/Key-Sectors/Data-Center
> > >s-in-Iceland/
> www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.
> > >is/Key-Sectors/Data-Centers-in-Iceland/
> www.invest.is/Key-Sectors/Data-Cent
> > >ers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > >
> > > >> Here's another very disturbing url it's trying to fetch:
> > >
> > >
> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/
> > >02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_li
> > >censes_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> > >ttp/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> > >
> ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> > >2/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_lic
> > >enses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> > >tp/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> > >
> ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
> > >/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_lice
> > >nses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> > >p/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> > >
> er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/
> > >04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licen
> > >ses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> > >/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> > >
> r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/0
> > >4/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licens
> > >es_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> > >
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> > >.com/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04
> > >/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_license
> > >s_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/w
> > >
> ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> > >com/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/
> > >elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses
> > >_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> > >
> w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> > >om/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/e
> > >lpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_
> > >ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> > >.
> theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> > >m/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/el
> > >pida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_o
> > >vonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> > >
> theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> > >/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elp
> > >ida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ov
> > >onyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> > >
> heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/
> > >2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpi
> > >da_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovo
> > >nyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.th
> > >
> eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2
> > >005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpid
> > >a_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovon
> > >yx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.the
> > >
> register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/20
> > >05/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida
> > >_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovony
> > >x/
> > >
> > > >> I'm seems these bad url's are somehow found by the parser and get
> > >
> > > fetched
> > >
> > > >> the next time, and the next time making the url grow longer and
> longer
> > >
> > > for
> > >
> > > >> each fetch and parse and updateDB cycle.
> > >
> > >
> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_199
> > >9/article1513468.ece/
> www.microsoft.com/office/www.microsoft.com/office/www
> > >.
> microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/ww
> > >
> w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/w
> > >
> ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/
> > >
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> > >/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> > >e/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> > >ce/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> > >ice/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> > >fice/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> > >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > >
> > > >> This doesn't look good at all. Anyone got a suggestion or some
> > > >> pointer?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> -----Original message-----
> > > >> From: Markus Jelsma <ma...@buyways.nl>
> > > >> Sent: Wed 22-09-2010 12:12
> > > >> To: user@nutch.apache.org;
> > > >> Subject: Funky duplicate url's
> > > >>
> > > >> Hi,
> > > >>
> > > >>
> > > >>
> > > >> This is not about deduplication, but about preventing certain url's
> to
> > >
> > > end
> > >
> > > >> up in the CrawlDB. I'm crawling a news site for testing purposes, it
> > > >> has
> > >
> > > the
> > >
> > > >> usual categories etc. News item pages feature a gray text block
> that's
> > >
> > > got
> > >
> > > >> some url's as well. See
> > > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > >
> > > example.
> > >
> > > >> The problem is, the parser somehow manages to concatenate the href
> > > >> with
> > >
> > > the
> > >
> > > >> inner anchor text (which happens to be an url as you can see). So,
> > > >> subsequent fetches are completely messed up, i'm almost only
> fetching
> > > >> duplicates:
> > > >>
> > > >>
> > > >>
> > > >> fetching
> > >
> > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> > >euws/economie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.
> > >
> trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/e
> > >conomie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.
> > >nl/opinie/weblogs/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/
> > >
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieu
> > >ws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > >
> > > >> fetching
> > >
> > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> > >euws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www
> > >.
> trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> > >weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> > >.nl/nieuws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/econo
> > >mie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/
> > >nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > >
> > > >> fetching
> > >
> > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/op
> > >inie/weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www
> > >.
> trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> > >weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> > >.nl/nieuws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblo
> > >gs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/o
> > >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > >
> > > >> This is not desired behavior, as you'd expect. The question is,
> where
> > > >> to fix and how to fix it? Is it a problem with the parser? Or is it
> > > >> fixable using some freaky url filter for this one?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Cheers,
> > > >
> > > > --
> > > > AJ Chen, PhD
> > > > Chair, Semantic Web SIG, sdforum.org
> > > > http://web2express.org
> > > > twitter @web2express
> > > > Palo Alto, CA, USA
> >
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Funky duplicate url's, getting much worse!

Posted by Markus Jelsma <ma...@buyways.nl>.

Yes, but I need a little more testing first. Does anyone know how I can run
only the test for that class? I currently use ant -v test -l logfile and then
have to dig through the log file; it also takes too long because of the other
tests.


On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> Hi guys,
> 
> IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
> you please open a JIRA and attach a patch for the TestOutlinkExtractor so
> that we can reproduce the problem?
> 
> Thanks
> 
> Julien
> 
> > Hello Mathijs,
> >
> >
> >
> > I inspected the code base and found that the problem is most likely in
> > the parse-tika code where the text is being extracted and the
> > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > expression that can output a lot of garbage. I've added a test to the
> > TestOutlinkExtractor where it's clear that at least one URL does not pass
> > but it does not point me in the right direction for solving the relative
> > path problem.
> >
> >
> >
> > Unless someone knows, i'll try to find out how the OutlinkExtractor works
> > with the current base URL because just a plain relative URL in the test
> > will obviously fail.
> >
> >
> >
> > Thanks for the pointer =)
> >
> >
> >
> > Cheers,
> >
> > -----Original message-----
> > From: Mathijs Homminga <ma...@knowlogy.nl>
> > Sent: Tue 28-09-2010 21:01
> > To: user@nutch.apache.org;
> > Subject: Re: Funky duplicate url's, getting much worse!
> >
> > Hi Marcus,
> >
> > I remember Nutch had some troubles with honoring the page's BASE tag when
> > resolving relative outlinks.
> > However, I don't see this BASE tag being used in the HTML pages you
> > provided, so that might not be it.
> >
> > Mathijs
> >
> > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > Anyone? Where is a proper solution for this issue? As expected, the
> > > regex won't catch all imaginable kinds of funky URL's that somehow
> > > ended up in the CrawlDB. Before the weekend, i added another news site
> > > to the tests i conduct and let it run continuously. Unfortunately, the
> > > generator now comes up with all kinds of completely useless URL's,
> > > although they do exist but that's just the web application ignoring
> > > most parts of the URL's.
> >
> > > This is the URL that should be considered as proper URL:
> > >
> > > http://www.blikopnieuws.nl/nieuwsblok
> > >
> > >
> > >
> > > Here are two URL's that are completely useless:
> > >
> > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > >
> > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> >
> > > It is very hard to use deduplication on these, simply because the
> > > content actually changes too much as time progresses - the latest news
> > > block, for example. It is therefore necessary to keep these URL's from
> > > ending up in the CrawlDB, so as not to waste disk space, CrawlDB update
> > > time and a huge load of bandwidth - in my current fetch i'm probably
> > > going to waste at least a few GB's.
> >
> > > Looking at the HTML source, it looks like the parser cannot properly
> > > handle relative URL's. It is, of course, quite ugly for a site to do
> > > this, but the parser must not fool itself and come up with URL's that
> > > really aren't there. Combined with the issue i began the thread with, i
> > > believe the following two problems are present - the parser returns
> > > imaginary (false) URL's because of:
> > >
> > > 1. relative href's;
> > >
> > > 2. URL's in anchors (that is, the XML element's body) next to the href
> > > attribute.
> >
> > > Please help in finding the source of the problem (Tika? Nutch?) and how
> > > to proceed in having it fixed so other users won't waste bandwidth,
> > > disk space and CPU cycles =)
> >
> > > Oh, here's a snippet of the fetch job that's currently running. Also
> > > notice the news item with the 119039 ID: it's the same as above,
> > > although that copy/paste was 15 minutes ago. Most item ID's you see
> > > below continue to return in the current log output.
> >
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> >
> > > -----Original message-----
> > > From: Markus Jelsma <ma...@buyways.nl>
> > > Sent: Wed 22-09-2010 20:47
> > > To: user@nutch.apache.org;
> > > Subject: RE: Re: Funky duplicate url's
> > >
> > > Thanks! I've already implemented a similar (but not as generic) regex
> > > to catch these url's. But it is, of course, not a proper solution to
> > > solve a parsing problem with subsequent regex's. Please correct me if
> > > i'm wrong, but i'm quite sure those url's are not to be found in the
> > > HTML sources. It'd be better to fix this where the problem seems to be.
> > >
> > > I'll test your regex but i'd still like to know where the exact problem
> > > lies and hopefully fix or help fixing it.
> >
> > > Thanks
> > >
> > > -----Original message-----
> > > From: AJ Chen <aj...@web2express.org>
> > > Sent: Wed 22-09-2010 20:29
> > > To: user@nutch.apache.org;
> > > Subject: Re: Funky duplicate url's
> > >
> > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > skip these viral urls.
> > >
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to
> > > break loops
> > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > >
> > > -aj
> > >
> > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma
> > > <markus.jelsma@buyways.nl
> > >
> > >wrote:
> > >> Well, using a regex to catch these troublemakers isn't going to be
> > >> useful. Although i caught the first faulty url's, there can be many
> > >> more and it's unpredictable; here's just a random pick from the list
> > >> of errors:
> > >>
> > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> >
> > >> Here's another very disturbing url it's trying to fetch:
> >
> > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/
> >02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_li
> >censes_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> >ttp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> >ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> >2/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lic
> >enses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> >tp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> >ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
> >/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lice
> >nses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> >p/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> >er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/
> >04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licen
> >ses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> >/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> >r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/0
> >4/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licens
> >es_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> >www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> >.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04
> >/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_license
> >s_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/w
> >ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> >com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/
> >elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses
> >_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> >w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> >om/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/e
> >lpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_
> >ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> >.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> >m/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/el
> >pida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_o
> >vonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> >theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> >/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elp
> >ida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ov
> >onyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> >heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/
> >2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpi
> >da_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovo
> >nyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.th
> >eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2
> >005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpid
> >a_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovon
> >yx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.the
> >register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/20
> >05/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida
> >_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovony
> >x/
> >
> > >> It seems these bad url's are somehow found by the parser and get
> > >> fetched the next time, and the next time, making the url grow longer
> > >> and longer for each fetch, parse and updateDB cycle.
> >
> > http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_199
> >9/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www
> >.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/ww
> >w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/w
> >ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/
> >www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> >/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> >e/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> >ce/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> >ice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> >fice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
> >
> > >> This doesn't look good at all. Anyone got a suggestion or some
> > >> pointer?
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -----Original message-----
> > >> From: Markus Jelsma <ma...@buyways.nl>
> > >> Sent: Wed 22-09-2010 12:12
> > >> To: user@nutch.apache.org;
> > >> Subject: Funky duplicate url's
> > >>
> > >> Hi,
> > >>
> > >>
> > >>
> > >> This is not about deduplication, but about preventing certain url's to
> > >> end up in the CrawlDB. I'm crawling a news site for testing purposes,
> > >> it has the usual categories etc. News item pages feature a gray text
> > >> block that's got some url's as well. See
> > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > >> example.
> > >>
> > >> The problem is, the parser somehow manages to concatenate the href
> > >> with the inner anchor text (which happens to be an url as you can
> > >> see). So, subsequent fetches are completely messed up, i'm almost only
> > >> fetching duplicates:
> > >>
> > >>
> > >>
> > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> >euws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.
> >trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/e
> >conomie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.
> >nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/
> >www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieu
> >ws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> >
> > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> >euws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www
> >.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> >weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> >.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/econo
> >mie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/
> >nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> >
> > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/op
> >inie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www
> >.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> >weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> >.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblo
> >gs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/o
> >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> >
> > >> This is not desired behavior, as you'd expect. The question is, where
> > >> to fix and how to fix it? Is it a problem with the parser? Or is it
> > >> fixable using some freaky url filter for this one?
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Cheers,
> > >
> > > --
> > > AJ Chen, PhD
> > > Chair, Semantic Web SIG, sdforum.org
> > > http://web2express.org
> > > twitter @web2express
> > > Palo Alto, CA, USA
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
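
The exclusion rule from conf/regex-urlfilter.txt quoted above can be tried
against a few URLs from this thread with a small stand-alone sketch. It does
not use any Nutch classes: the class and method names are made up for
illustration, the pattern is the quoted rule without its leading '-', and it
is an assumption here that the filter treats a URL as excluded when the
pattern is found anywhere in it. The long URLs are shortened copies of the
ones posted earlier.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks URLs against the quoted rule.
class LoopRuleSketch {
    // Quoted rule "-.*(/[^/]+)/[^/]+\1/[^/]+\1/" without the leading '-'.
    static final Pattern LOOP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: a URL is excluded when the pattern occurs anywhere in it.
    static boolean isExcluded(String url) {
        return LOOP.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            // the same segment comes back at every other position
            "http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/"
                + "article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/"
                + "www.microsoft.com/office/antivirus",
            // a block of six segments repeats
            "http://www.nrc.nl/krant/article1860140.ece"
                + "/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx"
                + "/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx"
                + "/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/",
            // segments recur, but at irregular positions
            "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/"
                + "bericht/119033/bericht/119047/economie",
            // the proper URL
            "http://www.blikopnieuws.nl/nieuwsblok"
        };
        for (String url : urls) {
            System.out.println(isExcluded(url) + "  " + url);
        }
    }
}
```

Under these assumptions only the first URL is reported as excluded. The
pattern needs one path segment to come back at every other position, so
repetitions with a longer period, or at irregular positions, are not matched
by this rule alone.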

Re: Re: Funky duplicate url's, getting much worse!

Posted by Julien Nioche <li...@gmail.com>.

Hi guys,

IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
you please open a JIRA and attach a patch for the TestOutlinkExtractor so
that we can reproduce the problem?

Thanks

Julien


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
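
For anyone putting together a reproduction case, the sketch below shows how
plain relative-URL resolution alone can make a URL grow on every cycle. It is
an illustration under stated assumptions, not a diagnosis and not Nutch or
Tika code: java.net.URI stands in for whatever resolution the parser really
performs, and both the trailing slash on the base URL and the scheme-less href
value are hypothetical. The class and method names are made up as well.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Nutch or Tika.
class RelativeGrowthSketch {
    // Resolve the same href against each newly produced URL, imitating a
    // site that serves the same page (and the same links) for every path.
    static List<String> grow(String start, String href, int rounds) {
        List<String> produced = new ArrayList<>();
        URI current = URI.create(start);
        for (int i = 0; i < rounds; i++) {
            // An href without "http://" has no scheme, so it is treated
            // as a relative path and appended to the base path.
            current = current.resolve(href);
            produced.add(current.toString());
        }
        return produced;
    }

    public static void main(String[] args) {
        // Assumed inputs: base URL with a trailing slash, href without a scheme.
        List<String> urls = grow(
            "http://www.trouw.nl/opinie/columnisten/article2018983.ece/",
            "www.trouw.nl/nieuws/economie/",
            3);
        for (String url : urls) {
            System.out.println(url);
        }
    }
}
```

Each round appends another www.trouw.nl/nieuws/economie/ to the previous
result, which resembles the growth seen in the fetch logs. Whether the pages
in question really contain such href values is not established in this
thread.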



RE: Re: Funky duplicate url's, getting much worse!

Posted by Markus Jelsma <ma...@buyways.nl>.

Hello Mathijs,

 

I inspected the code base and found that the problem is most likely in the parse-tika code, where the text is extracted and the OutlinkExtractor is called. The OutlinkExtractor uses a regular expression that can output a lot of garbage. I've added a test to TestOutlinkExtractor which shows that at least one URL does not pass, but it does not point me in the right direction for solving the relative path problem.
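
To illustrate what "garbage" from regex-based extraction over plain text can look like, here is a minimal sketch. The pattern below is hypothetical and deliberately loose; it is not the pattern Nutch's OutlinkExtractor uses, and the sample text is made up for the example.

```python
import re

# Hypothetical, deliberately loose pattern: "<scheme>:<anything up to whitespace>".
# This is NOT the OutlinkExtractor pattern, only an illustration.
LOOSE_URL = re.compile(r'\b[A-Za-z][A-Za-z0-9+.-]*:[A-Za-z0-9/][^\s<>"]*')


def extract_candidates(text):
    """Return every substring of the plain text that the loose pattern accepts."""
    return LOOSE_URL.findall(text)


sample = "See http://www.trouw.nl/nieuws/economie or Re:Funky at 12:30"
print(extract_candidates(sample))
# ['http://www.trouw.nl/nieuws/economie', 'Re:Funky']
```

The real URL is found, but so is "Re:Funky", which only looks like a scheme followed by a path. Anything extracted this way still has to be validated before it goes into the CrawlDB.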

 

Unless someone knows, I'll try to find out how the OutlinkExtractor works with the current base URL, because a plain relative URL in the test will obviously fail.
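
For reference, this is how standard relative-reference resolution behaves when link text without a scheme is treated as a relative URL. It is only a sketch of one possible mechanism, written in Python with `urllib.parse.urljoin` rather than the Nutch code path. The trailing slash on the base and the `grow` helper are assumptions, chosen to mimic the URLs seen in the fetch log.

```python
from urllib.parse import urljoin

# Assumption: the base is treated as a directory (trailing slash).
BASE = "http://www.trouw.nl/opinie/columnisten/article2018983.ece/"
# Link text without a scheme, so it is resolved as a relative path.
LINK = "www.trouw.nl/nieuws/economie"


def grow(base, link, rounds):
    """Resolve the same scheme-less link against its own result, once per crawl round."""
    url = base
    for _ in range(rounds):
        # Hypothetical step: each fetched page is again treated as a directory.
        url = urljoin(url if url.endswith("/") else url + "/", link)
    return url


print(grow(BASE, LINK, 1))
print(grow(BASE, LINK, 2))
# With an explicit scheme the link is absolute and the base is ignored.
print(urljoin(BASE, "http://www.trouw.nl/nieuws/economie"))
```

Each round appends another copy of the link to the path, which matches the way the URLs grow per fetch, parse and updateDB cycle. Whether this is what actually happens inside the parser is still to be confirmed.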

 

Thanks for the pointer =)

 

Cheers,
 
-----Original message-----
From: Mathijs Homminga <ma...@knowlogy.nl>
Sent: Tue 28-09-2010 21:01
To: user@nutch.apache.org; 
Subject: Re: Funky duplicate url's, getting much worse!

Hi Markus,

I remember Nutch had some trouble honoring the page's BASE tag when resolving relative outlinks.
However, I don't see the BASE tag being used in the HTML pages you provided, so that might not be it.

Mathijs


On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:

> Anyone? Where is a proper solution for this issue? As expected, the regex won't catch all imaginable kinds of funky URLs that somehow end up in the CrawlDB. Before the weekend, I added another news site to the tests I conduct and let it run continuously. Unfortunately, the generator now comes up with all kinds of completely useless URLs. They do resolve, but only because the web application ignores most parts of the URL.
> 
>  
> 
> This is the URL that should be considered the proper one:
> 
> http://www.blikopnieuws.nl/nieuwsblok
> 
>  
> 
> Here are two URLs that are completely useless:
> 
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> 
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> 
>  
> 
> It is very hard to use deduplication on these, simply because the content changes too much as time progresses (the latest news block, for example). It is therefore necessary to keep these URLs from ending up in the CrawlDB, so as not to waste disk space, CrawlDB update time and a huge amount of bandwidth. My current fetch is probably going to waste at least a few GB.
> 
>  
> 
> Looking at the HTML source, it looks like the parser cannot properly handle relative URLs. It is, of course, quite ugly for a site to do this, but the parser must not fool itself and come up with URLs that really aren't there. Combined with the issue I began the thread with, I believe the following two problems are present. The parser returns imaginary (false) URLs because of:
> 
> 1. relative hrefs;
> 
> 2. URLs in anchors (that is, the XML element's body) next to the href attribute.
> 
>  
> 
> Please help in finding the source of the problem (Tika? Nutch?) and how to proceed in having it fixed so other users won't waste bandwidth, disk space and CPU cycles =)
> 
>  
> 
>  
> 
>  
> 
> Oh, here's a snippet of the fetch job that's currently running. Also notice the news item with the 119039 ID: it's the same as above, although that copy/paste was 15 minutes ago. Most item IDs you see below keep recurring in the current log output.
> 
>  
> 
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> 
>  
> -----Original message-----
> From: Markus Jelsma <ma...@buyways.nl>
> Sent: Wed 22-09-2010 20:47
> To: user@nutch.apache.org; 
> Subject: RE: Re: Funky duplicate url's
> 
> Thanks! I've already implemented a similar (but not as generic) regex to catch these URLs. But it is, of course, not a proper solution to patch over a parsing problem with after-the-fact regexes. Please correct me if I'm wrong, but I'm quite sure those URLs are not to be found in the HTML sources. It had better be fixed where the problem seems to be.
> 
>  
> 
> I'll test your regex, but I'd still like to know where the exact problem lies and hopefully fix it or help fix it.
> 
>  
> 
> Thanks
>  
> -----Original message-----
> From: AJ Chen <aj...@web2express.org>
> Sent: Wed 22-09-2010 20:29
> To: user@nutch.apache.org; 
> Subject: Re: Funky duplicate url's
> 
> the conf/regex-urlfilter.txt file has an exclusion rule that should skip
> these viral urls.
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> -aj
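
A quick way to see what this rule does and does not catch is to run the same expression over a few URLs from this thread. This is a sketch in Python, not the Nutch filter itself: it assumes the rule is applied as an unanchored search, the `is_excluded` helper is hypothetical, and the two long URLs are shortened to three repetitions.

```python
import re

# The exclusion rule from conf/regex-urlfilter.txt, without the leading "-".
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url):
    """True if the rule matches somewhere in the URL (assumed search semantics)."""
    return LOOP_RULE.search(url) is not None


# Repeating unit of two segments: /www.microsoft.com/office
MS = ("http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/"
      "article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/"
      "www.microsoft.com/office/antivirus")
# Repeating unit of three segments: /www.invest.is/Key-Sectors/Data-Centers-in-Iceland
INVEST = ("http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/"
          "www.invest.is/Key-Sectors/Data-Centers-in-Iceland/"
          "www.invest.is/Key-Sectors/Data-Centers-in-Iceland/"
          "www.invest.is/Key-Sectors/Data-Centers-in-Iceland/")
# Irregular repetition from the blikopnieuws.nl fetch log.
BLIK = ("http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/"
        "bericht/119033/bericht/119047/economie")

for url in (MS, INVEST, BLIK):
    print(is_excluded(url), url[:60])
```

The rule only fires when the same segment shows up at every second position three times in a row (/X/Y/X/Z/X/). That covers the two-segment loop, but a three-segment loop or an irregular mix of segments slips through, so the filter alone cannot keep all of these URLs out.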
> 
> On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <ma...@buyways.nl> wrote:
> 
>> Well, using a regex to catch these troublemakers isn't going to be useful.
>> Although I caught the first faulty URLs, there can be many more and it's
>> unpredictable; here's just a random pick from the list of errors:
>> 
>> 
>> 
>> 
>> 
>> 
>> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
>> 
>> 
>> 
>> 
>> 
>> Here's another very disturbing URL it's trying to fetch:
>> 
>> 
>> 
>> 
>> 
>> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
>> 
>> 
>> 
>> 
>> 
>> It seems these bad URLs are somehow found by the parser and get fetched
>> the next time, and the time after that, making the URL grow longer and
>> longer with each fetch, parse and updateDB cycle.
>> 
>> 
>> 
>> 
>> 
>> 
>> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
>> 
>> 
>> 
>> 
>> 
>> This doesn't look good at all. Anyone got a suggestion or some pointer?
>> 
>> 
>> 
>> 
>> 
>> 
>> -----Original message-----
>> From: Markus Jelsma <ma...@buyways.nl>
>> Sent: Wed 22-09-2010 12:12
>> To: user@nutch.apache.org;
>> Subject: Funky duplicate url's
>> 
>> Hi,
>> 
>> 
>> 
>> This is not about deduplication, but about preventing certain url's to end
>> up in the CrawlDB. I'm crawling a news site for testing purposes, it has the
>> usual categories etc. News item pages feature a gray text block that's got
>> some url's as well. See
>> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
>> 
>> 
>> 
>> The problem is, the parser somehow manages to concatenate the href with the
>> inner anchor text (which happens to be an url as you can see). So,
>> subsequent fetches are completely messed up, i'm almost only fetching
>> duplicates:
>> 
>> 
>> 
>> fetching
>> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
>> fetching
>> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
>> fetching
>> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
>> 
>> 
>> 
>> This is not desired behavior, as you'd expect. The question is, where to
>> fix and how to fix it? Is it a problem with the parser? Or is it fixable
>> using some freaky url filter for this one?
>> 
>> 
>> 
>> 
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA

Re: Funky duplicate url's, getting much worse!

Posted by Mathijs Homminga <ma...@knowlogy.nl>.

Hi Markus,

I remember Nutch had some trouble honoring the page's BASE tag when resolving relative outlinks.
However, I don't see a BASE tag being used in the HTML pages you provided, so that might not be it.
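
To make the BASE tag remark concrete, here is a minimal sketch in Python (not Nutch code) of how the same relative href resolves differently depending on whether a BASE href is honored. The page URL, the href "archief/" and the BASE value are assumptions chosen for illustration; the actual markup of the sites is not shown in this thread.

```python
from urllib.parse import urljoin

def resolve(page_url, href, base_href=None):
    """Resolve href against the BASE href when one is given, else the page URL."""
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)

# Hypothetical page URL and relative href, for illustration only.
page = "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/"

# Without a BASE tag the href is appended below the page's own path.
print(resolve(page, "archief/"))

# With an assumed BASE href pointing at the site root, the path does not grow.
print(resolve(page, "archief/", base_href="http://www.blikopnieuws.nl/"))
```

A parser that ignored a BASE tag would produce the first form where the site intended the second, which is why this would be worth ruling out.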

Mathijs


On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:

> Anyone? Is there a proper solution for this issue? As expected, the regex doesn't catch every kind of funky URL that somehow ends up in the CrawlDB. Before the weekend, I added another news site to my tests and let the crawl run continuously. Unfortunately, the generator now comes up with all kinds of completely useless URLs. They do resolve, but only because the web application ignores most of the URL path.
> 
>  
> 
> This is the URL that should be considered the proper one:
> 
> http://www.blikopnieuws.nl/nieuwsblok
> 
>  
> 
> Here are two URLs that are completely useless:
> 
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> 
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> 
>  
> 
> It is very hard to use deduplication on these, simply because the content changes too much as time progresses (the latest-news block, for example). It is therefore necessary to keep these URLs from ending up in the CrawlDB, so as not to waste disk space, CrawlDB update time and a huge amount of bandwidth. My current fetch is probably going to waste at least a few GB.
> 
>  
> 
> Looking at the HTML source, it looks like the parser cannot properly handle relative URLs. It is, of course, quite ugly for a site to do this, but the parser must not fool itself and come up with URLs that really aren't there. Combined with the issue I began the thread with, I believe two problems are present. The parser returns imaginary (false) URLs because of:
> 
> 1. relative hrefs;
> 
> 2. URLs in anchors (that is, the element's body) next to the href attribute.
> 
>  
> 
> Please help in finding the source of the problem (Tika? Nutch?) and how to proceed in having it fixed so other users won't waste bandwidth, disk space and CPU cycles =)
> 
>  
> 
>  
> 
>  
> 
> Oh, here's a snippet of the fetch job that's currently running. Notice the news item with ID 119039: it's the same one as above, although that copy/paste was 15 minutes ago. Most item IDs you see below keep returning in the current log output.
> 
>  
> 
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> 
>  

Re: Funky duplicate url's, getting much worse!

Posted by Markus Jelsma <ma...@buyways.nl>.

Anyone? Is there a proper solution for this issue? As expected, the regex doesn't catch every kind of funky URL that somehow ends up in the CrawlDB. Before the weekend, I added another news site to my tests and let the crawl run continuously. Unfortunately, the generator now comes up with all kinds of completely useless URLs. They do resolve, but only because the web application ignores most of the URL path.

 

This is the URL that should be considered the proper one:

http://www.blikopnieuws.nl/nieuwsblok

 

Here are two URLs that are completely useless:

http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie

http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/

 

It is very hard to use deduplication on these, simply because the content changes too much as time progresses (the latest-news block, for example). It is therefore necessary to keep these URLs from ending up in the CrawlDB, so as not to waste disk space, CrawlDB update time and a huge amount of bandwidth. My current fetch is probably going to waste at least a few GB.

 

Looking at the HTML source, it looks like the parser cannot properly handle relative URLs. It is, of course, quite ugly for a site to do this, but the parser must not fool itself and come up with URLs that really aren't there. Combined with the issue I began the thread with, I believe two problems are present. The parser returns imaginary (false) URLs because of:

1. relative hrefs;

2. URLs in anchors (that is, the element's body) next to the href attribute.
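
As an illustration of the first point, the sketch below shows one way ever-longer URLs can arise from perfectly standard relative resolution (RFC 3986), with no parser bug involved. It assumes the page uses relative hrefs ending in a slash and that the server answers any path depth with the same page. Both the hrefs and that server behaviour are assumptions made for the example, not something established in this thread.

```python
from urllib.parse import urljoin

# Hypothetical relative hrefs; the real markup is not shown in this thread.
RELATIVE_HREFS = ["archief/", "bericht/119039/"]

def follow(start, hrefs):
    """Resolve each href against the URL produced by the previous step."""
    urls = [start]
    for href in hrefs:
        urls.append(urljoin(urls[-1], href))
    return urls

# Follow the same two links twice in a row, as a crawler would over
# successive fetch/parse/updatedb cycles.
chain = follow("http://www.blikopnieuws.nl/nieuwsblok/", RELATIVE_HREFS * 2)
for url in chain:
    print(url)
```

Each step is a correct resolution on its own, yet the path grows with every cycle, which resembles the log output further down. If this is what happens, the fix belongs in URL filtering or normalization rather than in the parser.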

 

Please help in finding the source of the problem (Tika? Nutch?) and how to proceed in having it fixed so other users won't waste bandwidth, disk space and CPU cycles =)

 

 

 

Oh, here's a snippet of the fetch job that's currently running. Notice the news item with ID 119039: it's the same one as above, although that copy/paste was 15 minutes ago. Most item IDs you see below keep returning in the current log output.

 

fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
-activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
-activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/

 
-----Original message-----
From: Markus Jelsma <ma...@buyways.nl>
Sent: Wed 22-09-2010 20:47
To: user@nutch.apache.org; 
Subject: RE: Re: Funky duplicate url's

Thanks! I've already implemented a similar (but not as generic) regex to catch these URLs. But it is, of course, not a proper solution to patch a parsing problem with after-the-fact regexes. Please correct me if I'm wrong, but I'm quite sure those URLs are not to be found in the HTML sources. It would be better to fix this where the problem seems to be.

 

I'll test your regex, but I'd still like to know where the exact problem lies and hopefully fix it or help fix it.

 

Thanks
 
-----Original message-----
From: AJ Chen <aj...@web2express.org>
Sent: Wed 22-09-2010 20:29
To: user@nutch.apache.org; 
Subject: Re: Funky duplicate url's

the conf/regex-urlfilter.txt file has an exclusion rule that should skip
these viral urls.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

-aj
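
The rule above can be tried outside Nutch with a short Python sketch. It assumes the filter applies the pattern with search semantics (a match anywhere in the URL), and it uses shortened versions of two URLs from this thread. The second function, has_repeated_segment, is a hypothetical broader heuristic added for comparison; it is not part of Nutch and could reject legitimate URLs.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# The exclusion rule quoted above, without the leading "-" that marks it
# as an exclusion in regex-urlfilter.txt.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

def rule_matches(url):
    """True if the quoted rule would reject the URL (search, not full match)."""
    return LOOP_RULE.search(url) is not None

def has_repeated_segment(url, limit=3):
    """Hypothetical broader check: any path segment seen `limit` or more times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(n >= limit for n in Counter(segments).values())

# Shortened versions of two URLs from this thread.
period_two = ("http://www.nrc.nl/dossiers/computerbeveiliging/virussen/"
              "melissa_maart_1999/article1513468.ece/www.microsoft.com/office/"
              "www.microsoft.com/office/www.microsoft.com/office/antivirus")
irregular = ("http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/"
             "archief/bericht/119033/bericht/119047/economie")

for url in (period_two, irregular):
    print(rule_matches(url), has_repeated_segment(url))
```

The rule only fires when the same segment recurs at every second position, so it catches the www.microsoft.com/office loop but not the irregular blikopnieuws.nl paths. The segment-count heuristic flags both.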

On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <ma...@buyways.nl> wrote:

> Well, using a regex to catch these troublemakers isn't going to be useful.
> Although i caught the first faulty url's, there can be many more and it's
> unpredictable; here's just a random pick from the list of errors:
>
>
>
>
>
>
> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
>
>
>
>
>
> Here's another very disturbing url it's trying to fetch:
>
>
>
>
>
> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
>
>
>
>
>
> It seems these bad URLs are somehow found by the parser and get fetched
> the next time, and the time after that, making the URL grow longer with
> each fetch, parse and updateDB cycle.
>
>
>
>
>
>
> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
>
>
>
>
>
> This doesn't look good at all. Anyone got a suggestion or some pointer?
>
>
>
>
>
>
> -----Original message-----
> From: Markus Jelsma <ma...@buyways.nl>
> Sent: Wed 22-09-2010 12:12
> To: user@nutch.apache.org;
> Subject: Funky duplicate url's
>
> Hi,
>
>
>
> This is not about deduplication, but about preventing certain url's to end
> up in the CrawlDB. I'm crawling a news site for testing purposes, it has the
> usual categories etc. News item pages feature a gray text block that's got
> some url's as well. See
> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
>
>
>
> The problem is, the parser somehow manages to concatenate the href with the
> inner anchor text (which happens to be an url as you can see). So,
> subsequent fetches are completely messed up, i'm almost only fetching
> duplicates:
>
>
>
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
>
>
>
> This is not desired behavior, as you'd expect. The question is, where to
> fix and how to fix it? Is it a problem with the parser? Or is it fixable
> using some freaky url filter for this one?
>
>
>
>
>
> Cheers,
>
>
>
>
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

RE: Re: Funky duplicate url's

Posted by Markus Jelsma <ma...@buyways.nl>.

Thanks! I've already implemented a similar (but less generic) regex to catch these URLs. But patching a parsing problem with regexes after the fact is, of course, not a proper solution. Please correct me if I'm wrong, but I'm quite sure those URLs are not present in the HTML sources. It would be better to fix this where the problem actually originates.

I'll test your regex, but I'd still like to know exactly where the problem lies and, hopefully, fix it or help fix it.

Thanks
 
-----Original message-----
From: AJ Chen <aj...@web2express.org>
Sent: Wed 22-09-2010 20:29
To: user@nutch.apache.org; 
Subject: Re: Funky duplicate url's

the conf/regex-urlfilter.txt file has an exclusion rule that should skip
these viral urls.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

-aj

-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: Funky duplicate url's

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

That pattern works nicely for some repeating URLs, but not all. I did manage
to find a pattern that looks for repeating substrings and modified it to match
3 out of 4 example URLs; the 4th URL is caught by your pattern, so together
they seem to cover everything.

The problem is, I'm not too familiar with regexes and the differences between
the PCRE and Java variants.

In PHP I came up with:

/(?=((.+)(.?\2{8,})+))/

which is meant to detect repeating substrings with a minimum length of 8
characters. It detects the following URLs:

http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
http://www.nrc.nl/dossiers/orkanen/slachtoffers_hulp/article1636844.ece/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/02krugman.html

The problem is, the pattern fails to match in Java! Does anyone here have any
insight into modifying the pattern so that it works with Java's regex library?
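
For comparison, here is a minimal Java sketch of one way to flag such URLs with java.util.regex. It is not a port of the PHP pattern above: it swaps in a simpler expression, (/.{7,}?)\1{2,}, which looks for a chunk of at least 8 characters starting with a slash that is immediately followed by two or more copies of itself. The class name RepeatedChunkDetector, the method hasRepeatedChunk, the thresholds and the example.org URLs are all assumptions made for illustration. Two points are certain regardless: Java patterns are written without the PHP /.../ delimiters, and in \2{8,} the quantifier applies to the backreference, so it asks for eight or more repetitions of group 2 rather than for a group of eight or more characters.

```java
import java.util.regex.Pattern;

public class RepeatedChunkDetector {
    // A chunk of 8+ characters starting with '/', followed by at least two more
    // consecutive copies of itself (three or more copies in total).
    // Backslashes are doubled because this is a Java string literal.
    static final Pattern REPEATED_CHUNK = Pattern.compile("(/.{7,}?)\\1{2,}");

    // Unanchored search: true when such a repeated chunk occurs anywhere in the URL.
    static boolean hasRepeatedChunk(String url) {
        return REPEATED_CHUNK.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical looping URL: the chunk "/www.example.is/k/d" occurs three times in a row.
        String looping = "http://example.org/a.ece/www.example.is/k/d"
                + "/www.example.is/k/d/www.example.is/k/d/";
        // Hypothetical ordinary URL without any repetition.
        String normal = "http://example.org/nieuws/economie/article1504915.ece";

        System.out.println(hasRepeatedChunk(looping)); // true
        System.out.println(hasRepeatedChunk(normal));  // false
    }
}
```

If something like this were tried as a rule in conf/regex-urlfilter.txt, it would be written with single backslashes and a leading '-'. Whether its backtracking cost is acceptable on very long URLs would have to be measured.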

Cheers,

On Wednesday 22 September 2010 20:29:01 AJ Chen wrote:
> the conf/regex-urlfilter.txt file has an exclusion rule that should skip
> these viral urls.
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> -aj

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350

Re: Funky duplicate url's

Posted by AJ Chen <aj...@web2express.org>.

The conf/regex-urlfilter.txt file has an exclusion rule that should skip
these viral URLs.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

-aj
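
To see which URLs the rule above catches, here is a minimal Java sketch. It assumes the filter applies each rule as an unanchored search (Matcher.find()). The class name RepeatSegmentRule, the method isRejected and the example.org URLs are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

public class RepeatSegmentRule {
    // The rule from conf/regex-urlfilter.txt, minus the leading '-' (exclude) sign.
    // Backslashes are doubled because this is a Java string literal.
    static final Pattern RULE = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: a URL is rejected when the rule is found anywhere in it.
    static boolean isRejected(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical URL whose loop repeats every two segments: /A/x/A/x/A/
        String periodTwo = "http://example.org/a.ece/www.example.com/office"
                + "/www.example.com/office/www.example.com/office/x";
        // Hypothetical URL whose loop repeats every three segments: /A/x/y/A/x/y/A/x/y/
        String periodThree = "http://example.org/a.ece/www.example.is/k/d"
                + "/www.example.is/k/d/www.example.is/k/d/";

        System.out.println(isRejected(periodTwo));   // true
        System.out.println(isRejected(periodThree)); // false
    }
}
```

The rule only fires when the same segment reappears at every second position (/A/x/A/y/A/), so a loop with a longer period slips through. That may be why it catches some of the looping URLs reported in this thread but not all of them.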

On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <ma...@buyways.nl> wrote:

> Well, using a regex to catch these troublemakers isn't going to be useful.
> Although i caught the first faulty url's, there can be many more and it's
> unpredictable; here's just a random pick from the list of errors:
>
>
>
>
>
>
> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
>
>
>
>
>
> Here's another very disturbing url it's trying to fetch:
>
>
>
>
>
> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
>
>
>
>
>
> It seems these bad URLs are somehow found by the parser and fetched again
> on the next cycle, so the URL grows longer with every fetch, parse and
> updateDB cycle.
>
>
>
>
>
>
> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
>
>
>
>
>
> This doesn't look good at all. Anyone got a suggestion or some pointer?
>
>
>
>
>
>
> -----Original message-----
> From: Markus Jelsma <ma...@buyways.nl>
> Sent: Wed 22-09-2010 12:12
> To: user@nutch.apache.org;
> Subject: Funky duplicate url's
>
> Hi,
>
>
>
> This is not about deduplication, but about preventing certain url's to end
> up in the CrawlDB. I'm crawling a news site for testing purposes, it has the
> usual categories etc. News item pages feature a gray text block that's got
> some url's as well. See
> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
>
>
>
> The problem is, the parser somehow manages to concatenate the href with the
> inner anchor text (which happens to be an url as you can see). So,
> subsequent fetches are completely messed up, i'm almost only fetching
> duplicates:
>
>
>
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
>
>
>
> This is not desired behavior, as you'd expect. The question is, where to
> fix and how to fix it? Is it a problem with the parser? Or is it fixable
> using some freaky url filter for this one?
>
>
>
>
>
> Cheers,
>
>
>
>
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

RE: Funky duplicate url's

Posted by Markus Jelsma <ma...@buyways.nl>.

Well, using a regex to catch these troublemakers isn't going to be useful. Although I caught the first faulty URLs, there can be many more and they are unpredictable; here's a random pick from the list of errors:

 

 

http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
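The host name inside these URLs is unpredictable, but the URL above does share one trait: a run of path segments repeats back to back. The following is a minimal Python sketch of a generic check for that trait. It is not part of Nutch; the function name has_repeated_run and the threshold min_repeats are assumptions made for illustration, and the sample URL is a shortened form of the one above.

```python
from urllib.parse import urlsplit


def has_repeated_run(url, min_repeats=3):
    """Return True if some run of path segments occurs min_repeats times in a row.

    Hypothetical helper, not a Nutch API. It ignores which host or path is
    repeated, so it does not need one hand-written pattern per site.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    n = len(segments)
    # Try every run length that could fit min_repeats times into the path.
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            run = segments[start:start + size]
            # Compare the run with each of the following blocks of equal size.
            if all(
                segments[start + k * size:start + (k + 1) * size] == run
                for k in range(1, min_repeats)
            ):
                return True
    return False


# Shortened form of the faulty URL above (three repetitions kept).
bad = (
    "http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/"
    "www.invest.is/Key-Sectors/Data-Centers-in-Iceland/"
    "www.invest.is/Key-Sectors/Data-Centers-in-Iceland/"
    "www.invest.is/Key-Sectors/Data-Centers-in-Iceland/"
)
good = "http://www.trouw.nl/opinie/columnisten/article2018983.ece"

print(has_repeated_run(bad))
print(has_repeated_run(good))
```

A check like this would only hide the symptom: it rejects a URL once the repetition is already visible and does not explain why the links are generated in the first place. A legitimate URL that happens to repeat a segment three times in a row would also be rejected.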

 

 

Here's another very disturbing url it's trying to fetch:

 


http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/

 

 

It seems these bad URLs are somehow found by the parser and fetched again on the next cycle, so the URL grows longer with every fetch, parse and updateDB cycle.

 

 

http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
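One possible explanation for this growth, which nothing in this thread confirms, is that the page contains an href without a scheme (for example "www.vendor.example/office/antivirus" rather than "http://www.vendor.example/office/antivirus"). Such an href is a relative reference, so it gets resolved against the URL of the page it was found on. If the server also returns the same page for any path below the article, every cycle would find the same link and resolve it against a longer base. The Python sketch below only simulates that assumption with the standard library's urljoin; the host names, the href and the function simulate_cycles are hypothetical.

```python
from urllib.parse import urljoin


def simulate_cycles(start_url, href, cycles):
    """Resolve the same scheme-less href against each newly produced URL.

    Hypothetical model of repeated fetch/parse cycles: it assumes every
    fetched URL returns a page containing the same relative href.
    """
    urls = []
    base = start_url
    for _ in range(cycles):
        # urljoin drops the last path segment of the base, then appends href.
        base = urljoin(base, href)
        urls.append(base)
    return urls


grown = simulate_cycles(
    "http://news.example/article1.ece/",    # hypothetical article URL
    "www.vendor.example/office/antivirus",  # hypothetical href without "http://"
    3,
)
for url in grown:
    print(url)
```

In this model each cycle keeps the directory part of the previous URL and appends the href again, so "www.vendor.example/office/" accumulates while the final segment stays "antivirus". That is the same shape as the URL above. If this is what happens, the outlinks are being resolved correctly and the malformed href in the page is the real cause, but that would need to be checked against the page source.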

 

 

This doesn't look good at all. Anyone got a suggestion or some pointer? 

 

 

 
-----Original message-----
From: Markus Jelsma <ma...@buyways.nl>
Sent: Wed 22-09-2010 12:12
To: user@nutch.apache.org; 
Subject: Funky duplicate url's

Hi,

 

This is not about deduplication, but about preventing certain url's to end up in the CrawlDB. I'm crawling a news site for testing purposes, it has the usual categories etc. News item pages feature a gray text block that's got some url's as well. See http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.

 

The problem is, the parser somehow manages to concatenate the href with the inner anchor text (which happens to be an url as you can see). So, subsequent fetches are completely messed up, i'm almost only fetching duplicates:

 

fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece

 

This is not desired behavior, as you'd expect. The question is, where to fix and how to fix it? Is it a problem with the parser? Or is it fixable using some freaky url filter for this one?

 

 

Cheers,