Posted to user@nutch.apache.org by Zabini <an...@actimage.com> on 2014/04/25 15:56:08 UTC

No outlink after a redirect

Hi,

I have built a plugin that sends POST requests using HttpClient 4.1, and I
have configured it to follow redirects.
Everything works, but I don't get any outlinks from the fetched content.

Where should I look to solve this problem?

Here is some log output:
2014-04-25 14:32:28,342 INFO  crawl.LinkDb - LinkDb: starting at 2014-04-25 14:32:28
2014-04-25 14:32:28,345 INFO  crawl.LinkDb - LinkDb: linkdb: redirectCrawlTest4/linkdb
2014-04-25 14:32:28,345 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2014-04-25 14:32:28,346 INFO  crawl.LinkDb - LinkDb: URL filter: true
2014-04-25 14:32:28,346 INFO  crawl.LinkDb - LinkDb: internal links will be ignored.
2014-04-25 14:32:28,346 INFO  crawl.LinkDb - LinkDb: adding segment: redirectCrawlTest4/segments/20140425143157
2014-04-25 14:32:30,889 INFO  regex.RegexURLNormalizer - can't find rules for scope 'linkdb', using default
2014-04-25 14:32:31,801 INFO  crawl.LinkDb - LinkDb: finished at 2014-04-25 14:32:31, elapsed: 00:00:03

Here is the dump from readdb
Recno:: 0
URL::
http://www.cadremploi.fr/emploi/fr.cadremploi.publi.page.recherche_offres.RechercheOffresCtrl/1571601239

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Apr 25 14:31:43 CEST 2014
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
	method=post
	_ngt_=1398429111854
	param=chk_fct:20500,mth:Rechercher,redirect:%2Femploi%2Frecherche_offres,provenance:2

ParseData::
Version: 5
Status: success(1,0)
Title: 
Outlinks: 0
Content Metadata: Expires=Sat, 26 Jul 1997 05:00:00 GMT _fst_=33
nutch.segment.name=20140425143157 Connection=close Server=Apache-Coyote/1.1
X-Cache=MISS from dumbledore-1 Cache-Control=no-store,no-cache
Pragma=no-cache nutch.content.digest=a74e846cc14dbd73369183d3def67e9c
Date=Fri, 25 Apr 2014 12:32:05 GMT Vary=Accept-Encoding
nutch.crawl.score=1.0 Via=1.0 www.cadremploi.fr Content-Type=text/html 
Parse Metadata: CharEncodingForConversion=windows-1252
OriginalCharEncoding=windows-1252 


This line worries me:
2014-04-25 14:32:30,889 INFO  regex.RegexURLNormalizer - can't find rules for scope 'linkdb', using default
but I don't know how to fix it.

Thanks for the help,
Zabini



--
View this message in context: http://lucene.472066.n3.nabble.com/No-outlink-after-a-redirect-tp4133122.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: No outlink after a redirect

Posted by Zabini <an...@actimage.com>.
After further investigation, it appears that this is not actually a problem.
The website simply does not allow its links to be followed; the page carries
this robots meta tag:
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
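In case it helps anyone seeing the same symptom (a successful parse with
"Outlinks: 0"), here is a minimal sketch of a standalone check for that tag in
fetched HTML. It is not Nutch's own code: the class name RobotsMetaCheck and
the method hasNoFollow are hypothetical, and the regex-based matching is a
simplification that assumes a quoted content attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: reports whether a page's robots
// meta tag asks crawlers not to follow its links.
public class RobotsMetaCheck {
    // Any <meta ...> tag, matched case-insensitively.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // name=robots, with or without quotes.
    private static final Pattern NAME_ROBOTS =
        Pattern.compile("\\bname\\s*=\\s*[\"']?robots\\b", Pattern.CASE_INSENSITIVE);
    // content="..." (assumes the value is quoted).
    private static final Pattern CONTENT_ATTR =
        Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    public static boolean hasNoFollow(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!NAME_ROBOTS.matcher(tag).find()) {
                continue; // some other meta tag
            }
            Matcher content = CONTENT_ATTR.matcher(tag);
            if (content.find()) {
                // Directives are comma-separated, e.g. "INDEX, NOFOLLOW".
                for (String directive : content.group(1).split(",")) {
                    String d = directive.trim();
                    // "none" is shorthand for "noindex, nofollow".
                    if (d.equalsIgnoreCase("nofollow") || d.equalsIgnoreCase("none")) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<META NAME=\"ROBOTS\" CONTENT=\"INDEX, NOFOLLOW\">"
            + "</head><body><a href=\"/jobs\">jobs</a></body></html>";
        System.out.println(hasNoFollow(page)); // prints true
    }
}
```

If this returns true for the page content, an empty outlink list is the
expected result rather than a fault in the fetch plugin or the redirect
handling.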

Best Regards,
Zabini



--
View this message in context: http://lucene.472066.n3.nabble.com/No-outlink-after-a-redirect-tp4133122p4133428.html