Posted to user@nutch.apache.org by feeyung <fe...@hotmail.com> on 2012/12/19 13:45:58 UTC

No URLs injected when using Nutch to crawl an HTTPS website

Hello Experts,

I am building a web crawler for our team's internal Trac website, which uses a
username/password to control access. I have read and followed the wiki page on
authentication scopes, and protocol-httpclient should be configured correctly,
since ParserChecker can validate the connection to our website and explicitly
lists all outlinks and URLs. However, the problems appear when I actually run
the crawl.
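
For reference, my protocol-httpclient setup follows the wiki example. Roughly
(the credentials, realm and port below are placeholders, not my real values):

conf/nutch-site.xml (protocol-http swapped for protocol-httpclient):

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

conf/httpclient-auth.xml (credentials plus the authentication scope):

  <auth-configuration>
    <credentials username="myuser" password="mypass">
      <authscope host="my.host.com" port="443" realm="trac"/>
    </credentials>
  </auth-configuration>

The ParserChecker run that succeeds is simply:

  bin/nutch org.apache.nutch.parse.ParserChecker https://my.host.com/trac/wiki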

The seed address I use: my.host.com/trac/wiki 
The Regex Filter I use: +^https://([a-z0-9]*\.)*my.host.com/trac/wiki
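
(For completeness: the seed line sits in urls/seed.txt, which I pass to the
injector, and the filter line is added to conf/regex-urlfilter.txt; those are
the standard locations as far as I know.)

  $ cat urls/seed.txt
  my.host.com/trac/wiki

  $ grep my.host conf/regex-urlfilter.txt
  +^https://([a-z0-9]*\.)*my.host.com/trac/wiki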

After enabling DEBUG level for httpclient and httpclient.auth in log4j, I could
not grep any WARN messages or failures related to httpclient or authentication.
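
(The DEBUG switches are just package-level loggers added to
conf/log4j.properties; assuming I picked the right packages, they look like
this:)

  log4j.logger.org.apache.commons.httpclient=DEBUG
  log4j.logger.org.apache.commons.httpclient.auth=DEBUG
  log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG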

The error message is similar to the ones in other posts, but I have not found a
clear solution for this situation:
2012-12-19 04:10:11,608 WARN  crawl.Generator - Generator: 0 records
selected for fetching, exiting ...
2012-12-19 04:10:11,608 INFO  crawl.Crawl - Stopping at depth=1 - no more
URLs to fetch.
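
In case it helps narrow things down, I also plan to check whether the seed was
injected into the crawldb at all; as I understand the readdb tool, that would
be something like (crawl/crawldb being the directory created by the crawl
command):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -dump crawldb-dump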


Any clues? 


Besides, I tried several other websites and found that crawling Yahoo has the
same problem, while www.acm.org does not. Our website and Yahoo fail in the
same way. The suspicious part is that the Content metadata printed by
ParserChecker shows path=/ for both of them. Could this disturb the generation
of the URL set?

Thank you!



