Posted to user@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/08/01 21:43:02 UTC

Sitemap function in 2.x version?

Dear fellow Nutch users/developers,

I've been trying to use the Nutch 2 sitemap function to crawl and index
all pages listed in the sitemap indices. It seems that integration with
the crawler-commons sitemap tools exists only in the 2.x branch. But
after I got it to work with HBase 1.2.3, it didn't fetch, parse, or
index the sitemap indices and sitemaps at all.

I also looked into the code a bit and everything seems to make sense,
except that I couldn't trace the data flow beyond ToolRunner.run() in
the FetchReducer. I'm testing it on Linux with the "crawl" script in
/bin, so I'm not sure how I can debug this. Please let me know if
there's any further information I can provide to help troubleshoot
this issue. Thanks in advance!

Best regards,

Michael