Posted to user@nutch.apache.org by "Reyes, Mark" <Ma...@bpiedu.com> on 2013/11/01 19:03:25 UTC

Exclude urls without 'www' from Nutch 1.7 crawl

I'm currently using Nutch 1.7 to crawl my domain. My issue is that URLs are being indexed both with and without the 'www' prefix.

Specifically, after running the crawl, indexing to Solr 4.5, and validating the results on the front-end with AJAX Solr, the search results page lists both 'www' and bare-domain URLs, such as:

www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html

My understanding is that the URL filtering rules (regex-urlfilter.txt) need modification. Could any regex/Nutch experts suggest a solution?
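For what it's worth, here is a minimal sketch of the kind of rule ordering that might work, assuming the stock first-match-wins semantics of regex-urlfilter.txt (a '-' line rejects, a '+' line accepts, first matching pattern decides) and using mywebsite.com as a placeholder domain. The equivalent file entries would be a reject line like -^https?://mywebsite\.com placed before the accept rules. The Python below only simulates that evaluation order so the patterns can be sanity-checked:

```python
import re

# Ordered rules in the spirit of regex-urlfilter.txt: '-' rejects, '+' accepts,
# and the first pattern that matches a URL decides its fate.
# mywebsite.com is a placeholder for the real domain.
RULES = [
    ("-", re.compile(r"^https?://mywebsite\.com")),       # reject bare-domain URLs
    ("+", re.compile(r"^https?://www\.mywebsite\.com")),  # accept www URLs
    ("-", re.compile(r".")),                              # reject everything else
]

def accepts(url):
    """Return True if the first matching rule is an accept ('+') rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

for url in ["http://www.mywebsite.com/page1.html",
            "http://mywebsite.com/page1.html"]:
    print(url, "->", "accepted" if accepts(url) else "rejected")
```

With these patterns the www URLs pass and the bare-domain ones are dropped; whether that is the right place to fix it (versus normalizing at index time) I'd defer to the experts here.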

Here is the code on Pastebin:
http://pastebin.com/Cp6vUxPR

Also on Stack Overflow:
http://stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl

I've also received suggestions to modify the domain-urlfilter.txt file.

If so, is it just a matter of adding 'www.mywebsite.com' as the last line of that file, or is it something more involved?
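If I've understood the docs correctly, that file lists one host or domain suffix per line and only URLs whose host matches an entry pass the filter, so restricting it to the www host would look something like the sketch below (assuming the urlfilter-domain plugin is enabled via the plugin.includes property in nutch-site.xml; mywebsite.com is again a placeholder):

```
# conf/domain-urlfilter.txt -- one host/domain per line; only matching hosts pass
www.mywebsite.com
```

Corrections welcome if the plugin's matching semantics are different.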

Thank you,
Mark
