Posted to user@nutch.apache.org by Scott Lundgren <sl...@qsfllc.com> on 2015/03/26 01:03:14 UTC
url-regexfilter & directory based sites
If my seeds file contains only http://www.bizjournals.com/triangle/ and url-regexfilter.txt contains
# whitelist
+^https?://www.bizjournals.com/triangle/blog/techflash/.*
# blacklist
-^https?://www.bizjournals.com/.*
will nutch crawl http://www.bizjournals.com/triangle/blog/techflash/ ?
The problem I’m trying to solve is that I want Nutch to crawl http://www.bizjournals.com/triangle/news/ and http://www.bizjournals.com/triangle/blog/techflash/ but ignore other URLs within the site, such as http://www.bizjournals.com/boston/, http://www.bizjournals.com/, and http://www.bizjournals.com/triangle/blog/.
Do the whitelist patterns overrule the patterns in the blacklist? Or do I need a more complex regex pattern that allows the subdirectories I’m interested in crawling while excluding their parent directories?
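To make the question concrete, here is a minimal Python sketch of how I understand the rule evaluation. It assumes the filter reads the rules top to bottom, that the first rule whose regex is found in the URL decides the outcome, and that a URL matching no rule is rejected. The function name accepts and the RULES list are hypothetical names for illustration only, not part of Nutch.

```python
import re

# Rules in file order as (sign, pattern) pairs.
# Assumption: the first matching rule decides; no match means reject.
RULES = [
    ("+", r"^https?://www.bizjournals.com/triangle/blog/techflash/.*"),
    ("-", r"^https?://www.bizjournals.com/.*"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treated as rejected under the stated assumption.
    return False


if __name__ == "__main__":
    for url in [
        "http://www.bizjournals.com/triangle/",
        "http://www.bizjournals.com/triangle/blog/techflash/",
        "http://www.bizjournals.com/triangle/blog/",
        "http://www.bizjournals.com/boston/",
    ]:
        print(url, accepts(url))
```

If those assumptions hold, the whitelist line wins for the techflash URLs only because it comes first in the file, and the seed URL http://www.bizjournals.com/triangle/ would itself be rejected by the blacklist line.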
Scott Lundgren
Software Engineer
(704) 973-7388
slundgren@qsfllc.com
QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226
Our Portfolio of Commercial Real Estate Solutions:
• Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
• Fairview Real Estate Solutions<http://www.fairviewres.com/>
• Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
• Tax Credit Asset Management<http://www.tcamre.com/>
• Radian Generation<http://www.radiangeneration.com/>
• EntityKeeper<http://www.entitykeeper.com/>™
• Crowd With Ease<http://www.crowdwithease.com>™
• FullCapitalStack<http://www.fullcapitalstack.com>™
• CrowdRabbit<http://www.crowdrabbit.com>™