Posted to user@nutch.apache.org by Scott Lundgren <sl...@qsfllc.com> on 2015/04/07 16:04:22 UTC
URL Structure & Rounds/Crawl Depth
Is Nutch’s Rounds/Crawl Depth relative to the URLs in seed.txt?
For example, if my seed.txt is http://www.bizjournals.com/triangle/ and I want to make sure that I’m crawling http://www.bizjournals.com/triangle/prnewswire/press_releases/.* and http://www.bizjournals.com/triangle/blog/techflash/.*, does my rounds need to be set to 2 (i.e., everything under /prnewswire/press_releases/ is crawled) or 3 (/triangle/prnewswire/press_releases/)?
Scott Lundgren
Software Engineer
(704) 973-7388
slundgren@qsfllc.com
QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226
Our Portfolio of Commercial Real Estate Solutions:
• Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
• Fairview Real Estate Solutions<http://www.fairviewres.com/>
• Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
• Tax Credit Asset Management<http://www.tcamre.com/>
• Radian Generation<http://www.radiangeneration.com/>
• EntityKeeper<http://www.entitykeeper.com/>™
• Crowd With Ease<http://www.crowdwithease.com>™
• FullCapitalStack<http://www.fullcapitalstack.com>™
• CrowdRabbit<http://www.crowdrabbit.com>™
Re: URL Structure & Rounds/Crawl Depth
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Scott,
Cycles/rounds/depth is roughly equivalent to the number of hops/links needed to reach
a document starting from one of the seeds. It has nothing to do with the
depth of the URL path in the server's file system hierarchy. If there is a link from
http://www.bizjournals.com/triangle/
to e.g.
http://www.bizjournals.com/triangle/blog/techflash/story.html
the latter document is crawled in the second round.
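As an editor's illustration of the hop counting above (not Nutch code, and the link graph below is hypothetical): a round behaves like a BFS level over links, so a URL with many path segments that is linked directly from the seed is still fetched in round 2.

```python
# Illustrative model: each crawl round fetches the pages linked from the
# pages fetched in the previous round, i.e. breadth-first search by link hops.
from collections import deque

# Hypothetical link graph: the seed links directly to a deeply nested URL.
links = {
    "http://www.bizjournals.com/triangle/": [
        "http://www.bizjournals.com/triangle/blog/techflash/story.html",
    ],
    "http://www.bizjournals.com/triangle/blog/techflash/story.html": [],
}

def round_fetched(seed, url):
    """Return the 1-based round in which `url` would be fetched, or None."""
    queue = deque([(seed, 1)])
    seen = {seed}
    while queue:
        page, rnd = queue.popleft()
        if page == url:
            return rnd
        for out in links.get(page, []):
            if out not in seen:
                seen.add(out)
                queue.append((out, rnd + 1))
    return None

# The nested story page is one hop from the seed, so it is fetched in
# round 2 -- its extra path segments play no role.
print(round_fetched("http://www.bizjournals.com/triangle/",
                    "http://www.bizjournals.com/triangle/blog/techflash/story.html"))
```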
The easiest way to limit the crawl by directory is regex URL filters.
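For the directories in the original question, a regex URL filter could look like the following sketch of conf/regex-urlfilter.txt (the patterns are illustrative; in this file, lines starting with + accept and lines starting with - reject, and the first matching pattern wins):

```
# Accept the two directory trees of interest:
+^http://www\.bizjournals\.com/triangle/prnewswire/press_releases/
+^http://www\.bizjournals\.com/triangle/blog/techflash/
# Accept the seed page itself so its outlinks can be discovered:
+^http://www\.bizjournals\.com/triangle/$
# Reject everything else:
-.
```

Note that with a filter this strict, pages in the target directories are only reachable if they are linked from the seed page or from each other; intermediate hub pages that fall outside these patterns would need their own accept lines.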
Sebastian
On 04/07/2015 04:04 PM, Scott Lundgren wrote:
> Is Nutch’s Rounds/Crawl Depth relative to the URLs in seed.txt?
>
> For example, if my seed.txt is http://www.bizjournals.com/triangle/ and I want to make sure that I’m crawling http://www.bizjournals.com/triangle/prnewswire/press_releases/.* and http://www.bizjournals.com/triangle/blog/techflash/.*, does my rounds need to be set to 2 (i.e., everything under /prnewswire/press_releases/ is crawled) or 3 (/triangle/prnewswire/press_releases/)?