Posted to user@nutch.apache.org by Scott Lundgren <sl...@qsfllc.com> on 2015/04/07 16:04:22 UTC
URL Structure & Rounds/Crawl Depth
Is Nutch’s Rounds/Crawl Depth relative to the URLs in seed.txt?
For example, if my seed.txt is http://www.bizjournals.com/triangle/ and I want to make sure that I’m crawling http://www.bizjournals.com/triangle/prnewswire/press_releases/.* and http://www.bizjournals.com/triangle/blog/techflash/.*, does my rounds need to be set to 2 (i.e., everything under /prnewswire/press_releases/ is crawled) or 3 (/triangle/prnewswire/press_releases/)?
Scott Lundgren
Software Engineer
(704) 973-7388
slundgren@qsfllc.com
QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226
Our Portfolio of Commercial Real Estate Solutions:
• Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
• Fairview Real Estate Solutions<http://www.fairviewres.com/>
• Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
• Tax Credit Asset Management<http://www.tcamre.com/>
• Radian Generation<http://www.radiangeneration.com/>
• EntityKeeper<http://www.entitykeeper.com/>™
• Crowd With Ease<http://www.crowdwithease.com>™
• FullCapitalStack<http://www.fullcapitalstack.com>™
• CrowdRabbit<http://www.crowdrabbit.com>™
Re: URL Structure & Rounds/Crawl Depth
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Scott,
Cycles/rounds/depth is roughly equivalent to the number of hops/links needed to reach
a document starting from one of the seeds. It has nothing to do with the
depth of the URL path in the server's file system hierarchy. If there is a link from
http://www.bizjournals.com/triangle/
to e.g.
http://www.bizjournals.com/triangle/blog/techflash/story.html
the latter document is crawled in the second round.
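As an editor's illustration of the hop counting above (not Nutch code, and the link graph below is hypothetical): a round behaves like a BFS level over links, so a URL with many path segments that is linked directly from the seed is still fetched in round 2.

```python
# Illustrative model: each crawl round fetches the pages linked from the
# pages fetched in the previous round, i.e. breadth-first search by link hops.
from collections import deque

# Hypothetical link graph: the seed links directly to a deeply nested URL.
links = {
    "http://www.bizjournals.com/triangle/": [
        "http://www.bizjournals.com/triangle/blog/techflash/story.html",
    ],
    "http://www.bizjournals.com/triangle/blog/techflash/story.html": [],
}

def round_fetched(seed, url):
    """Return the 1-based round in which `url` would be fetched, or None."""
    queue = deque([(seed, 1)])
    seen = {seed}
    while queue:
        page, rnd = queue.popleft()
        if page == url:
            return rnd
        for out in links.get(page, []):
            if out not in seen:
                seen.add(out)
                queue.append((out, rnd + 1))
    return None

# The nested story page is one hop from the seed, so it is fetched in
# round 2 -- its extra path segments play no role.
print(round_fetched("http://www.bizjournals.com/triangle/",
                    "http://www.bizjournals.com/triangle/blog/techflash/story.html"))
```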
The easiest way to limit the crawl by directory is regex URL filters.
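For the directories in the original question, a regex URL filter could look like the following sketch of conf/regex-urlfilter.txt (the patterns are illustrative; in this file, lines starting with + accept and lines starting with - reject, and the first matching pattern wins):

```
# Accept the two directory trees of interest:
+^http://www\.bizjournals\.com/triangle/prnewswire/press_releases/
+^http://www\.bizjournals\.com/triangle/blog/techflash/
# Accept the seed page itself so its outlinks can be discovered:
+^http://www\.bizjournals\.com/triangle/$
# Reject everything else:
-.
```

Note that with a filter this strict, pages in the target directories are only reachable if they are linked from the seed page or from each other; intermediate hub pages that fall outside these patterns would need their own accept lines.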
Sebastian
On 04/07/2015 04:04 PM, Scott Lundgren wrote:
> Is Nutch’s Rounds/Crawl Depth relative to the URLs in seed.txt?
>
> For example, if my seed.txt is http://www.bizjournals.com/triangle/ and I want to make sure that I’m crawling http://www.bizjournals.com/triangle/prnewswire/press_releases/.* and http://www.bizjournals.com/triangle/blog/techflash/.*, does my rounds need to be set to 2 (i.e., everything under /prnewswire/press_releases/ is crawled) or 3 (/triangle/prnewswire/press_releases/)?