
Posted to user@nutch.apache.org by Ryan Suarez <ry...@sheridancollege.ca> on 2019/04/09 19:08:57 UTC

Tracing crawled sites

Greetings,

We are running Nutch v1.5 with Solr v7.3.1.

I would like to determine how a specific site was crawled. Which
parent links did the Nutch crawler follow, all the way back to
the root?

Could someone let me know the best way to accomplish this?

Regards,
Ryan

Re: Tracing crawled sites

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.

Hi Ryan,

You may want to have a look at the scoring-depth plugin.
It tracks the depth (the number of links away from one of
the seeds) of each crawled page, and it could be modified to
also write the parents (maybe only the first) into the
CrawlDatum metadata.
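
To make the idea concrete, here is a minimal stand-alone sketch of the
bookkeeping involved. It does not use the Nutch API: the class
ParentTrace and its methods are hypothetical names, and the two maps
only stand in for what a modified plugin might store in the CrawlDatum
metadata. The assumptions are that only the first parent seen is kept
and that seeds start at depth 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not a Nutch class. */
class ParentTrace {
    // Stand-ins for per-URL metadata: depth and first parent.
    private final Map<String, Integer> depth = new HashMap<>();
    private final Map<String, String> firstParent = new HashMap<>();

    /** Register a seed URL: depth 1 (an assumption of this sketch), no parent. */
    void addSeed(String url) {
        depth.putIfAbsent(url, 1);
    }

    /**
     * Record a link fromUrl -> toUrl. The link is ignored if fromUrl is
     * unknown or if toUrl already has a depth, so only the first parent
     * is remembered and a parent is always shallower than its child.
     */
    void addOutlink(String fromUrl, String toUrl) {
        Integer parentDepth = depth.get(fromUrl);
        if (parentDepth == null || depth.containsKey(toUrl)) {
            return;
        }
        depth.put(toUrl, parentDepth + 1);
        firstParent.put(toUrl, fromUrl);
    }

    /** Depth of a URL, or -1 if the URL was never recorded. */
    int depthOf(String url) {
        return depth.getOrDefault(url, -1);
    }

    /** Chain of URLs from the given page back to its seed (page first). */
    List<String> traceToSeed(String url) {
        List<String> path = new ArrayList<>();
        if (!depth.containsKey(url)) {
            return path; // unknown URL: empty path
        }
        for (String cur = url; cur != null; cur = firstParent.get(cur)) {
            path.add(cur);
        }
        return path;
    }

    public static void main(String[] args) {
        ParentTrace trace = new ParentTrace();
        trace.addSeed("http://example.org/");
        trace.addOutlink("http://example.org/", "http://example.org/a");
        trace.addOutlink("http://example.org/a", "http://example.org/b");
        // Prints the page, its parent, and the seed, in that order.
        System.out.println(trace.traceToSeed("http://example.org/b"));
    }
}
```

In a real plugin the parent URL would presumably be written next to
the depth value when scores are distributed to outlinks, and the trace
would then be a sequence of CrawlDb lookups rather than map lookups.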

Best,
Sebastian
