Posted to user@nutch.apache.org by Ryan Suarez <ry...@sheridancollege.ca> on 2019/04/09 19:08:57 UTC
Tracing crawled sites
Greetings,
We are running Nutch v1.5 with Solr v7.3.1.
I would like to determine how a specific site was crawled. What were
the parent links that the nutch crawler followed all the way back to
the root?
Could someone let me know what is the best way to accomplish this?
regards,
Ryan
Re: Tracing crawled sites
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Ryan,
you may want to have a look at the scoring-depth plugin.
It tracks the depth of a crawled page (the number of links
away from one of the seeds) and could be modified to also
write the parents (maybe only the first) into the CrawlDatum
metadata.
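
To illustrate the idea of keeping only the first parent and following
it back to a seed, here is a minimal standalone sketch in Python. It is
not Nutch code and does not use the scoring-depth plugin or the
CrawlDatum API; the function names (record_first_parents,
trace_to_seed) and the input format (a list of seed URLs plus
(from_url, to_url) pairs in discovery order) are assumptions made for
the example only.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def record_first_parents(
    seeds: Iterable[str],
    links: Iterable[Tuple[str, str]],
) -> Dict[str, Optional[str]]:
    """Map every reachable URL to the first parent that linked to it.

    Seeds have no parent (None). `links` is assumed to be ordered the
    way the crawler discovered them (hypothetical input format).
    """
    parent: Dict[str, Optional[str]] = {seed: None for seed in seeds}
    for src, dst in links:
        # Keep only the first parent, and only when the source page is
        # itself already reachable from a seed. Because a URL is never
        # re-parented, the resulting map cannot contain cycles.
        if src in parent and dst not in parent:
            parent[dst] = src
    return parent


def trace_to_seed(url: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain [url, parent, grandparent, ..., seed]."""
    if url not in parent:
        raise KeyError(f"no recorded parent chain for {url}")
    path = [url]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


if __name__ == "__main__":
    seeds = ["http://example.com/"]
    links = [
        ("http://example.com/", "http://example.com/a"),
        ("http://example.com/a", "http://example.com/b"),
        ("http://example.com/", "http://example.com/b"),  # later parent, ignored
    ]
    parents = record_first_parents(seeds, links)
    for hop in trace_to_seed("http://example.com/b", parents):
        print(hop)
```

In a real plugin the "first parent" would be stored per page in the
metadata rather than in an in-memory dictionary, but the walk back to
the seed would follow the same pattern.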
Best,
Sebastian