Posted to user@nutch.apache.org by "Chaushu, Shani" <sh...@intel.com> on 2015/07/02 10:01:15 UTC

Parent URL

Hi,
I'm using Nutch 1.9 with Solr 4.10.
Is there any way to see in Solr, for each page, the parent/root page it came from?

Thanks,
Shani

---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Re: Parent URL

Posted by Julien Nioche <li...@gmail.com>.
Hi Shani

Tracking the seed URL which led to a given page is easy: you can add custom
metadata to the seeds, namely the seed URL itself, with the URL and the
metadata separated by a tab, e.g.

    http://www.guardian.co.uk	seed=http://www.guardian.co.uk

then specify 'seed' as a value for the config urlmeta.tags and configure it
for indexing. Look at the urlmeta plugin for that.
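
To make the seed line format concrete, here is a minimal plain-Java sketch of how such a line splits into a URL and its metadata. It is only an illustration, assuming tab-separated key=value fields; the SeedLine class and its parse method are hypothetical names and not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: models how a seed-file line of the
// form "url<TAB>key=value<TAB>key=value" splits into a URL and its metadata.
final class SeedLine {
    final String url;
    final Map<String, String> metadata = new LinkedHashMap<>();

    private SeedLine(String url) {
        this.url = url;
    }

    static SeedLine parse(String line) {
        String[] fields = line.trim().split("\t");
        SeedLine seed = new SeedLine(fields[0].trim());
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq > 0) { // skip fields that are not key=value pairs
                seed.metadata.put(fields[i].substring(0, eq).trim(),
                                  fields[i].substring(eq + 1).trim());
            }
        }
        return seed;
    }

    public static void main(String[] args) {
        SeedLine seed = parse("http://www.guardian.co.uk\tseed=http://www.guardian.co.uk");
        System.out.println(seed.url + " -> " + seed.metadata);
    }
}
```

With the example line above, the 'seed' key ends up holding the seed URL itself, which is the value you would then expose through urlmeta.tags.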

A slightly more elegant way of doing the same is to have a custom
ScoringFilter

    final Text seedK = new Text("seed");

    // add seed metadata which is then transferred to the outlinks
    public void injectedScore(Text url, CrawlDatum datum)
        throws ScoringFilterException {
      datum.getMetaData().put(seedK, new Text(url));
    }

which would create the custom metadata during the injection.

Tracking the immediate parent could also be done with a ScoringFilter,
probably in the method distributeScoreToOutlinks.
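
The difference between the two kinds of tracking can be sketched in plain Java. This is a simplified model and not the Nutch API: the metadata maps stand in for the metadata carried by a CrawlDatum, and LinkMetadataModel, outlinkMetadata and the "parent" key are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model, not the Nutch API: each URL carries a metadata map,
// much like the metadata on a CrawlDatum. All names here are hypothetical.
final class LinkMetadataModel {
    static final String SEED_KEY = "seed";
    static final String PARENT_KEY = "parent";

    // Builds the metadata an outlink would receive from the page linking to it.
    static Map<String, String> outlinkMetadata(String parentUrl,
                                               Map<String, String> parentMeta) {
        Map<String, String> meta = new HashMap<>();
        // The seed is inherited unchanged along the whole crawl path.
        String seed = parentMeta.get(SEED_KEY);
        if (seed != null) {
            meta.put(SEED_KEY, seed);
        }
        // The immediate parent is overwritten at every hop.
        meta.put(PARENT_KEY, parentUrl);
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> seedMeta = new HashMap<>();
        seedMeta.put(SEED_KEY, "http://www.guardian.co.uk");

        Map<String, String> hop1 =
            outlinkMetadata("http://www.guardian.co.uk", seedMeta);
        Map<String, String> hop2 =
            outlinkMetadata("http://www.guardian.co.uk/world", hop1);

        System.out.println("seed:   " + hop2.get(SEED_KEY));
        System.out.println("parent: " + hop2.get(PARENT_KEY));
    }
}
```

In a real ScoringFilter the same two steps would operate on the CrawlDatum metadata of each outlink rather than on plain maps.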

HTH

Julien





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: [MASSMAIL]Parent URL

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
If by parent/root page you mean the URL that links to the page (a kind of referrer), i.e. page A has a link to page B (A -> B), you can index the inlinks of each page in Solr. This can be done with a simple indexing filter; perhaps [1] will be helpful. It depends on what you're trying to do, though: if you want to track the seed URL that you put in your seed file and that led to the page you're indexing, then Julien's answer is better suited.
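
As a rough illustration of what "inlinks" means here, the following plain-Java sketch inverts a list of (from, to) links into a map from each page to the pages that link to it. It is a model only and not the Nutch API; InlinkIndex and invert are hypothetical names, and in Nutch itself the inlinks would be handed to the indexing filter rather than computed like this.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model, not the Nutch API: inverts outgoing links into inlinks.
final class InlinkIndex {
    // Each element of links is a pair {from, to}, i.e. "from" links to "to".
    static Map<String, Set<String>> invert(List<String[]> links) {
        Map<String, Set<String>> inlinks = new TreeMap<>();
        for (String[] link : links) {
            inlinks.computeIfAbsent(link[1], k -> new TreeSet<>()).add(link[0]);
        }
        return inlinks;
    }

    public static void main(String[] args) {
        List<String[]> links = Arrays.asList(
            new String[] {"A", "B"},   // A -> B
            new String[] {"C", "B"},   // C -> B
            new String[] {"A", "C"});  // A -> C
        // For each page, print the pages that point to it.
        invert(links).forEach((page, from) ->
            System.out.println(page + " <- " + from));
    }
}
```

Each page's set of inlinks is what you would write into a multi-valued Solr field for that page.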

Hope it helps,

[1] https://github.com/jorgelbg/links-extractor/
