Posted to user@nutch.apache.org by Harry Nutch <ha...@gmail.com> on 2010/06/19 10:29:03 UTC

Passing content from parent page to outlink page

Hi,

I have a scenario where some specific content that I'd like to store with a
sub-page is contained in the parent page that had the outlink to this
sub-page.
Is there a way to pass parsed content from the parent page to the outlinked
page, so that I can use it later when indexing the outlinked page?

Thanks,
Harry

Re: Passing content from parent page to outlink page

Posted by Dennis Kubes <ku...@apache.org>.
I think something like what you are describing would require a new Tool.
In concept it would be similar to the LinkRank tool, where the outlink score
is transferred from parent to child. You can use the WebGraph to get
the outlinks from a page. You can get the content or text from the segments'
Content or ParseText respectively. I can see having something like this:

   1. MR job one
         1. outlink and content as input
         2. No mapper.
         3. Reducer outputs outlink and content
   2. MR job two
         1. Job one output and X used as input
               1. Here X would be whatever data you want to change based
                  on the parent content. This could be Content,
                  ParseData, CrawlDb, etc. It would be keyed by the
                  outlink URL, i.e. the child page.
               2. An important thing to consider is that there may be
                  multiple "parent" pages for a single child page:
                  every page that has an outlink to it.
         2. Probably no mapper
         3. Reducer takes job one output and your X and does something
            with it

I can see altering the child page Content in segments based on the parent,
storing something in the child ParseData from segments, or altering the
CrawlDb. The action itself is up to you. The end result would then
flow through the Nutch job stream and end up in the Indexer.
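
The two jobs outlined above can be sketched in plain Java as an in-memory
join. This is only an illustration of the data flow, not Nutch or Hadoop
code: the class ParentContentJoin, the methods jobOne and jobTwo, and the
use of plain strings for content and for X are all assumptions made for the
example. A real Tool would read the WebGraph and segment data and run as
MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the two-job outline; not Nutch API.
public class ParentContentJoin {

    // Job one: regroup parent content under each outlink (child) URL.
    // A child linked from several parents collects one entry per parent.
    static Map<String, List<String>> jobOne(Map<String, List<String>> outlinksByParent,
                                            Map<String, String> contentByParent) {
        Map<String, List<String>> parentContentByChild = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinksByParent.entrySet()) {
            String content = contentByParent.get(e.getKey());
            if (content == null) {
                continue; // no content available for this parent
            }
            for (String child : e.getValue()) {
                parentContentByChild
                        .computeIfAbsent(child, k -> new ArrayList<>())
                        .add(content);
            }
        }
        return parentContentByChild;
    }

    // Job two: join job one output with X (here: one string per child URL)
    // and "do something with it"; this sketch simply appends the parent text.
    static Map<String, String> jobTwo(Map<String, List<String>> parentContentByChild,
                                      Map<String, String> childRecords) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> e : childRecords.entrySet()) {
            List<String> parents = parentContentByChild.getOrDefault(
                    e.getKey(), Collections.<String>emptyList());
            String merged = parents.isEmpty()
                    ? e.getValue()
                    : e.getValue() + " | parents: " + String.join("; ", parents);
            result.put(e.getKey(), merged);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new TreeMap<>();
        links.put("http://a/", Collections.singletonList("http://c/"));
        Map<String, String> content = new TreeMap<>();
        content.put("http://a/", "A text");
        Map<String, String> children = new TreeMap<>();
        children.put("http://c/", "C body");
        System.out.println(jobTwo(jobOne(links, content), children));
    }
}
```

The sketch keeps every parent's content for a child, which reflects the
point about multiple parents; how those entries are merged or filtered is
a design decision left to the reducer of job two.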

Dennis


Re: Passing content from parent page to outlink page

Posted by Harry Nutch <ha...@gmail.com>.
Thanks Julien.



Re: Passing content from parent page to outlink page

Posted by Julien Nioche <li...@gmail.com>.
Harry,

You could implement a custom ScoringFilter and pass the info you are
interested in to the sub-pages in the method distributeScoreToOutlinks.

I haven't tested it, but it should work.
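
The idea can be illustrated without Nutch on the classpath. The sketch
below assumes that, while outlinks are being processed, the filter has
access to the parent's parse metadata and to one record per outlink, and
it copies selected values from the former to the latter. The class
ParentMetadataPropagation, the OutlinkDatum stand-in, the "parent." key
prefix and the distributeToOutlinks method are hypothetical names for this
example; they are not the ScoringFilter interface itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of passing parent info to outlinks.
public class ParentMetadataPropagation {

    // Stand-in for a per-outlink record that carries its own metadata.
    static final class OutlinkDatum {
        final String url;
        final Map<String, String> metadata = new HashMap<>();

        OutlinkDatum(String url) {
            this.url = url;
        }
    }

    // Prefix marking values that came from the parent page (assumption).
    static final String PARENT_PREFIX = "parent.";

    // Copies the selected keys from the parent's metadata into every target.
    // Keys that the parent does not define are skipped.
    static void distributeToOutlinks(Map<String, String> parentParseMeta,
                                     List<String> keysToPass,
                                     List<OutlinkDatum> targets) {
        for (OutlinkDatum target : targets) {
            for (String key : keysToPass) {
                String value = parentParseMeta.get(key);
                if (value != null) {
                    target.metadata.put(PARENT_PREFIX + key, value);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> parentMeta = new HashMap<>();
        parentMeta.put("category", "news");

        List<OutlinkDatum> targets = new ArrayList<>();
        targets.add(new OutlinkDatum("http://example.com/child"));

        distributeToOutlinks(parentMeta, Arrays.asList("category"), targets);
        System.out.println(targets.get(0).metadata);
    }
}
```

In a real filter the copied values would still have to survive the later
update and indexing steps, and a child reached from several parents would
receive one such record per parent, so some merge rule would be needed.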

HTH

Julien




-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: Passing content from parent page to outlink page

Posted by Harry Nutch <ha...@gmail.com>.
I need this feature and would appreciate any pointers. I couldn't find
anything on the mailing list or in the documentation.

Thanks
Harry

