Posted to user@nutch.apache.org by Harry Nutch <ha...@gmail.com> on 2010/06/19 10:29:03 UTC
Passing content from parent page to outlink page
Hi,
I have a scenario where some specific content that I'd like to store with a
sub-page is contained in the parent page that has the outlink to this
sub-page.
Is there a way to pass parsed content from the parent page to the outlinked
page, so that I can use it later when indexing the outlinked page?
Thanks,
Harry
Re: Passing content from parent page to outlink page
Posted by Dennis Kubes <ku...@apache.org>.
I think something like what you are describing would require a new Tool.
In concept it would be similar to the LinkRank tool, where the outlink score
is transferred from parent to child. You can use the WebGraph to get
the outlinks from a page. You can get the content or text from the segments'
Content or ParseText, respectively. I can see having something like this:
1. MR job one
   1. Outlink and content as input.
   2. No mapper.
   3. Reducer outputs outlink and content.
2. MR job two
   1. Job one output and X used as input.
      1. Here X would be whatever input you want to change from
         the parent content. This could be Content, ParseData,
         CrawlDb, etc. It would be keyed off of the outlink URL
         and be the child page.
      2. An important thing to consider is that there would be
         multiple "parent" pages to a single child page: anybody
         who has an outlink to it.
   2. Probably no mapper.
   3. Reducer takes job one output and your X and does something
      with it.
I can see altering the child page Content in segments based on parent,
storing something in the child ParseData from segments, or altering the
CrawlDb. The action itself is up to you. The end result would then
flow through the Nutch job stream and end up in the Indexer.
Dennis
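The two-job flow outlined above can be sketched in plain Python, without Hadoop or Nutch. This is only an illustrative model under stated assumptions: the function names (job_one, job_two), the dict-based inputs, and the "parent_text" field are hypothetical and are not part of any Nutch API. Job one re-keys each parent's text by outlink URL; job two groups those records by child URL and attaches every parent's text to the child's record (the "X" input).

```python
from collections import defaultdict


def job_one(outlinks, parent_text):
    """Re-key parent text by outlink (child) URL.

    outlinks:    dict mapping parent URL -> list of child URLs (hypothetical input)
    parent_text: dict mapping parent URL -> parsed text (hypothetical input)
    Returns a list of (child_url, (parent_url, text)) records.
    """
    records = []
    for parent_url, children in outlinks.items():
        text = parent_text.get(parent_url)
        if text is None:
            continue  # no parsed text available for this parent
        for child_url in children:
            records.append((child_url, (parent_url, text)))
    return records


def job_two(job_one_output, child_records):
    """Group job-one records by child URL and attach them to each child record.

    child_records: dict mapping child URL -> dict of fields (stands in for "X").
    A child can have several parents, so the parent text is kept as a list.
    """
    grouped = defaultdict(list)
    for child_url, parent_info in job_one_output:
        grouped[child_url].append(parent_info)

    result = {}
    for child_url, record in child_records.items():
        enriched = dict(record)  # copy, so the input is left untouched
        # Sort by parent URL so the output order is deterministic.
        parents = sorted(grouped.get(child_url, []))
        enriched["parent_text"] = [text for _parent_url, text in parents]
        result[child_url] = enriched
    return result
```

Calling job_two(job_one(outlinks, parent_text), child_records) returns a copy of each child record with an added "parent_text" list. What to do with that list (merge it, pick one parent, store it in metadata) is the "action itself is up to you" part.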
Re: Passing content from parent page to outlink page
Posted by Harry Nutch <ha...@gmail.com>.
Thanks Julien.
On Mon, Jun 21, 2010 at 5:52 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:
> Harry,
>
> you could implement a custom ScoringFilter and pass the info you are
> interested in to the subpages in the method distributeScoreToOutlinks.
>
> I haven't tested it but it should work
>
> HTH
>
> Julien
>
Re: Passing content from parent page to outlink page
Posted by Julien Nioche <li...@gmail.com>.
Harry,
You could implement a custom ScoringFilter and pass the information you are
interested in to the sub-pages in the distributeScoreToOutlinks method.
I haven't tested it, but it should work.
HTH
Julien
On 21 June 2010 13:08, Harry Nutch <ha...@gmail.com> wrote:
> I needed this feature and would appreciate any pointers. I couldn't find
> anything on the mailing-list or the documentation.
>
> Thanks
> Harry
--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
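The ScoringFilter idea above can be modeled in a few lines of plain Python. This does not use the Nutch API: the function name, the dict-based metadata, and the "parent." key prefix are all assumptions made for illustration. The sketch only shows the shape of the technique, which is that while handling a parent page you copy selected fields from its parsed metadata onto the metadata of every outlink target.

```python
def distribute_to_outlinks(parent_meta, targets, keys):
    """Copy selected parent metadata onto each outlink target's metadata.

    parent_meta: dict of fields parsed from the parent page (hypothetical)
    targets:     list of (outlink_url, datum_meta) pairs; each datum_meta
                 dict is updated in place (hypothetical stand-in for the
                 per-outlink record a scoring step would receive)
    keys:        names of the parent fields to propagate
    """
    for _url, datum_meta in targets:
        for key in keys:
            if key in parent_meta:
                # Values are accumulated in a list because a child page
                # may be linked from more than one parent.
                datum_meta.setdefault("parent." + key, []).append(parent_meta[key])
    return targets
```

In a real plugin the propagated values would still need to be carried through to the indexing step; that part depends on the Nutch version and is not covered by this sketch.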
Re: Passing content from parent page to outlink page
Posted by Harry Nutch <ha...@gmail.com>.
I need this feature and would appreciate any pointers. I couldn't find
anything on the mailing list or in the documentation.
Thanks
Harry