Posted to user@nutch.apache.org by X3C TECH <te...@x3chaos.com> on 2012/08/03 13:43:00 UTC

Custom Meta Plugin

Hello,
I wanted to know at what point Nutch stops keeping the HTML page. My issue
is that I need to extract certain info from a page, for example:
<username>
<description>
<photo>
<profile link>
There may be multiple profiles on each page, and my understanding is that
Nutch currently has an issue with multiple page fields with the same name.
My thinking was based on https://issues.apache.org/jira/browse/NUTCH-978. I
was thinking of intercepting an HTML page and converting it to XML before
parsing. I'm assuming that this would fit between fetch and parse. I have a
few questions though:
1. Am I correct in thinking that Fetch is the last process that keeps a
full HTML page with all tags, etc. intact?
2. Does Nutch parse XML (I didn't see an explicit plugin for that)? And if
so, are there any known issues with multiple fields with the same name in
the XML tree? I see that Tika has one, but it seems to parse just like an
HTML page.
3. Does the Plugin Tutorial still apply to Nutch 2.0, or is it only for
previous versions?
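
To make the "multiple fields with the same name" problem concrete, here is a
minimal sketch of one way repeated profile blocks could be pulled out of a
page that has already been converted to well-formed XML. It uses only the
JDK's DOM and XPath classes, not the Nutch plugin API. The markup (a
div with class "profile", and the class names on the inner elements) and the
class ProfileExtractor are assumptions made up for illustration; a real site
will need different XPath expressions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class ProfileExtractor {

    /** Returns one field map per profile block, in document order. */
    public static List<Map<String, String>> extractProfiles(String xhtml) throws Exception {
        // The input must be well-formed XML; raw HTML would need cleaning first.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Assumed layout: every profile lives in <div class="profile">.
        NodeList blocks = (NodeList) xpath.evaluate(
                "//div[@class='profile']", doc, XPathConstants.NODESET);

        List<Map<String, String>> profiles = new ArrayList<>();
        for (int i = 0; i < blocks.getLength(); i++) {
            Node block = blocks.item(i);
            // Relative XPaths keep each value tied to its own profile block,
            // so same-named fields from different profiles never mix.
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("username", xpath.evaluate("span[@class='username']", block).trim());
            fields.put("description", xpath.evaluate("p[@class='description']", block).trim());
            fields.put("photo", xpath.evaluate("img/@src", block));
            fields.put("profileLink", xpath.evaluate("a[@class='profile-link']/@href", block));
            profiles.add(fields);
        }
        return profiles;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class='profile'><span class='username'>alice</span>"
                + "<p class='description'>First user</p><img src='/img/a.png'/>"
                + "<a class='profile-link' href='/u/alice'>profile</a></div>"
                + "<div class='profile'><span class='username'>bob</span>"
                + "<p class='description'>Second user</p><img src='/img/b.png'/>"
                + "<a class='profile-link' href='/u/bob'>profile</a></div>"
                + "</body></html>";
        for (Map<String, String> profile : extractProfiles(page)) {
            System.out.println(profile);
        }
    }
}
```

Grouping the values per block, rather than collecting all usernames into one
flat field, is the design choice here: it keeps each username paired with its
own description, photo and link. How such grouped records would then be
stored in Nutch's metadata is a separate question that this sketch does not
address.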

Thanks for your help

Iggy

Re: Custom Meta Plugin

Posted by X3C TECH <te...@x3chaos.com>.
Ferdy,
Great!! Thanks for your reply!!



Re: Custom Meta Plugin

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

About the fetch process: it is not necessarily the last place that holds
the entire DOM representation of a page (if that is what you mean by "full
page"). In fact, the DOM is only loaded during fetch when parsing during
fetch is set to true; otherwise it is not loaded at all. A separate
(re)parser job is able to load the DOM too.

Ferdy
