Posted to user@nutch.apache.org by a a <mb...@msn.com> on 2011/02/01 15:25:20 UTC
RE: parse-html plugin
Hi,
Is my question that difficult?
Does no one have an idea?
Thanks,
mehdi
> From: mbellil@msn.com
> To: user@nutch.apache.org
> Subject: RE: parse-html plugin
> Date: Mon, 31 Jan 2011 16:05:22 +0000
>
>
> Hi all,
>
> Any idea?
>
> mehdi
>
> > From: mbellil@msn.com
> > To: user@nutch.apache.org
> > Subject: parse-html plugin
> > Date: Thu, 27 Jan 2011 18:58:36 +0000
> >
> >
> > Hi,
> >
> > In the class HtmlParser I changed the 'text' variable to index only a part of my HTML page, and since then I have lost a lot of outlinks!
> >
> > ...
> > utils.getText(sb, extractIndexableContent(root)); // added on 26-01-2011 to extract only text inside <col_centre>
> > // utils.getText(sb, root); // extract text --- disabled on 26-01-2011
> >
> > text = sb.toString();
> > ...
> >
> > I believed that outlinks are not obtained from the 'text' variable?! In the same class we can see how outlinks are extracted:
> >
> > ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
> > URL baseTag = utils.getBase(root);
> > if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> > utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
> > outlinks = l.toArray(new Outlink[l.size()]);
> >
> > Can you please tell me what I did wrong?
> >
> > mehdi
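The two snippets quoted above read from the same DOM tree: the text comes from whatever node is passed to getText, while the outlinks are collected from root. A minimal, self-contained sketch of that split follows, using only the JDK's DOM classes instead of Nutch. The class SubtreeTextSketch and the helpers findById and collectLinks are hypothetical stand-ins (findById plays the role of extractIndexableContent, whose source is not shown in this thread). The last step shows one possible way links could go missing, namely an extraction step that removes nodes from the shared tree before the links are collected. Whether that is what happened here is an assumption, not something the thread confirms.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SubtreeTextSketch {

  /** Parses a small, well-formed XHTML string into a DOM tree. */
  static Document parse(String xhtml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
  }

  /** Hypothetical selector: returns the element with the given id, leaving the tree untouched. */
  static Node findById(Node node, String id) {
    if (node instanceof Element && id.equals(((Element) node).getAttribute("id"))) {
      return node;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node hit = findById(children.item(i), id);
      if (hit != null) {
        return hit;
      }
    }
    return null;
  }

  /** Collects the href of every <a> element at or below the given node. */
  static List<String> collectLinks(Node node) {
    List<String> links = new ArrayList<>();
    collect(node, links);
    return links;
  }

  private static void collect(Node node, List<String> links) {
    if (node instanceof Element) {
      Element e = (Element) node;
      if ("a".equalsIgnoreCase(e.getTagName()) && e.hasAttribute("href")) {
        links.add(e.getAttribute("href"));
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), links);
    }
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse("<html><body>"
        + "<div id=\"nav\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>"
        + "<div id=\"col_centre\">Main text <a href=\"/c\">C</a></div>"
        + "</body></html>");

    // Read-only selection: text from the subtree, links from the whole tree.
    Node centre = findById(doc, "col_centre");
    System.out.println(centre.getTextContent());     // prints "Main text C"
    System.out.println(collectLinks(doc).size());    // prints 3

    // A selector that prunes the shared tree changes what the link pass sees.
    Node nav = findById(doc, "nav");
    nav.getParentNode().removeChild(nav);
    System.out.println(collectLinks(doc).size());    // prints 1
  }
}
```

If the real extraction helper does prune or detach nodes, working on a copy (for example via Node.cloneNode(true)) or collecting the outlinks first would keep the two passes independent.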
Re: parse-html plugin
Posted by Markus Jelsma <ma...@openindex.io>.
Yes, understanding the parser's internals is not very easy. Try adding log
lines so you can understand it better. You can use the ParserChecker to test.
On Tuesday 01 February 2011 15:25:20 a a wrote:
> Hi,
>
> Is my question that difficult?
> Does no one have an idea?
>
> Thanks,
>
> mehdi
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: parse-html plugin
Posted by Markus Jelsma <ma...@openindex.io>.
dumpText will just output the parsed data, or whatever HTML elements you
selected to be the parsed data, but I haven't tested this myself. The same
goes for configuration changes. The docblock tells us it'll just run it, but you
might want to check the parser settings in the configuration.
> Hi,
>
> Just wondering, what does dumpText mean in the ParserChecker?
>
> Along the same lines, in case I am writing a custom filter that extends
> HtmlParseFilter, do I have to make any configuration changes for Nutch?
>
> Thanks,
> Abi
>
> On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <ma...@openindex.io> wrote:
> > I'm not really sure, but I believe you must overwrite the already parsed
> > data yourself in your filter.
> >
> > On Tuesday 01 February 2011 18:54:32 a a wrote:
> > > Thanks for your reply :)
> > >
> > > So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going
> > > to overwrite the ParseResult variable of the original parse-html
> > > plugin?
> > >
> > > Won't it spend more time by extracting the HTML source code of each
> > > URL twice in order to parse it (the first time in the original
> > > parse-html plugin and the second time in my new plugin)?
> > >
> > > Thanks a lot,
> > >
> > > mehdi
> > >
> > > > From: markus.jelsma@openindex.io
> > > > To: user@nutch.apache.org
> > > > Subject: Re: parse-html plugin
> > > > Date: Tue, 1 Feb 2011 18:42:51 +0100
> > > > CC: mbellil@msn.com
> > > >
> > > > Oh, I forgot. You could extend
> > > > org.apache.nutch.parse.HtmlParseFilter. Then you can retrieve
> > > > whatever you need and store it in the ParseResult object.
Re: parse-html plugin
Posted by Markus Jelsma <ma...@openindex.io>.
If I'm not mistaken, it's not a plugin but an extension point. Maybe it doesn't
need configuration, only inclusion on the class path?
> I'm really confused :)
>
> So in my nutch-site.xml I have to call my new plugin after the parse-html
> one, like this:
>
> parse-(text|html|msword|pdf|MY_NEW_HtmlParseFilter_PLUGIN)
>
> What about parse-text? Does it also have a ParseResult object, like
> parse-html? Which one is used?
>
> Thanks,
>
> mehdi
RE: parse-html plugin
Posted by a a <mb...@msn.com>.
I'm really confused :)
So in my nutch-site.xml I have to call my new plugin after the parse-html one, like this:
parse-(text|html|msword|pdf|MY_NEW_HtmlParseFilter_PLUGIN)
What about parse-text? Does it also have a ParseResult object, like parse-html? Which one is used?
Thanks,
mehdi
> Date: Wed, 2 Feb 2011 13:31:40 +0800
> Subject: Re: parse-html plugin
> From: ab1sh3k@gmail.com
> To: user@nutch.apache.org
>
> Hi,
>
> I am not sure my guess is right, so hopefully someone will correct me if
> I am wrong; I am just a beginner.
>
> I believe you would be implementing your own HtmlParseFilter as a plug-in,
> in which case the order in which the plug-ins are executed has an impact.
> I see some handling of ordered filters in the HtmlParseFilters class. If my
> assumption is correct, you may want to order them as per your requirements.
>
> However, I am not really sure what determines the order, or whether it will
> take double (or more) time for phase-by-phase filtering. I am also looking
> for an answer to this :)
>
> Thanks,
> Abi
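
On the nutch-site.xml question: as far as I know, the plugin.includes value is a regular expression that is matched against whole plugin ids, so a custom filter plugin is enabled by adding its id as one more alternative, and the position of that alternative inside the expression does not change which ids match. The small sketch below only demonstrates that matching behaviour with java.util.regex. The id my-htmlfilter and the helper isIncluded are hypothetical, and the exact handling should be checked against the source of the Nutch version in use.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {

  /** Returns true when the whole plugin id matches the includes expression. */
  static boolean isIncluded(String includesRegex, String pluginId) {
    return Pattern.compile(includesRegex).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // Hypothetical plugin.includes value with one extra alternative appended.
    String includes = "protocol-http|parse-(text|html)|my-htmlfilter";

    System.out.println(isIncluded(includes, "parse-html"));           // prints true
    System.out.println(isIncluded(includes, "my-htmlfilter"));        // prints true
    System.out.println(isIncluded(includes, "parse-my-htmlfilter"));  // prints false
  }
}
```

Placing the new id inside the parse-(...) group would only work if the plugin id itself starts with "parse-".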
Re: parse-html plugin
Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi,
I am not sure my guess is right, so hopefully someone will correct me if I am
wrong; I am just a beginner.
I believe you would be implementing your own HtmlParseFilter as a plug-in, in
which case the order in which the plug-ins are executed has an impact. I see
some handling of ordered filters in the HtmlParseFilters class. If my
assumption is correct, you may want to order them as per your requirements.
However, I am not really sure what determines the order, or whether it will
take double (or more) time for phase-by-phase filtering. I am also looking
for an answer to this :)
Thanks,
Abi
On Wed, Feb 2, 2011 at 11:28 AM, a a <mb...@msn.com> wrote:
>
> I want to know if someone has done this before; maybe they could tell us
> whether it takes more time (double the time) to use another HtmlParseFilter
> to overwrite the original ParseResult object produced by the parse-html
> plugin.
>
> Thanks,
>
>
> mehdi
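
On the "double time" question, here is a minimal sketch of a parser followed by an ordered chain of filters, under the assumption that the filters are handed the structure the parser already built instead of parsing the page again. I believe this is how the parse-html plugin passes its DOM to the filters, but that should be verified in the source. Everything here (FilterChainSketch, parseOnce, runChain) is hypothetical and only illustrates the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

public class FilterChainSketch {

  /** Counts how often the (expensive) parse step runs. */
  static int parseCount = 0;

  /** Hypothetical stand-in for HTML parsing: splits the page into tokens. */
  static List<String> parseOnce(String page) {
    parseCount++;
    return Arrays.asList(page.split("\\s+"));
  }

  /**
   * Runs the filters in list order. Each filter receives the tokens parsed
   * earlier plus the text produced so far, and returns the new text.
   */
  static String runChain(String page,
      List<BiFunction<List<String>, String, String>> orderedFilters) {
    List<String> tokens = parseOnce(page);     // parsed a single time
    String text = String.join(" ", tokens);    // what the parser alone would index
    for (BiFunction<List<String>, String, String> filter : orderedFilters) {
      text = filter.apply(tokens, text);       // reuses the tokens, no second parse
    }
    return text;
  }

  public static void main(String[] args) {
    List<BiFunction<List<String>, String, String>> filters = new ArrayList<>();
    filters.add((tokens, text) -> tokens.get(0));       // overwrites the text
    filters.add((tokens, text) -> text.toUpperCase());  // sees the first filter's output

    System.out.println(runChain("main article text", filters));  // prints MAIN
    System.out.println(parseCount);                               // prints 1
  }
}
```

In this arrangement the extra cost of a second filter is its own traversal of the already parsed data, not a second fetch or parse, and swapping the two filters changes the result, which is why the order matters.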
RE: parse-html plugin
Posted by a a <mb...@msn.com>.
I want to know if someone has done this before; maybe they could tell us whether it takes more time (double the time) to use another HtmlParseFilter to overwrite the original ParseResult object produced by the parse-html plugin.
Thanks,
mehdi
> From: markus.jelsma@openindex.io
> To: user@nutch.apache.org
> Subject: Re: parse-html plugin
> Date: Wed, 2 Feb 2011 02:46:47 +0100
> CC: ab1sh3k@gmail.com
>
> Oh well, please come back with your experience and results on this issue in
> this thread. More users will benefit =)
Re: parse-html plugin
Posted by Markus Jelsma <ma...@openindex.io>.
Oh well, please come back with your experience and results on this issue in
this thread. More users will benefit =)
> I am sorry, forgive my ignorance. I got the answer for it :) Thanks for
> your time
>
> On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <ab...@gmail.com> wrote:
> > Hi,
> >
> > Just wondering what does the dumpText mean in the ParseChecker?
> >
> > On the same grounds, incase I am writing a custom filter that extends
> > the
> >
> > HtmlParseFilter..do I have to make any configuration changes for nutch?
> >
> > Thanks,
> > Abi
> >
> > On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <ma...@openindex.io> wrote:
> >> I'm not really sure but i believe you must overwrite the already parsed
> >> data
> >> yourself in your filter.
> >>
> >> On Tuesday 01 February 2011 18:54:32 a a wrote:
> >> > Thx for your reply :)
> >> >
> >> > so if i extend the org.apache.nutch.parse.HtmlParsefilter is it going
> >> > to overwrite to ParseResult varaible of the original plugin
> >> > parser-html ?
> >> >
> >> > is it not going to spend more time doing twice the operation of
> >>
> >> extracting
> >>
> >> > the html source code of each url to parse it (first time the original
> >> > parse-html plugin and the seconde time my new plugin ) ??
> >> >
> >> > thx a lot
> >> >
> >> > mehdi
> >> >
> >> > > From: markus.jelsma@openindex.io
> >> > > To: user@nutch.apache.org
> >> > > Subject: Re: parse-html plugin
> >> > > Date: Tue, 1 Feb 2011 18:42:51 +0100
> >> > > CC: mbellil@msn.com
> >> > >
> >> > > Oh, i forgot. You could extend
> >> > > org.apache.nutch.parse.HtmlParsefilter. Then you can retrieve
> >> > > whatever you need and store it in the
> >>
> >> ParseResult
> >>
> >> > > object.
> >> > >
> >> > > On Tuesday 01 February 2011 15:25:20 a a wrote:
> >> > > > hi,
> >> > > >
> >> > > > is my question so difficult ?
> >> > > > no one have an idea ?
> >> > > >
> >> > > > thx
> >> > > >
> >> > > >
> >> > > > mehdi
> >> > > >
> >> > > > > From: mbellil@msn.com
> >> > > > > To: user@nutch.apache.org
> >> > > > > Subject: RE: parse-html plugin
> >> > > > > Date: Mon, 31 Jan 2011 16:05:22 +0000
> >> > > > >
> >> > > > >
> >> > > > > Hi All,
> >> > > > >
> >> > > > > any idea ?
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > mehdi
> >> > > > >
> >> > > > > > From: mbellil@msn.com
> >> > > > > > To: user@nutch.apache.org
> >> > > > > > Subject: parse-html plugin
> >> > > > > > Date: Thu, 27 Jan 2011 18:58:36 +0000
> >> > > > > >
> >> > > > > >
> >> > > > > > hi,
> >> > > > > > In the class HtmlParser I changed the 'text' variable to index
> >>
> >> only
> >>
> >> > > > > > a part of my html page, and since i did lost lot off outlinks
> >> > > > > > !
> >> > > > > >
> >> > > > > > ...
> >> > > > > >
> >> > > > > > utils.getText(sb,extractIndexableContent(root)); //added on
> >> > > > > > 26-01-2011 to extract only text inside <col_centre>
> >> > > > > >
> >> > > > > > // utils.getText(sb, root); // extract text ---
> >> > > > > > disabled on 26-01-2011-
> >> > > > > >
> >> > > > > > text = sb.toString();
> >> > > > > >
> >> > > > > > ...
> >> > > > > >
> >> > > > > > i beleived that outlinks are not obtained from the text
> >> > > > > > variable
> >>
> >> ?!
> >>
> >> > > > > > in the same class we could see how outlinks are extracted !
> >> > > > > >
> >> > > > > > ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract
> >> > > > > > outlinks
> >> > > > > >
> >> > > > > > URL baseTag = utils.getBase(root);
> >> > > > > > if (LOG.isTraceEnabled()) { LOG.trace("Getting
> >> > > > > > links...");
> >>
> >> }
> >>
> >> > > > > > utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> >> > > > > > outlinks = l.toArray(new Outlink[l.size()]);
> >> > > > > >
> >> > > > > > can you plz tell me what i did wrong.
> >> > > > > >
> >> > > > > >
> >> > > > > > mehdi
> >>
> >> --
> >> Markus Jelsma - CTO - Openindex
> >> http://www.linkedin.com/in/markus17
> >> 050-8536620 / 06-50258350
Re: parse-html plugin
Posted by ".: Abhishek :." <ab...@gmail.com>.
I am sorry, forgive my ignorance. I got the answer for it :) Thanks for your
time
On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <ab...@gmail.com> wrote:
> Hi,
>
> Just wondering what does the dumpText mean in the ParseChecker?
>
> On the same grounds, incase I am writing a custom filter that extends the
> HtmlParseFilter..do I have to make any configuration changes for nutch?
>
> Thanks,
> Abi
Re: parse-html plugin
Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi,
Just wondering, what does dumpText mean in the ParserChecker?
On the same note, in case I am writing a custom filter that implements
HtmlParseFilter, do I have to make any configuration changes for Nutch?
Thanks,
Abi
On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <ma...@openindex.io> wrote:
> I'm not really sure but i believe you must overwrite the already parsed
> data yourself in your filter.
Re: parse-html plugin
Posted by Markus Jelsma <ma...@openindex.io>.
I'm not really sure, but I believe you must overwrite the already parsed data
yourself in your filter.
On Tuesday 01 February 2011 18:54:32 a a wrote:
> Thx for your reply :)
>
> so if i extend the org.apache.nutch.parse.HtmlParsefilter is it going to
> overwrite to ParseResult varaible of the original plugin parser-html ?
>
> is it not going to spend more time doing twice the operation of extracting
> the html source code of each url to parse it (first time the original
> parse-html plugin and the seconde time my new plugin ) ??
>
> thx a lot
>
> mehdi
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
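The reply above says a filter has to overwrite the already parsed data itself. The sketch below shows that idea only in outline. None of the types are Nutch classes: SimpleParse, overwriteText and the map keyed by URL are hypothetical stand-ins for whatever the real ParseResult holds, chosen so the example runs with the JDK alone. It assumes the filter is handed the result the HTML parser already built and replaces one entry, keeping the outlinks that were extracted earlier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OverwriteParseSketch {

  /** Hypothetical stand-in for one parsed page: extracted text plus outlinks. */
  static final class SimpleParse {
    final String text;
    final List<String> outlinks;

    SimpleParse(String text, List<String> outlinks) {
      this.text = text;
      this.outlinks = outlinks;
    }
  }

  /**
   * Stand-in for the filter step: replaces the stored text for url
   * and carries the previously extracted outlinks over unchanged.
   */
  static void overwriteText(Map<String, SimpleParse> result, String url, String newText) {
    SimpleParse old = result.get(url);
    if (old == null) {
      return; // nothing was parsed for this URL, so there is nothing to overwrite
    }
    result.put(url, new SimpleParse(newText, old.outlinks));
  }

  public static void main(String[] args) {
    Map<String, SimpleParse> result = new HashMap<>();
    String url = "http://example.com/";
    result.put(url, new SimpleParse("menu main text footer",
        Arrays.asList("http://example.com/a", "http://example.com/b")));

    // The filter narrows the text but leaves the outlinks alone.
    overwriteText(result, url, "main text");

    System.out.println(result.get(url).text);
    System.out.println(result.get(url).outlinks);
  }
}
```

The point of the design is that the text and the outlinks are replaced independently, so narrowing the indexed text does not have to touch the link list.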
RE: parse-html plugin
Posted by a a <mb...@msn.com>.
Thx for your reply :)
So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going to overwrite the ParseResult variable of the original parse-html plugin?
Won't it take more time, since the HTML source of each URL would be extracted and parsed twice (first by the original parse-html plugin and a second time by my new plugin)?
thx a lot
mehdi
> From: markus.jelsma@openindex.io
> To: user@nutch.apache.org
> Subject: Re: parse-html plugin
> Date: Tue, 1 Feb 2011 18:42:51 +0100
> CC: mbellil@msn.com
>
> Oh, i forgot. You could extend org.apache.nutch.parse.HtmlParsefilter. Then you
> can retrieve whatever you need and store it in the ParseResult object.
Re: parse-html plugin
Posted by Markus Jelsma <ma...@openindex.io>.
Oh, I forgot. You could implement org.apache.nutch.parse.HtmlParseFilter. Then
you can retrieve whatever you need and store it in the ParseResult object.
On Tuesday 01 February 2011 15:25:20 a a wrote:
> hi,
>
> is my question so difficult ?
> no one have an idea ?
>
> thx
>
>
> mehdi
>
> > From: mbellil@msn.com
> > To: user@nutch.apache.org
> > Subject: RE: parse-html plugin
> > Date: Mon, 31 Jan 2011 16:05:22 +0000
> >
> >
> > Hi All,
> >
> > any idea ?
> >
> >
> >
> > mehdi
> >
> > > From: mbellil@msn.com
> > > To: user@nutch.apache.org
> > > Subject: parse-html plugin
> > > Date: Thu, 27 Jan 2011 18:58:36 +0000
> > >
> > >
> > > hi,
> > > In the class HtmlParser I changed the 'text' variable to index only a
> > > part of my html page, and since i did lost lot off outlinks !
> > >
> > > ...
> > >
> > > utils.getText(sb,extractIndexableContent(root)); //added on
> > > 26-01-2011 to extract only text inside <col_centre>
> > >
> > > // utils.getText(sb, root); // extract text --- disabled
> > > on 26-01-2011-
> > >
> > > text = sb.toString();
> > >
> > > ...
> > >
> > > i beleived that outlinks are not obtained from the text variable ?! in
> > > the same class we could see how outlinks are extracted !
> > >
> > >
> > > ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
> > >
> > > URL baseTag = utils.getBase(root);
> > > if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> > > utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> > > outlinks = l.toArray(new Outlink[l.size()]);
> > >
> > > can you plz tell me what i did wrong.
> > >
> > >
> > > mehdi
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
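The original question takes the indexed text from one part of the page while the outlinks are still collected from the whole root. A minimal sketch of that split, using only the JDK's org.w3c.dom classes rather than Nutch's utilities, might look like the following. It assumes col_centre is an element name and that the page is well-formed XHTML; findFirst, collectLinks and parse are hypothetical helpers, not Nutch methods. The subtree is only read, never detached or modified, so the link pass over the root still sees every anchor. Whether the lost outlinks in the thread come from a helper that alters the tree is not confirmed anywhere in the discussion.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SubtreeTextSketch {

  /** Depth-first search for the first element with the given tag name; null if absent. */
  static Node findFirst(Node node, String tagName) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && tagName.equalsIgnoreCase(node.getNodeName())) {
      return node;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node hit = findFirst(children.item(i), tagName);
      if (hit != null) {
        return hit;
      }
    }
    return null;
  }

  /** Appends the href of every <a> element at or below node, in document order. */
  static void collectLinks(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      String href = ((Element) node).getAttribute("href");
      if (!href.isEmpty()) {
        out.add(href);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectLinks(children.item(i), out);
    }
  }

  /** Parses a well-formed XHTML string; checked exceptions are wrapped for brevity. */
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed XHTML", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<div><a href=\"http://example.com/nav\">nav</a></div>"
        + "<col_centre>main text <a href=\"http://example.com/in\">in</a></col_centre>"
        + "</body></html>";
    Node root = parse(page).getDocumentElement();

    // Text comes from the subtree only; the tree itself is left untouched.
    Node centre = findFirst(root, "col_centre");
    String text = centre == null ? "" : centre.getTextContent();

    // Links come from the whole document, including anchors outside the subtree.
    List<String> links = new ArrayList<>();
    collectLinks(root, links);

    System.out.println(text);
    System.out.println(links);
  }
}
```

Logging the size of the link list before and after the text extraction step, as suggested earlier in the thread, is a cheap way to check whether that step changes what the link pass sees.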