Posted to user@nutch.apache.org by a a <mb...@msn.com> on 2011/02/01 15:25:20 UTC

RE: parse-html plugin

Hi,

Is my question really so difficult?
Does no one have an idea?

Thanks,


mehdi




> From: mbellil@msn.com
> To: user@nutch.apache.org
> Subject: RE: parse-html plugin
> Date: Mon, 31 Jan 2011 16:05:22 +0000
> 
> 
> Hi All,
> 
> Any ideas?
> 
> 
> 
> mehdi
> 
> 
> 
> 
> > From: mbellil@msn.com
> > To: user@nutch.apache.org
> > Subject: parse-html plugin
> > Date: Thu, 27 Jan 2011 18:58:36 +0000
> > 
> > 
> > Hi,
> > In the HtmlParser class I changed how the 'text' variable is filled so that only part of my HTML page gets indexed, and since then I have lost a lot of outlinks!
> > 
> > ...
> >  utils.getText(sb,extractIndexableContent(root));  //added on 26-01-2011 to extract only text inside <col_centre>
> >   // utils.getText(sb, root);          // extract text   --- disabled on 26-01-2011-
> > 
> >       text = sb.toString();
> > ...
> > 
> > I believed that outlinks are not obtained from the text variable?! In the same class we can see how the outlinks are extracted:
> > 
> > 
> > ArrayList<Outlink> l = new ArrayList<Outlink>();   // extract outlinks
> >       URL baseTag = utils.getBase(root);
> >       if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> >       utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> >       outlinks = l.toArray(new Outlink[l.size()]);
> > 
> > 
> > 
> > Can you please tell me what I did wrong?
> > 
> > 
> > mehdi
> > 
> > 
> >  		 	   		  
>

Re: parse-html plugin

Posted by Markus Jelsma <ma...@openindex.io>.

Yes, understanding the parser's internals is not very easy. Try adding log 
lines so you can understand it better. You can use the ParserChecker to test.


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
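
The thread never settles why the outlinks went missing. One thing worth checking with the log lines suggested above is whether the custom extractIndexableContent(root) changes the DOM it is given, because utils.getOutlinks(...) later walks that same root. The sketch below uses only the JDK's DOM classes, not Nutch's; the sample page, the element ids and every method name in it are made up for illustration. It shows that moving a subtree out of a document shrinks the set of links a later walk can find, while working on a deep clone leaves the document intact.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class OutlinkLossDemo {

  // Hypothetical page: two menu links plus one link inside the centre column.
  static final String PAGE =
      "<html><body>"
      + "<div id=\"menu\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>"
      + "<div id=\"col_centre\"><p>Main text <a href=\"/c\">C</a></p></div>"
      + "</body></html>";

  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Depth-first search for the first element whose id attribute equals the given id.
  static Element findById(Node node, String id) {
    if (node instanceof Element && id.equals(((Element) node).getAttribute("id"))) {
      return (Element) node;
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      Element hit = findById(c, id);
      if (hit != null) return hit;
    }
    return null;
  }

  // Collects the href of every <a> element under the node, in document order.
  static void collectHrefs(Node node, List<String> out) {
    if (node instanceof Element) {
      Element e = (Element) node;
      if ("a".equals(e.getTagName()) && e.hasAttribute("href")) {
        out.add(e.getAttribute("href"));
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectHrefs(c, out);
    }
  }

  // Variant 1: appendChild MOVES the subtree, so the document loses it.
  static DocumentFragment extractByMoving(Document doc) {
    DocumentFragment fragment = doc.createDocumentFragment();
    fragment.appendChild(findById(doc, "col_centre"));
    return fragment;
  }

  // Variant 2: a deep clone is appended, so the document stays untouched.
  static DocumentFragment extractByCloning(Document doc) {
    DocumentFragment fragment = doc.createDocumentFragment();
    fragment.appendChild(findById(doc, "col_centre").cloneNode(true));
    return fragment;
  }

  // Links still reachable from the document after the moving extraction.
  static List<String> linksAfterMoving() {
    Document doc = parse(PAGE);
    extractByMoving(doc);
    List<String> links = new ArrayList<>();
    collectHrefs(doc, links);
    return links;
  }

  // Links still reachable from the document after the cloning extraction.
  static List<String> linksAfterCloning() {
    Document doc = parse(PAGE);
    extractByCloning(doc);
    List<String> links = new ArrayList<>();
    collectHrefs(doc, links);
    return links;
  }

  public static void main(String[] args) {
    System.out.println("links left after moving:  " + linksAfterMoving());
    System.out.println("links left after cloning: " + linksAfterCloning());
  }
}
```

If logging the number of anchors under root before and after the getText(...) call shows a drop, the extraction helper is the likely culprit; if the count is unchanged, the cause lies elsewhere.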

Re: parse-html plugin

Posted by Markus Jelsma <ma...@openindex.io>.

dumpText will just output the parsed data, or whatever HTML elements you 
selected to be the parsed data, but I haven't tested this myself. The same 
goes for configuration changes. The docblock tells us it'll just run it, but you 
might want to check the parser settings in the configuration.

> Hi,
> 
>  Just wondering what dumpText means in the ParserChecker?
> 
>  On the same grounds, in case I am writing a custom filter that extends
> HtmlParseFilter, do I have to make any configuration changes for Nutch?
> 
> Thanks,
> Abi
> 
> On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <ma...@openindex.io> wrote:
> > I'm not really sure but I believe you must overwrite the already parsed
> > data yourself in your filter.
> > 
> > On Tuesday 01 February 2011 18:54:32 a a wrote:
> > > Thanks for your reply :)
> > > 
> > > So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going
> > > to overwrite the ParseResult variable of the original parse-html
> > > plugin?
> > > 
> > > Isn't it going to take more time, since the HTML source of each URL
> > > would be extracted and parsed twice (first by the original
> > > parse-html plugin and a second time by my new plugin)?
> > > 
> > > Thanks a lot,
> > > 
> > > mehdi
> > > 
> > > > From: markus.jelsma@openindex.io
> > > > To: user@nutch.apache.org
> > > > Subject: Re: parse-html plugin
> > > > Date: Tue, 1 Feb 2011 18:42:51 +0100
> > > > CC: mbellil@msn.com
> > > > 
> > > > Oh, I forgot. You could extend
> > > > org.apache.nutch.parse.HtmlParseFilter. Then you can retrieve
> > > > whatever you need and store it in the ParseResult object.
> > > > 

Re: parse-html plugin

Posted by Markus Jelsma <ma...@openindex.io>.

If I'm not mistaken, it's not a plugin but an extension point. Maybe it doesn't 
need configuration but only inclusion on the class path?

RE: parse-html plugin

Posted by a a <mb...@msn.com>.

I'm really confused :)

So in my nutch-site.xml I have to list my new plugin after the parse-html one, like this?

parse-(text|html|msword|pdf|MY_NEW_HtmlParseFilter_PLUGIN)

What about parse-text? Does it also have a ParseResult object like parse-html? Which one is used?

Thanks,


mehdi
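
On the plugin.includes question: as far as I understand it, plugin.includes is a regular expression that Nutch matches against plugin ids, so nesting a new id inside parse-(...) would only activate a plugin whose id really starts with "parse-". The sketch below is plain java.util.regex, not Nutch code; the id my-html-filter and the helper isIncluded are hypothetical, and matching the whole id against the expression is an assumption about how Nutch applies it.

```java
import java.util.regex.Pattern;

final class PluginIncludesDemo {

  // True when the whole plugin id is matched by the plugin.includes-style regex.
  static boolean isIncluded(String includesRegex, String pluginId) {
    return Pattern.compile(includesRegex).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // The new id is nested inside the parse-(...) group, as in the message above.
    String nested = "protocol-http|parse-(text|html|my-html-filter)";
    // The new id is added as its own top-level alternative.
    String topLevel = "protocol-http|parse-(text|html)|my-html-filter";

    System.out.println(isIncluded(nested, "my-html-filter"));
    System.out.println(isIncluded(nested, "parse-my-html-filter"));
    System.out.println(isIncluded(topLevel, "my-html-filter"));
  }
}
```

Under these assumptions the position of the id inside the expression does not control execution order; it only decides whether the plugin is activated at all.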





Re: parse-html plugin

Posted by ".: Abhishek :." <ab...@gmail.com>.

Hi,

I am not sure my guess is right, so hopefully someone will correct me if I
am wrong; I am just a beginner.

I believe you would be implementing your own HtmlParseFilter as a plug-in,
in which case the order in which the plug-ins are executed has an
impact. I see some implementation of ordered filters in the HtmlParseFilters
class. If my assumption is correct, you may want to order the filters as per
your requirements.

However, I am not really sure what determines the order, or whether it will
take double (or more) time for phase-by-phase filtering. I am also looking
for an answer to this :)

Thanks,
Abi
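
If I read the HtmlParseFilters class correctly, the order comes from a whitespace-separated list of filter class names held in a configuration property (htmlparsefilter.order in the versions I have seen). The sketch below only illustrates that idea with plain JDK types: the filters are simple string functions, the example.* names are invented, and skipping filters that are not named in the list is an assumption rather than confirmed Nutch behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class OrderedFiltersDemo {

  // Returns the filters named in 'order', in that order; with an empty order,
  // returns all available filters in the map's own iteration order.
  static List<UnaryOperator<String>> orderFilters(
      Map<String, UnaryOperator<String>> available, String order) {
    if (order == null || order.trim().isEmpty()) {
      return new ArrayList<>(available.values());
    }
    List<UnaryOperator<String>> ordered = new ArrayList<>();
    for (String name : order.trim().split("\\s+")) {
      UnaryOperator<String> filter = available.get(name);
      if (filter != null) {
        ordered.add(filter); // unknown names are silently ignored
      }
    }
    return ordered;
  }

  // Applies two toy filters to the text "x" using the given order string.
  static String run(String order) {
    Map<String, UnaryOperator<String>> available = new LinkedHashMap<>();
    available.put("example.AppendA", s -> s + "A");
    available.put("example.AppendB", s -> s + "B");
    String text = "x";
    for (UnaryOperator<String> filter : orderFilters(available, order)) {
      text = filter.apply(text);
    }
    return text;
  }

  public static void main(String[] args) {
    System.out.println(run(""));
    System.out.println(run("example.AppendB example.AppendA"));
  }
}
```

Because each filter sees what the previous one left behind, a filter that overwrites the parsed text should usually run after any filter whose output it is meant to replace.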



RE: parse-html plugin

Posted by a a <mb...@msn.com>.

I would like to know whether someone has done this before; maybe they could tell us whether it takes more time (double the time) to use another HtmlParseFilter to overwrite the original ParseResult object produced by the parse-html plugin.

Thanks,


mehdi
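
On the worry about double work: as far as I know, the HTML parser builds the DOM once and passes that same DOM, together with the current ParseResult, to each HtmlParseFilter, so a filter does not need to fetch or re-parse the page. The sketch below models only that flow. Dom, Result, Filter and the [centre] markers are hypothetical stand-ins, not Nutch's types or API.

```java
import java.util.Arrays;
import java.util.List;

final class ParseOnceDemo {

  /** Stand-in for the parsed DOM; hypothetical, not a Nutch class. */
  static final class Dom {
    final String body;
    Dom(String body) { this.body = body; }
  }

  /** Stand-in for the parse result that a filter may overwrite. */
  static final class Result {
    final String text;
    Result(String text) { this.text = text; }
  }

  /** Simplified filter shape: sees the current result and the shared DOM. */
  interface Filter {
    Result filter(Result current, Dom dom);
  }

  // Counts how often the (pretend) expensive parse step runs.
  static int parseCount = 0;

  static Dom parse(String raw) {
    parseCount++;
    return new Dom(raw);
  }

  // Overwrites the text with whatever sits between the two markers.
  static final Filter CENTRE_ONLY = (current, dom) -> {
    int start = dom.body.indexOf("[centre]");
    int end = dom.body.indexOf("[/centre]");
    if (start < 0 || end < start) {
      return current; // markers missing: keep the existing result
    }
    return new Result(dom.body.substring(start + "[centre]".length(), end));
  };

  // Parses once, then lets every filter work on the same Dom instance.
  static Result run(String raw, List<Filter> filters) {
    Dom dom = parse(raw);
    Result result = new Result(dom.body);
    for (Filter filter : filters) {
      result = filter.filter(result, dom);
    }
    return result;
  }

  public static void main(String[] args) {
    Result r = run("menu [centre]main text[/centre] footer", Arrays.asList(CENTRE_ONLY));
    System.out.println(r.text + " (parsed " + parseCount + " time(s))");
  }
}
```

In this model the extra cost of a filter is one more walk over an already parsed tree, which is usually far cheaper than fetching and parsing the page a second time.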





Re: parse-html plugin

Posted by Markus Jelsma <ma...@openindex.io>.

Oh well, please come back with your experience and results on this issue in 
this thread. More users will benefit =)

> I am sorry, forgive my ignorance. I got the answer for it :) Thanks for
> your time

Re: parse-html plugin

Posted by ".: Abhishek :." <ab...@gmail.com>.

I am sorry, forgive my ignorance. I found the answer to it :) Thanks for your
time.

On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <ab...@gmail.com> wrote:

> Hi,
>
>  Just wondering what does the dumpText mean in the ParseChecker?
>
>  On the same grounds, incase I am writing a custom filter that extends the
> HtmlParseFilter..do I have to make any configuration changes for nutch?
>
> Thanks,
> Abi

Re: parse-html plugin

Posted by ".: Abhishek :." <ab...@gmail.com>.

Hi,

Just wondering, what does the dumpText option mean in the ParserChecker?

On the same note, in case I am writing a custom filter that extends the
HtmlParseFilter, do I have to make any configuration changes for Nutch?

Thanks,
Abi


On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <ma...@openindex.io> wrote:

> I'm not really sure but i believe you must overwrite the already parsed data
> yourself in your filter.

Re: parse-html plugin

Posted by Markus Jelsma <ma...@openindex.io>.

I'm not really sure, but I believe you must overwrite the already parsed data
yourself in your filter.
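
A minimal sketch of what "overwrite it yourself" could look like, using made-up
stand-in types rather than the real Nutch classes: ParsedPage, PageFilter,
CentreOnlyFilter and the [centre] markers are all hypothetical. The point is
only that the filter returns a new result whose text is replaced while the
outlinks are carried over unchanged.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the parsed data of one page (not a Nutch class).
final class ParsedPage {
  final String text;
  final List<String> outlinks;

  ParsedPage(String text, List<String> outlinks) {
    this.text = text;
    this.outlinks = Collections.unmodifiableList(outlinks);
  }
}

// Hypothetical stand-in for a parse filter: it receives the current result
// and returns the result that the next filter (or the indexer) will see.
interface PageFilter {
  ParsedPage filter(ParsedPage current);
}

// Replaces the text with the part between two markers and keeps the outlinks.
final class CentreOnlyFilter implements PageFilter {
  private final String start;
  private final String end;

  CentreOnlyFilter(String start, String end) {
    this.start = start;
    this.end = end;
  }

  @Override
  public ParsedPage filter(ParsedPage current) {
    int from = current.text.indexOf(start);
    int to = from < 0 ? -1 : current.text.indexOf(end, from + start.length());
    if (to < 0) {
      return current; // markers not found: leave the parsed data as it was
    }
    String narrowed = current.text.substring(from + start.length(), to).trim();
    // Only the text is overwritten; the outlinks are passed through as-is.
    return new ParsedPage(narrowed, current.outlinks);
  }
}

final class FilterChainSketch {
  // Runs the filters in order; each one sees what the previous one returned.
  static ParsedPage run(ParsedPage initial, List<PageFilter> filters) {
    ParsedPage page = initial;
    for (PageFilter f : filters) {
      page = f.filter(page);
    }
    return page;
  }
}
```

Nothing here changes the result unless a filter explicitly returns a new one,
which is the behaviour described above: whatever the earlier parser produced
stays in place until the filter replaces it.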

On Tuesday 01 February 2011 18:54:32 a a wrote:
> Thx for your reply :)
> 
> so if i extend the org.apache.nutch.parse.HtmlParsefilter is it going to
> overwrite to ParseResult  varaible of the original plugin parser-html ?
> 
> is it not going to spend more time doing twice the operation of extracting
> the html source code of each url to parse it  (first time the original
> parse-html plugin and the seconde time my new plugin ) ??
> 
> thx a lot
> 
> mehdi

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

RE: parse-html plugin

Posted by a a <mb...@msn.com>.

Thx for your reply :)

So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going to overwrite the ParseResult variable of the original parse-html plugin?

Is it not going to spend more time, doing the work of extracting and parsing the HTML source of each URL twice (the first time in the original parse-html plugin and the second time in my new plugin)?

thx a lot

mehdi




> From: markus.jelsma@openindex.io
> To: user@nutch.apache.org
> Subject: Re: parse-html plugin
> Date: Tue, 1 Feb 2011 18:42:51 +0100
> CC: mbellil@msn.com
> 
> Oh, i forgot. You could extend org.apache.nutch.parse.HtmlParsefilter. Then you 
> can retrieve whatever you need and store it in the ParseResult object.

Re: parse-html plugin

Posted by Markus Jelsma <ma...@openindex.io>.

Oh, I forgot. You could extend org.apache.nutch.parse.HtmlParseFilter. Then you
can retrieve whatever you need and store it in the ParseResult object.
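
As a rough illustration of the idea, using plain JDK DOM classes only and no
Nutch classes: a filter is handed the DOM tree that parse-html has already
built, so it can walk that tree instead of parsing the page a second time. The
sketch below pulls the text from one element while collecting links from the
whole tree. SubtreeTextSketch and its methods are made up for this example; in
a real plugin the tree would come from the DocumentFragment argument of
HtmlParseFilter.filter(), and col_centre is simply the tag name from the
original question.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: it only illustrates the kind of
// read-only DOM walking a parse filter could do on the tree it is given.
final class SubtreeTextSketch {

  // Returns the first element with the given tag name (case-insensitive),
  // searching depth-first from 'node', or null if there is none.
  static Node findFirst(Node node, String tagName) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && node.getNodeName().equalsIgnoreCase(tagName)) {
      return node;
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      Node hit = findFirst(c, tagName);
      if (hit != null) {
        return hit;
      }
    }
    return null;
  }

  // Appends the trimmed text nodes below 'node', separated by single spaces.
  // A null node (element not found) contributes nothing.
  static void collectText(Node node, StringBuilder sb) {
    if (node == null) {
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String t = node.getNodeValue().trim();
      if (!t.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(t);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectText(c, sb);
    }
  }

  // Collects the href values of all <a> elements at or below 'node'.
  static void collectLinks(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && node.getNodeName().equalsIgnoreCase("a")) {
      String href = ((Element) node).getAttribute("href");
      if (!href.isEmpty()) {
        out.add(href);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectLinks(c, out);
    }
  }

  // Test helper: builds a DOM from well-formed markup with the JDK parser.
  static Node parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed: " + xml, e);
    }
  }
}
```

Calling collectText() on the node returned by findFirst(root, "col_centre") and
collectLinks() on root gives the narrowed text together with every link on the
page, because neither walk modifies the tree. Whether the lost outlinks in the
original question come from a helper changing root before the link extraction
runs is only a guess; this thread does not settle it.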

On Tuesday 01 February 2011 15:25:20 a a wrote:
> hi,
> 
> is my question so difficult ?
> no one have an idea ?
> 
> thx
> 
> 
> mehdi

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350