Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2013/01/29 13:16:03 UTC

Outlinks in parse filter

Hi,

Outlinks that reach the parse filters via ParseData are not normalized or filtered, but I believe they should be. If you try to do something sensible with the outlinks in a parse filter, you cannot rely on their accuracy. Should we not move the calls to ParseOutputFormat.filterNormalize into the parse plugin?

Any thoughts?
Markus
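
To make the proposal concrete, here is a minimal sketch of normalizing and then filtering outlinks before a parse filter sees them. OutlinkCleaner and its two function fields are hypothetical stand-ins, not Nutch's actual classes; the normalize-then-filter order and the "null means rejected" convention are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Nutch: cleans outlinks inside a parse plugin.
final class OutlinkCleaner {
    // Returns the normalized form of a URL (assumed: may return null to drop it).
    private final UnaryOperator<String> normalizer;
    // Returns the URL if accepted, or null if rejected (assumed convention).
    private final UnaryOperator<String> filter;

    OutlinkCleaner(UnaryOperator<String> normalizer, UnaryOperator<String> filter) {
        this.normalizer = normalizer;
        this.filter = filter;
    }

    // Normalizes each outlink first, then filters the normalized form.
    List<String> clean(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String normalized = normalizer.apply(url);
            if (normalized == null) {
                continue; // the normalizer dropped the URL
            }
            String accepted = filter.apply(normalized);
            if (accepted != null) {
                kept.add(accepted);
            }
        }
        return kept;
    }
}
```

A parse plugin would call clean() on its extracted outlinks before building ParseData, so that every parse filter downstream sees the same cleaned list.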

Re: Outlinks in parse filter

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Markus,

> we should be fine, right?
Yes, even better: FeedParser only holds URLNormalizers and URLFilters objects, which themselves
get the references to the plugin instances via ObjectCache in their constructors.
By the way, that's also how the parse filter plugins are referenced,
e.g. TikaParser -> HtmlParseFilters -> ObjectCache.get(conf).getObject(MyCustomParseFilter).
That's efficient, but it makes thread safety a requirement ;-)
I also found Andrzej's post:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201204.mbox/%3C4F842D74.6030703@getopt.org%3E

Sebastian
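
The caching idea described above can be sketched roughly as follows. InstanceCache is a hypothetical stand-in written for illustration, not Nutch's ObjectCache; the getOrLoad method and the use of a plain Object as the configuration key are assumptions of this sketch.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical per-configuration cache, loosely modelled on the idea of ObjectCache.
final class InstanceCache {
    // One cache per configuration object; weak keys let unused configurations be collected.
    private static final Map<Object, InstanceCache> CACHES = new WeakHashMap<>();

    private final Map<String, Object> objects = new ConcurrentHashMap<>();

    private InstanceCache() {
    }

    // Returns the cache that belongs to the given configuration object.
    static synchronized InstanceCache get(Object conf) {
        return CACHES.computeIfAbsent(conf, c -> new InstanceCache());
    }

    // Loads the object on first access and returns the same instance afterwards.
    @SuppressWarnings("unchecked")
    <T> T getOrLoad(String key, Supplier<T> loader) {
        return (T) objects.computeIfAbsent(key, k -> loader.get());
    }
}
```

Because every thread that uses the same configuration receives the same instance, the expensive load happens once, but the cached filters and normalizers have to be safe for concurrent use.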

On 02/01/2013 03:41 PM, Markus Jelsma wrote:
> On second thought, if, as in the feed parser, the instance is kept in the class and only loaded in setConf(), we should be fine, right?


RE: Outlinks in parse filter

Posted by Markus Jelsma <ma...@openindex.io>.
On second thought, if, as in the feed parser, the instance is kept in the class and only loaded in setConf(), we should be fine, right?
 
 
-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Fri 01-Feb-2013 15:38
> To: dev@nutch.apache.org
> Subject: RE: Outlinks in parse filter
> 
> Hi Sebastian,
> 
> Alright. What about the performance penalty if we get new instances of the filters and normalizers for each parse? Right now each thread has its own instances. Some filters are too costly to load that frequently.
> 
> Thanks,
> Markus

RE: Outlinks in parse filter

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Sebastian,

Alright. What about the performance penalty if we get new instances of the filters and normalizers for each parse? Right now each thread has its own instances. Some filters are too costly to load that frequently.

Thanks,
Markus

 
 
-----Original message-----
> From:Sebastian Nagel <wa...@googlemail.com>
> Sent: Tue 29-Jan-2013 22:22
> To: dev@nutch.apache.org
> Subject: Re: Outlinks in parse filter
> 
> Hi Markus,
> 
> This would mean that urlfilter and urlnormalizer plugins are accessed from parse plugins.
> At first glance that sounds somewhat odd, but it's already the case for the feed parser.
> 
> We would have to do it for all parse plugins. Since there are not so many, that's no argument against it.
> 
> Provided you can still switch it off via the parse.(filter|normalize).urls properties, I see no
> serious reason why it can't be done.
> 
> Sebastian

Re: Outlinks in parse filter

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Markus,

This would mean that urlfilter and urlnormalizer plugins are accessed from parse plugins.
At first glance that sounds somewhat odd, but it's already the case for the feed parser.

We would have to do it for all parse plugins. Since there are not so many, that's no argument against it.

Provided you can still switch it off via the parse.(filter|normalize).urls properties, I see no
serious reason why it can't be done.

Sebastian
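
As an illustration of the switch mentioned above, a parse plugin could read the two properties once and skip the corresponding step when a flag is off. OutlinkSwitches is a hypothetical class, java.util.Properties stands in for the real configuration object, and the default of "true" for both properties is an assumption of this sketch.

```java
import java.util.Properties;

// Hypothetical holder for the two switches; not part of Nutch.
final class OutlinkSwitches {
    final boolean filter;
    final boolean normalize;

    OutlinkSwitches(Properties conf) {
        // Assumed default: both steps stay enabled unless explicitly switched off.
        this.filter = Boolean.parseBoolean(conf.getProperty("parse.filter.urls", "true"));
        this.normalize = Boolean.parseBoolean(conf.getProperty("parse.normalize.urls", "true"));
    }
}
```

Reading the flags once, for example when the plugin is configured, keeps the per-outlink work down to two boolean checks.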

On 01/29/2013 01:16 PM, Markus Jelsma wrote:
> Hi,
> 
> Outlinks that reach the parse filters via ParseData are not normalized or filtered, but I believe they should be. If you try to do something sensible with the outlinks in a parse filter, you cannot rely on their accuracy. Should we not move the calls to ParseOutputFormat.filterNormalize into the parse plugin?
> 
> Any thoughts?
> Markus
>