Posted to user@nutch.apache.org by remi tassing <ta...@gmail.com> on 2012/09/08 19:30:40 UTC

Escaping URL during redirection

Hi guys,

I'm not quite sure how to make Nutch apply the normalizer regular
expressions to redirect targets. I see that some redirected URLs are not
properly escaped.

Any help?

Remi

Re: Escaping URL during redirection

Posted by remi tassing <ta...@gmail.com>.
Sorry, I think it works. I was trying 'parsechecker' and it doesn't apply
'regexnormalizer' rules by default.

So, case solved, thanks a lot!

On Sunday, September 9, 2012, Sebastian Nagel wrote:

> Redirects are filtered and normalized. It works for 1.4/1.5 and should for
> trunk.
> One subtlety: there is an extra scope for normalization of redirects
> ("fetcher").
> If scoped normalization rules/expressions are used don't forget to configure
> this scope with the appropriate regex-normalize rule file
> via property "urlnormalizer.regex.file.fetcher".
>
> On 09/08/2012 08:59 PM, Markus Jelsma wrote:
> > You mean the redirects followed by the fetcher (if enabled) are not
> > passed through the filters and normalizers? You can open an issue for that
> > and if possible provide a patch for trunk. An example of the fetcher
> > following filtered and normalized outlinks can be found in the fetcher
> > around line 1036.

Re: Escaping URL during redirection

Posted by Sebastian Nagel <wa...@googlemail.com>.
Redirects are filtered and normalized. This works in 1.4/1.5 and should work in trunk as well.
One subtlety: redirects are normalized under a separate scope ("fetcher").
If you use scoped normalization rules/expressions, don't forget to configure
this scope with the appropriate regex-normalize rule file
via the property "urlnormalizer.regex.file.fetcher".
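
A minimal sketch of what such a setup might look like, assuming the urlnormalizer-regex plugin is enabled in plugin.includes. The property name is the one mentioned above; the file name regex-normalize-fetcher.xml and the whitespace rule are hypothetical examples, not taken from this thread. First, the property in conf/nutch-site.xml:

```xml
<configuration>
  <property>
    <name>urlnormalizer.regex.file.fetcher</name>
    <!-- hypothetical file name; the file is assumed to live in conf/ -->
    <value>regex-normalize-fetcher.xml</value>
  </property>
</configuration>
```

Then the rule file itself, in the usual regex-normalize format of pattern/substitution pairs:

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- hypothetical rule: escape whitespace in redirect targets -->
  <regex>
    <pattern>\s</pattern>
    <substitution>%20</substitution>
  </regex>
</regex-normalize>
```

If the rules used for other scopes should also apply to redirects, pointing this property at the same rule file is probably the simplest option.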

On 09/08/2012 08:59 PM, Markus Jelsma wrote:
> You mean the redirects followed by the fetcher (if enabled) are not passed through the filters and normalizers? You can open an issue for that and if possible provide a patch for trunk. An example of the fetcher following filtered and normalized outlinks can be found in the fetcher around line 1036.


RE: Escaping URL during redirection

Posted by Markus Jelsma <ma...@openindex.io>.
You mean the redirects followed by the fetcher (if enabled) are not passed through the filters and normalizers? You can open an issue for that and if possible provide a patch for trunk. An example of the fetcher following filtered and normalized outlinks can be found in the fetcher around line 1036.
 
 