Posted to user@nutch.apache.org by Alexander Fahlke <al...@googlemail.com> on 2011/09/05 12:06:06 UTC

RegEx URL Normalizer

Hi!

I am having trouble setting up the RegexURLNormalizer correctly. It should
strip out some parameters for a specific script.
Only URLs that contain "document.py" should be normalized.

Here is an example:

  Input:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
  Output:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf

Date, Sort, Page, pos, anz are the parameters to be stripped out.

I tried it with the following setup:

  ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)


How can I tell Nutch to use this regex only for pages with "document.py"?


BR

-- 
Alexander Fahlke
Software Development
www.informera.de

Re: RegEx URL Normalizer

Posted by Markus Jelsma <ma...@openindex.io>.

On Monday 05 September 2011 12:06:06 Alexander Fahlke wrote:
> How to tell nutch to use this regex only for pages with "document.py"?

You can modify the regex so that it only matches when document.py precedes
the parameter, using a look-behind assertion. Nutch 1.4-dev uses
java.util.regex instead of Apache ORO in the normalizer, so look-behind is
supported there.
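
For readers who have not used look-behind before, here is a minimal sketch of the idea in plain java.util.regex. It is not taken from Nutch; the class name, the 500-character bound and the exact pattern are assumptions made for illustration. A bounded quantifier is used so that the look-behind has a finite maximum length.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: strips Date, Sort, Page, pos and anz
// only when "document.py?" appears earlier in the URL.
class DocumentPyNormalizerSketch {
    // Two look-behinds guard each match:
    //  1. "document.py?" must occur at most 500 non-'#' characters earlier.
    //  2. The parameter name must directly follow '?' or '&'.
    // The match itself is "name=value" plus one trailing '&' if present.
    static final Pattern UNWANTED = Pattern.compile(
        "(?<=document\\.py\\?[^#]{0,500})(?<=[?&])"
        + "(?:Date|Sort|Page|pos|anz)=[^&#]*&?");

    static String normalize(String url) {
        return UNWANTED.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String in = "http://www.example.com/cgi-bin/category/document.py"
            + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519"
            + "&pos=1644&anz=1952&Blank=1.pdf";
        System.out.println(normalize(in));
    }
}
```

If such a pattern were placed in the normalizer's XML rule file, the "<" and "&" characters would have to be written as "&lt;" and "&amp;". Note that this sketch leaves a dangling "?" or "&" when a stripped parameter is the last one in the query string, so a real rule would need an extra cleanup step.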


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

RE: RegEx URL Normalizer

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Check the last normalizer rule in the file linked below; it uses the look-behind operator to remove duplicate slashes except the first two.

Cheers,

http://svn.apache.org/viewvc/nutch/trunk/conf/regex-normalize.xml.template?view=markup
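
As a rough illustration of that kind of rule, the sketch below uses a negative look-behind to collapse runs of slashes unless they directly follow a colon. The pattern is written from memory of the template and should be treated as an assumption, as should the class and method names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class DuplicateSlashSketch {
    // Negative look-behind: a run of two or more slashes is matched only
    // when the character immediately before the run is not ':'.
    // That keeps the "//" of "http://" intact.
    static final Pattern EXTRA_SLASHES = Pattern.compile("(?<!:)/{2,}");

    static String collapse(String url) {
        return EXTRA_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(collapse("http://www.example.com//a///b.html"));
    }
}
```

The same mechanism can be turned around for a domain check: a positive look-behind such as (?<=...) in front of the parameter pattern makes the rule apply only when the given text precedes the match.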
 
 
-----Original message-----
> From:Magnús Skúlason <ma...@gmail.com>
> Sent: Mon 22-Oct-2012 00:34
> To: user@nutch.apache.org
> Cc: dkavraal@gmail.com; Markus Jelsma <ma...@openindex.io>
> Subject: Re: RegEx URL Normalizer
> 
> Hi,
> 
> I am interested in doing this i.e. only strip out parameters from url
> if some other string is found as well, in my case it will be a domain
> name. I am using 1.5.1 but I am unfamiliar with the look-behind
> operator.
> 
> Does anyone have a sample of how this is done?
> 
> best regards,
> Magnus

Re: RegEx URL Normalizer

Posted by Magnús Skúlason <ma...@gmail.com>.
Hi,

I am interested in doing this i.e. only strip out parameters from url
if some other string is found as well, in my case it will be a domain
name. I am using 1.5.1 but I am unfamiliar with the look-behind
operator.

Does anyone have a sample of how this is done?

best regards,
Magnus

On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke
<al...@googlemail.com> wrote:
> Thanks guys!
>
> @Dinçer: This does not check if the URL contains "document.py". :(
>
> @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
> RegexURLNormalizer. ;)
>
>   -->  regexNormalize(String urlString, String scope) { ...
>
>   It now simply checks whether urlString contains "document.py" and then
> cuts out the unwanted parameters.
>   I even made this configurable via nutch-site.xml.
>
>
> Nutch 1.4 would be better for this. Maybe in the next project.
>
>
> BR

Re: RegEx URL Normalizer

Posted by Alexander Fahlke <al...@googlemail.com>.
Thanks guys!

@Dinçer: This does not check if the URL contains "document.py". :(

@Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
RegexURLNormalizer. ;)

  -->  regexNormalize(String urlString, String scope) { ...

  It now simply checks whether urlString contains "document.py" and then
cuts out the unwanted parameters.
  I even made this configurable via nutch-site.xml.
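
The customized code itself was not posted. A minimal sketch of such a "contains check, then strip" approach could look like the following; the class, method and constant names are hypothetical, and it is a standalone helper rather than a patch to Nutch's regexNormalize.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone helper; the real customization lived inside
// RegexURLNormalizer and read its settings from nutch-site.xml.
class ConditionalStripSketch {
    // Parameter names to drop (assumed to be configurable in the real code).
    static final Set<String> UNWANTED =
        new HashSet<>(Arrays.asList("Date", "Sort", "Page", "pos", "anz"));

    static String strip(String url, String marker, Set<String> unwanted) {
        int q = url.indexOf('?');
        // Leave the URL alone if it has no query or lacks the marker.
        if (q < 0 || !url.contains(marker)) {
            return url;
        }
        String base = url.substring(0, q);
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(q + 1).split("&")) {
            String name = pair.split("=", 2)[0];
            if (!unwanted.contains(name)) {
                kept.add(pair);
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

Rebuilding the query string from the kept pairs avoids the leftover "?" or "&" that a pure regex replacement can produce. Fragments ("#...") are not handled in this sketch.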


Nutch 1.4 would be better for this. Maybe in the next project.


BR

On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal <dk...@gmail.com> wrote:

> Hi Alexander,
>
> Would this one work? (I am far away from a Nutch installation to test)
>
> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
>
> Don't forget to use &amp; instead of & in the regex.
>
> Best,
> Dinçer


-- 
Alexander Fahlke
Software Development
www.informera.de

Re: RegEx URL Normalizer

Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi Alexander,

Would this one work? (I am far away from a Nutch installation to test)

(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))

Don't forget to write &amp; instead of & when the regex goes into the XML
configuration file.
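
The regex above can be tried outside Nutch with java.util.regex. The sketch below assumes a substitution of "$1", which the message does not state, and uses a hypothetical class name. It also shows that the rule fires on any URL, not only on those containing "document.py".

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the "$1" substitution is an assumption.
class SuggestedRegexSketch {
    static final Pattern P = Pattern.compile(
        "(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+"
        + "|([?&](?:Name|Art|Blank|nr)=[^&?]*))");

    static String apply(String url) {
        // Group 1 holds a parameter to keep and is written back unchanged.
        // For an unwanted parameter group 1 does not participate in the
        // match, so Java substitutes nothing and the parameter disappears.
        return P.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Matches even though the URL has nothing to do with document.py.
        System.out.println(apply("http://other.example.com/list.php?Page=2"));
    }
}
```

One more caveat: when an unwanted parameter comes first in the query string, its leading "?" is removed along with it, so the next parameter ends up attached with "&" only.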

Best,
Dinçer

