Posted to user@nutch.apache.org by Alexander Fahlke <al...@googlemail.com> on 2011/09/05 12:06:06 UTC
RegEx URL Normalizer
Hi!
I have problems with the right setup of the RegExURLNormalizer. It should
strip out some parameters for a specific script.
Only pages where "document.py" is present should be normalized.
Here is an example:
Input:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
Output:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
Date, Sort, Page, pos, anz are the parameters to be stripped out.
I tried it with the following setup:
([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
How can I tell Nutch to use this regex only for pages with "document.py"?
BR
--
Alexander Fahlke
Software Development
www.informera.de
Re: RegEx URL Normalizer
Posted by Markus Jelsma <ma...@openindex.io>.
On Monday 05 September 2011 12:06:06 Alexander Fahlke wrote:
> Hi!
>
> I have problems with the right setup of the RegExURLNormalizer. It should
> strip out some parameters for a specific script.
> Only pages where "document.py" is present should be normalized.
>
> Here is an example:
>
> Input:
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
> Output:
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
>
> Date, Sort, Page, pos, anz are the parameters to be stripped out.
>
> I tried it with the following setup:
>
> ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
>
>
> How can I tell Nutch to use this regex only for pages with "document.py"?
You can modify the regex so that it only matches when preceded by document.py,
using a look-behind. Nutch 1.4-dev uses java.util.regex instead of Apache ORO
in the normalizer, so look-behind is supported there.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
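As a rough illustration of the look-behind suggestion above, here is a minimal Java sketch using java.util.regex. It has not been run against a Nutch installation. The class name, the method name and the {0,1000} bound are assumptions made for the example; the bound is there because java.util.regex only accepts look-behinds whose maximum length is bounded.

```java
import java.util.regex.Pattern;

class LookBehindSketch {
    // Matches one unwanted parameter (plus a trailing '&' if present), but only
    // when "document.py?" occurs earlier in the URL and the parameter starts
    // directly after '?' or '&'. The {0,1000} bound is an arbitrary assumption.
    static final Pattern UNWANTED = Pattern.compile(
        "(?i)(?<=document\\.py\\?[^#\\s]{0,1000})(?<=[?&])"
        + "(?:date|sort|page|pos|anz)=[^&#]*&?");

    // Removes every match; URLs without "document.py?" are returned unchanged.
    static String strip(String url) {
        return UNWANTED.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip(
            "http://www.example.com/cgi-bin/category/document.py"
            + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109"
            + "&nr=16519&pos=1644&anz=1952&Blank=1.pdf"));
    }
}
```

If a pattern like this were placed in regex-normalize.xml, the characters < and & would have to be written as XML entities. Note also that this sketch leaves a trailing & behind when the stripped parameter is the last one in the query string.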
RE: RegEx URL Normalizer
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
Check the bottom normalizer in the file linked below; it uses the look-behind operator to remove double slashes, except the first two.
Cheers,
http://svn.apache.org/viewvc/nutch/trunk/conf/regex-normalize.xml.template?view=markup
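To show what such a rule does, here is a small Java sketch of a negative look-behind that collapses repeated slashes while leaving the two after the protocol alone. The pattern is written from memory as an assumption about what the template rule looks like, so check the linked file for the exact rule; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DoubleSlashSketch {
    // Two or more slashes that are NOT directly preceded by ':' are collapsed
    // into one. The "//" after "http:" is skipped by the negative look-behind.
    static final Pattern EXTRA_SLASHES = Pattern.compile("(?<!:)/{2,}");

    static String collapse(String url) {
        return EXTRA_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(collapse("http://www.example.com//a///b/c"));
    }
}
```

The same construct, with a positive look-behind in place of the negative one, is what allows a rule to depend on text that appears before the match without consuming it.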
-----Original message-----
> From:Magnús Skúlason <ma...@gmail.com>
> Sent: Mon 22-Oct-2012 00:34
> To: user@nutch.apache.org
> Cc: dkavraal@gmail.com; Markus Jelsma <ma...@openindex.io>
> Subject: Re: RegEx URL Normalizer
>
> Hi,
>
> I am interested in doing this, i.e. only stripping parameters from a URL
> if some other string is found as well; in my case it will be a domain
> name. I am using 1.5.1, but I am unfamiliar with the look-behind
> operator.
>
> Does anyone have a sample of how this is done?
>
> best regards,
> Magnus
Re: RegEx URL Normalizer
Posted by Magnús Skúlason <ma...@gmail.com>.
Hi,
I am interested in doing this, i.e. only stripping parameters from a URL
if some other string is found as well; in my case it will be a domain
name. I am using 1.5.1, but I am unfamiliar with the look-behind
operator.
Does anyone have a sample of how this is done?
best regards,
Magnus
On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke
<al...@googlemail.com> wrote:
> Thanks guys!
>
> @Dinçer: This does not check if the URL contains "document.py". :(
>
> @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
> RegexURLNormalizer. ;)
>
> --> regexNormalize(String urlString, String scope) { ...
>
> It now simply checks whether urlString contains "document.py" and, if so,
> cuts out the unwanted parameters.
> I even made this configurable via nutch-site.xml.
>
>
> Nutch 1.4 would be better for this. Maybe in the next project.
>
>
> BR
Re: RegEx URL Normalizer
Posted by Alexander Fahlke <al...@googlemail.com>.
Thanks guys!
@Dinçer: This does not check if the URL contains "document.py". :(
@Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
RegexURLNormalizer. ;)
--> regexNormalize(String urlString, String scope) { ...
It now simply checks whether urlString contains "document.py" and, if so,
cuts out the unwanted parameters.
I even made this configurable via nutch-site.xml.
Nutch 1.4 would be better for this. Maybe in the next project.
BR
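The approach described above (check for a marker string, then cut out the unwanted parameters) could look roughly like the following Java sketch. This is not the actual customization of RegexURLNormalizer, which is not shown in the thread; the class name, method name and hard-coded parameter set are assumptions made for illustration, and no look-behind is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ScopedParamStripper {
    // Parameter names to drop, compared case-insensitively (assumed list).
    static final Set<String> UNWANTED =
        new HashSet<>(Arrays.asList("date", "sort", "page", "pos", "anz"));

    // Strips unwanted parameters only if the URL contains the marker string.
    static String normalize(String url, String marker) {
        int q = url.indexOf('?');
        if (q < 0 || !url.contains(marker)) {
            return url; // nothing to do for other URLs
        }
        String base = url.substring(0, q);
        String query = url.substring(q + 1);
        String fragment = "";
        int hash = query.indexOf('#');
        if (hash >= 0) { // keep a trailing #fragment untouched
            fragment = query.substring(hash);
            query = query.substring(0, hash);
        }
        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = (eq >= 0 ? pair.substring(0, eq) : pair)
                .toLowerCase(Locale.ROOT);
            if (!UNWANTED.contains(name)) {
                kept.add(pair);
            }
        }
        String newQuery = kept.isEmpty() ? "" : "?" + String.join("&", kept);
        return base + newQuery + fragment;
    }

    public static void main(String[] args) {
        System.out.println(normalize(
            "http://www.example.com/cgi-bin/category/document.py"
            + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109"
            + "&nr=16519&pos=1644&anz=1952&Blank=1.pdf",
            "document.py"));
    }
}
```

Splitting the query string instead of using one regex avoids leftover separators when the first or last parameter is removed.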
On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal <dk...@gmail.com> wrote:
> Hi Alexander,
>
> Would this one work? (I am far away from a Nutch installation to test)
>
> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
>
> Don't forget to use &amp; instead of & in the regex.
>
> Best,
> Dinçer
>
--
Alexander Fahlke
Software Development
www.informera.de
Re: RegEx URL Normalizer
Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi Alexander,
Would this one work? (I am far away from a Nutch installation to test)
(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
Don't forget to use &amp; instead of & in the regex.
Best,
Dinçer
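The regex above can be tried outside Nutch with a few lines of Java. The message does not say what the substitution should be; the sketch below assumes "$1", so that parameters matched by the capturing group are kept and the others are replaced by nothing. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class AlternationSketch {
    // The regex from the message, unchanged. Group 1 only participates when
    // one of the parameters to keep (Name, Art, Blank, nr) is matched.
    static final Pattern PARAMS = Pattern.compile(
        "(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+"
        + "|([?&](?:Name|Art|Blank|nr)=[^&?]*))");

    // Assumed substitution: "$1". In java.util.regex a group that did not
    // participate in the match contributes an empty string.
    static String apply(String url) {
        return PARAMS.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(apply(
            "http://www.example.com/cgi-bin/category/document.py"
            + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109"
            + "&nr=16519&pos=1644&anz=1952&Blank=1.pdf"));
        // A URL without document.py is rewritten as well.
        System.out.println(apply("http://www.example.com/other.py?Name=a&Date=1"));
    }
}
```

Two limitations show up with this substitution: the pattern applies to any URL, not only those containing "document.py", and if a stripped parameter comes first the leading ? is lost (document.py?Date=1&nr=2 becomes document.py&nr=2).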