Posted to user@nutch.apache.org by remi tassing <ta...@gmail.com> on 2014/11/01 23:03:03 UTC

Re: Ignoring parts of a URL like certain query parameters

Hi John,

Have a look at some regex tutorials. What you are asking for is absolutely
doable. E.g.:

<regex>
  <pattern>^(http://www\.test\.com\?.*)query2=.*&amp;(.*)</pattern>
  <substitution>$1$2</substitution>
</regex>

Please double-check whether the ampersand needs to be escaped (as &amp;) in
the XML file. I'm currently not able to verify the regex on this machine, so
it might not be 100% correct.
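
A quick way to sanity-check a rule like this offline is to run the same
pattern through a regex engine before putting it into regex-normalize.xml.
The snippet below is only a rough sketch in Python, not Nutch code: the
function name drop_query2 is made up for illustration, Python writes
backreferences as \1 and \2 instead of $1 and $2, and it narrows the
parameter value to [^&]* (an assumption on my side, so the match stops at
the next ampersand). Note that group 1 already ends with the "&" that
precedes query2, so the two groups are joined directly; putting another "&"
between them would produce "&&".

```python
import re

# Python translation of the rule, for offline checking only.
# Assumption: the value of query2 never contains a literal '&'.
PATTERN = re.compile(r"^(http://www\.test\.com\?.*)query2=[^&]*&(.*)")
REPLACEMENT = r"\1\2"  # the XML rule would use $1$2 here


def drop_query2(url: str) -> str:
    """Remove the query2 parameter, but only for www.test.com URLs."""
    return PATTERN.sub(REPLACEMENT, url)


url = "http://www.test.com?query1=value1&query2=value2&query3=value3"
print(drop_query2(url))

# URLs on other hosts do not match the anchored pattern and pass through as-is.
print(drop_query2("http://www.other.com?query1=a&query2=b&query3=c"))
```

This sketch does not cover the case where query2 is the last parameter (no
trailing "&"); that would need a second rule or an optional ampersand.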

Remi


On Fri, Oct 31, 2014 at 9:41 AM, John Smith <pf...@gmail.com> wrote:

> Just realized that we have regex-normalize.xml.
>
> If we want to universally remove a certain parameter for all URLs, then
> that is doable:
>
> <regex>
>   <pattern>query2=.*&</pattern>
>   <substitution></substitution>
> </regex>
>
> However, the problem with this approach would be that the same parameter
> which can be ignored for one URL might be valid for another URL.
>
> So, if we want to replace that parameter only for certain URLs, then the
> problem is that the crawler will replace the entire pattern with the given
> substitution.
>
> Example:
>
> Let's say we have the URL:
> "http://www.test.com?query1=value1&query2=value2&query3=value3"
>
> And we want the crawler to ignore the query2 parameter.
>
> If we enter something like this, it doesn't work because the entire match
> is stripped and the output is: query3=value3
>
> <regex>
>   <pattern>^http://www.test.com?.*query2=.*&</pattern>
>   <substitution></substitution>
> </regex>
>
> Is there any other way to do this?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Ignoring-parts-of-a-URL-like-certain-query-parameters-tp4166747p4166763.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>