Posted to user@nutch.apache.org by Sourabh Kasliwal <so...@mojostation.com> on 2011/01/12 06:21:53 UTC

Truncation of url after #

Hi,

While crawling some links, I found that Nutch truncates URLs that contain
a #. For example,
*http://www.techmeme.com/110111/p82#a110111p82* gets truncated to
*http://www.techmeme.com/110111/p82*

Can anyone please tell me why Nutch does this, or whether there is a
simple way to avoid it?

regards
Sourabh

Re: Truncation of url after #

Posted by charan kumar <ch...@gmail.com>.
Hello,

# marks an in-page anchor (the URL fragment), which means both URLs point to
the same web page.

The truncation is done by the URLNormalizers.
Comment out the following entry in regex-normalize.xml if you really need to
keep the fragment.
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>
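
To see what that rule does to the URL from the question, here is a minimal sketch in plain Java. It assumes the pattern is evaluated with java.util.regex semantics and that the XML entity &amp; decodes to a literal & in the expression. The class name FragmentRuleDemo and the helper stripFragment are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class FragmentRuleDemo {
    // Same expression as the regex-normalize.xml entry above,
    // with the XML entity &amp; decoded to a literal &.
    private static final Pattern FRAGMENT = Pattern.compile("#.*?(\\?|&|$)");

    // Hypothetical helper: replaces each match with "$1", i.e. it drops the
    // "#..." part and keeps whatever delimiter ended it (?, & or end of string).
    static String stripFragment(String url) {
        return FRAGMENT.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints http://www.techmeme.com/110111/p82
        System.out.println(stripFragment("http://www.techmeme.com/110111/p82#a110111p82"));
    }
}
```

With the entry commented out, no such substitution takes place, so the # part should survive this normalization step.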

Thanks,
Charan