Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/02/04 20:00:14 UTC

[jira] [Commented] (NUTCH-1711) Normalizer does not encode exclamation mark

    [ https://issues.apache.org/jira/browse/NUTCH-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891019#comment-13891019 ] 

Sebastian Nagel commented on NUTCH-1711:
----------------------------------------

Some sites use the exclamation mark regularly, e.g. [http://www.taz.de/Politik/!p4615/]. It is also part of AJAX URLs (NUTCH-1323). Web servers may behave differently if `!' is escaped to `%21'.
If the problem is specific to SolrCloud as one indexer back-end, why not escape it during indexing, either in a normalizer with scope indexer or in the back-end plugin code (only for the id field)?
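
A minimal sketch of the second option, assuming the escaping is applied only to the value written to the id field while the URL itself stays untouched for fetching. The class name `SolrIdEscaper` and the method `escapeId` are hypothetical and not part of the Nutch or Solr API; where such a call would be hooked into the indexer plugin is left open.

```java
// Hypothetical helper, not existing Nutch code: escapes '!' only for the
// document id, so the fetched URL is never rewritten.
public class SolrIdEscaper {

    /**
     * Percent-encodes every exclamation mark in the given URL so that the
     * result can be used as a document id without containing the character
     * SolrCloud uses as composite ID delimiter.
     */
    public static String escapeId(String url) {
        return url.replace("!", "%21");
    }

    public static void main(String[] args) {
        String url = "http://nutch.apache.org/bla!";
        // The url field keeps the original form, only the id is escaped.
        System.out.println("url: " + url);
        System.out.println("id:  " + escapeId(url));
    }
}
```

With this approach the crawler still requests http://www.taz.de/Politik/!p4615/ unchanged, and only the id sent to the indexing back-end carries `%21'.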

> Normalizer does not encode exclamation mark
> -------------------------------------------
>
>                 Key: NUTCH-1711
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1711
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.8
>
>
> {code}
> $ bin/nutch org.apache.nutch.net.URLNormalizerChecker
> Checking combination of all URLNormalizers available
> http://nutch.apache.org/bla!
> http://nutch.apache.org/bla!
> {code}
> I never noticed until just now that many URL encoders do not encode the exclamation mark. SolrCloud uses the character to delimit the composite ID, so if an ID ends with the exclamation mark, you will get an error!
> Any thoughts on this?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)