Posted to user@nutch.apache.org by "Ramanathapuram, Rajesh" <Ra...@turner.com> on 2011/10/04 02:19:43 UTC
Re: Nutch not crawling URLs with spanish accented characters ( ñ)
Thanks Markus, I'll try it and let you know in the morning.
Rajesh Ramana
On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <ma...@openindex.io> wrote:
>
>> Looks like you're using protocol-httpclient, try again with the
>> protocol-http plugin instead. We crawled a large part of Wikipedia for
>> test purposes and all global modern character sets worked just fine.
>>
>> Can you fetch:
>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>>
>> with parse or index checker? It works fine here.
>
> try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
> with both protocol-httpclient and protocol-http.
>
>>
>>> I am trying to crawl a website which has link(s) with spanish/latin
>>> characters in the url filename. I can't get Nutch to crawl the page(s)
>>> with spanish accented chars in URL.
>>>
>>> Link: http://mydomain.com/en Español.aspx or
>>> http://mydomain.com/en%20Español.aspx
>>>
>>>
>>>
>>> I tried to substitute the URL-encoded form (%F1) for the special
>>> character (ñ), and %20 for the space (" "); the whole list is here:
>>> <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
>>>
>>> The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
>>> browser.
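
A side note on the %F1 escape: it is the ISO-8859-1 byte for ñ, whereas the Wikipedia URL quoted earlier in this thread uses the UTF-8 form %C3%B1. The sketch below uses Python's standard urllib.parse.quote to show both forms. The helper name escape_path is made up for illustration, and which encoding the server at mydomain.com expects is not known from this thread, so treat that choice as an assumption to test.

```python
from urllib.parse import quote, unquote


def escape_path(path, encoding="utf-8"):
    """Percent-encode a URL path, leaving '/' unescaped (hypothetical helper)."""
    return quote(path, safe="/", encoding=encoding)


# The same path escaped under two different byte encodings.
utf8_form = escape_path("/en Español.aspx")
latin1_form = escape_path("/en Español.aspx", encoding="iso-8859-1")

print(utf8_form)    # /en%20Espa%C3%B1ol.aspx
print(latin1_form)  # /en%20Espa%F1ol.aspx

# Decoding only round-trips when the same encoding is used on both sides.
print(unquote(latin1_form, encoding="iso-8859-1"))
```

If the browser link with %F1 works, the server is probably decoding paths as ISO-8859-1 (or Windows-1252), but trying the %C3%B1 form as well is a cheap check.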
>>>
>>>
>>>
>>> I tried to use regex URL normalizer to do the substitution in
>>> regex-normalize.xml file as below (%20 is for " ") and (%F1 for the
>>> special character ñ).
>>>
>>> <!-- replaces blank space(" ") in URL with escaped "%20" -->
>>> <regex>
>>>   <pattern> </pattern>
>>>   <substitution>%20</substitution>
>>> </regex>
>>>
>>> <!-- replaces accented char("ñ") in URL with escaped "%F1" -->
>>> <regex>
>>>   <pattern>ñ</pattern>
>>>   <substitution>%F1</substitution>
>>> </regex>
>>>
>>>
>>>
>>> The former (blank space) substitution works fine, but I am having
>>> trouble with the latter (ñ) substitution: I get a FATAL error (pointing
>>> to the ñ location in the file) in the command prompt and the error
>>> below in my hadoop log.
>>>
>>> ERROR regex.RegexURLNormalizer - error parsing conf file:
>>> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2
>>> of 4-byte UTF-8 sequence.
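
The error text fits a file saved as ISO-8859-1 but read as UTF-8: ñ is the single byte 0xF1 in ISO-8859-1, and in UTF-8 a byte of that form (11110xxx) announces a 4-byte sequence, so the next byte is rejected. The sketch below reproduces the same situation with Python's xml.etree.ElementTree as a stand-in for Xerces (the two parsers word their errors differently). The helper parse_pattern is hypothetical, and whether regex-normalize.xml was really saved in ISO-8859-1 is an assumption.

```python
import xml.etree.ElementTree as ET


def parse_pattern(xml_bytes):
    """Return the text of the root element, or None if the bytes are not well-formed XML."""
    try:
        return ET.fromstring(xml_bytes).text
    except ET.ParseError:
        return None


# Raw 0xF1 with no encoding declaration: the parser assumes UTF-8 and fails.
latin1_undeclared = parse_pattern(b"<pattern>\xf1</pattern>")

# Same byte, but the file declares its real encoding.
latin1_declared = parse_pattern(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><pattern>\xf1</pattern>'
)

# The character saved as UTF-8 bytes (0xC3 0xB1).
utf8_saved = parse_pattern("<pattern>ñ</pattern>".encode("utf-8"))

# A numeric character reference, which is pure ASCII and encoding-independent.
char_ref = parse_pattern(b"<pattern>&#241;</pattern>")

print(latin1_undeclared, latin1_declared, utf8_saved, char_ref)
```

Any of the last three variants should give the parser a real ñ; the character reference (&#241;) is the least sensitive to how an editor saves the file.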
>>>
>>>
>>>
>>> Then I tried changing the character encoding in nutch-site.xml file
>>>
>>> <property>
>>>   <name>parser.character.encoding.default</name>
>>>   <value>ISO-8859-1</value>
>>>   <description>The character encoding to fall back to when no other
>>>   information is available</description>
>>> </property>
>>>
>>> And in the regex-normalize.xml file as below
>>>
>>> <regex>
>>>   <pattern>U+00F1</pattern>
>>>   <substitution>%F1</substitution>
>>> </regex>
>>>
>>>
>>>
>>> Now I don't get any error in the command prompt, but I do get the error
>>> below in my hadoop log. It looks like the substitution is happening, but
>>> instead of "%F1" it uses "?".
>>>
>>>
>>>
>>> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
>>>
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
>>>
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
>>>
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
>>>
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
>>>
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
>>>
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
>>>
>>> 2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of
>>> http://mydomain.com/en%20Espa?ol.aspx failed with:
>>> java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
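
One possible reading of this log: the pattern U+00F1 is not a way of writing ñ in a regular expression. It means "one or more U, then the literal text 00F1", so it never matches the URL and ñ is left in place. The "?" could then come from a later step that forces the URL into an ASCII-only form; that part is a guess, not something the log confirms. The sketch below shows both effects in Python. Java's java.util.regex also understands the \u00F1 escape, but whether Nutch's regex normalizer passes it through unchanged is an assumption.

```python
import re

url = "http://mydomain.com/en%20Español.aspx"

# "U+00F1" is parsed as: "U" repeated one or more times, followed by "00F1".
# Nothing in the URL matches, so the string comes back unchanged.
literal_notation = re.sub("U+00F1", "%F1", url)

# "\u00F1" is the regex escape for the code point U+00F1 (ñ).
escaped = re.sub(r"\u00F1", "%F1", url)

# Forcing a non-ASCII character into ASCII with replacement yields "?".
ascii_forced = url.encode("ascii", errors="replace").decode("ascii")

print(literal_notation)
print(escaped)
print(ascii_forced)
```

The third value matches the URL in the error message above, which is why an ASCII conversion somewhere in the fetch path looks like a plausible suspect.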
>>>
>>>
>>>
>>>
>>>
>>> Can anyone help me with this issue? Are there any other config changes I
>>> need to make to get this to work?
>>>
>>>
>>>
>>> Thanks in advance, any help in resolving this issue is much appreciated.
>>>
>>>
>>>
>>> thanks & regards,
>>> Rajesh Ramana