Posted to user@nutch.apache.org by "Ramanathapuram, Rajesh" <Ra...@turner.com> on 2011/10/04 02:19:43 UTC
Re: Nutch not crawling URLs with spanish accented characters ( ñ)

Thanks Markus, I'll try it and let you know in the morning.


Rajesh Ramana




On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <ma...@openindex.io> wrote:

> 
>> Looks like you're using protocol-httpclient, try again with the
>> protocol-http plugin instead. We crawled a large part of Wikipedia for
>> test purposes and all global modern character sets worked just fine.
>> 
>> Can you fetch:
>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>> 
>> with parse or index checker? It works fine here.
> 
> try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
> with both protocol-httpclient and protocol-http.
> 
>> 
>>> I am trying to crawl a website which has link(s) with spanish/latin
>>> characters in the url filename. I can't get Nutch to crawl the page(s)
>>> with spanish accented chars in URL.
>>> 
>>>  Link: http://mydomain.com/en Español.aspx
>>>  or http://mydomain.com/en%20Español.aspx
>>> 
>>> 
>>> 
>>> I tried substituting the URL-encoded form (%F1) for the special character (ñ),
>>> with %20 for the space (" "); the whole list is here:
>>> <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
>>> 
>>>  The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
>>> browser.
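The two escaped forms that appear in this thread (%F1 in the message above, %C3%B1 in the Wikipedia URL from the reply) differ only in the charset used before percent-encoding. The following is a minimal Python sketch of that difference, not Nutch code; the path is the example from the message, and which form a given server accepts is something to check against that server.

```python
from urllib.parse import quote, unquote

# Example path taken from the message above.
path = "/en Español.aspx"

# Percent-encode the UTF-8 bytes: ñ is two bytes (C3 B1).
utf8_path = quote(path, encoding="utf-8")

# Percent-encode the ISO-8859-1 bytes: ñ is one byte (F1).
latin1_path = quote(path, encoding="iso-8859-1")

print(utf8_path)    # /en%20Espa%C3%B1ol.aspx
print(latin1_path)  # /en%20Espa%F1ol.aspx

# Decoding round-trips only when the same charset is assumed on both sides.
round_trip = unquote(latin1_path, encoding="iso-8859-1")
```

Both results are plain ASCII, so either one can be passed to a client that rejects raw non-ASCII characters in the path.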
>>> 
>>> 
>>> 
>>> I tried to use regex URL normalizer to do the substitution in
>>> regex-normalize.xml file as below  (%20 is for " ") and (%F1 for the
>>> special character ñ).
>>> 
>>> <!-- replaces blank space(" ") in URL with escaped "%20"  -->
>>> 
>>> <regex>
>>> 
>>>  <pattern> </pattern>
>>> 
>>>  <substitution>%20</substitution>
>>> 
>>> </regex>
>>> 
>>> 
>>> 
>>> <!-- replaces accented char("ñ") in URL with escaped "%F1"  -->
>>> 
>>> <regex>
>>> 
>>>  <pattern>ñ</pattern>
>>> 
>>>  <substitution>%F1</substitution>
>>> 
>>> </regex>
>>> 
>>> 
>>> 
>>> The former (blank space) substitution works fine, but I am having trouble with
>>> the latter (ñ) substitution: I get a FATAL error (pointing to the ñ
>>> location in the file) in the command prompt and the error below in my
>>> hadoop log.
>>> 
>>>     ERROR regex.RegexURLNormalizer - error parsing conf file:
>>> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2
>>> of 4-byte UTF-8 sequence.
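An error of this shape is what a UTF-8 decoder reports when it meets the single byte 0xF1: in UTF-8 that byte announces a 4-byte sequence, and the "<" that follows is not a valid second byte. The sketch below reproduces the situation in Python, on the assumption (not confirmed in the thread) that regex-normalize.xml was saved as ISO-8859-1 or Windows-1252 while being parsed as UTF-8. It also shows the numeric character reference `&#xF1;`, which is standard XML and keeps the file pure ASCII.

```python
import xml.etree.ElementTree as ET

pattern_line = "<pattern>ñ</pattern>"

# Assumption: the editor saved the file as ISO-8859-1, so ñ is the byte 0xF1.
latin1_bytes = pattern_line.encode("iso-8859-1")

try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # 0xF1 starts a 4-byte UTF-8 sequence, but the next byte is "<".
    decoded_ok = False

# The same line saved as UTF-8 parses without trouble.
utf8_text = ET.fromstring(pattern_line.encode("utf-8")).text

# A numeric character reference avoids the file-encoding question entirely.
ref_text = ET.fromstring(b"<pattern>&#xF1;</pattern>").text
```

Re-saving the file as UTF-8, or writing the character as `&#xF1;`, would both be ways to get past the parser under this assumption.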
>>> 
>>> 
>>> 
>>> Then I tried changing the character encoding in nutch-site.xml file
>>> 
>>> <property>
>>> 
>>>  <name>parser.character.encoding.default</name>
>>> 
>>>  <value>ISO-8859-1</value>
>>> 
>>>  <description>The character encoding to fall back to when no other
>>>  information is available</description>
>>> 
>>> </property>
>>> 
>>>  And in the regex-normalize.xml file as below
>>> 
>>> <regex>
>>> 
>>>  <pattern>U+00F1</pattern>
>>> 
>>>  <substitution>%F1</substitution>
>>> 
>>> </regex>
>>> 
>>> 
>>> 
>>> Now I don't get any error in the command prompt, but I do get the error below
>>> in my hadoop log. It looks like the substitution is happening, but instead
>>> of "%F1" it uses "?".
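Read as a regular expression, `U+00F1` is not the code point ñ: it means one or more letters "U" followed by the literal text "00F1", so it would not match anything in this URL. The sketch below uses Python's `re` module to illustrate; it assumes the normalizer hands the pattern to a standard Java regex engine, which treats the `+` quantifier and the `\u00F1` escape the same way in this respect.

```python
import re

# Example URL from the message above.
url = "http://mydomain.com/en%20Español.aspx"

# "U+00F1" as a regex: one or more "U", then the literal characters 00F1.
literal_attempt = re.sub("U+00F1", "%F1", url)

# "\u00F1" is the regex escape for the code point U+00F1 (ñ).
escaped = re.sub(r"\u00F1", "%F1", url)

print(literal_attempt)  # http://mydomain.com/en%20Español.aspx (unchanged)
print(escaped)          # http://mydomain.com/en%20Espa%F1ol.aspx
```

If the first pattern never matches, the ñ reaches the fetcher unescaped, which would be consistent with the "escaped absolute path not valid" error that follows.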
>>> 
>>> 
>>> 
>>> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
>>> 
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
>>> 
>>> 2011-10-03 16:44:02,126 INFO  fetcher.Fetcher - fetch of
>>> http://mydomain.com/en%20Espa?ol.aspx failed with:
>>> java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Can anyone help me with this issue? Are there any other config changes I
>>> need to make to get this to work?
>>> 
>>> 
>>> 
>>> Thanks in advance, any help in resolving this issue is much appreciated.
>>> 
>>> 
>>> 
>>> thanks & regards,
>>> Rajesh Ramana