Posted to user@nutch.apache.org by "Ramanathapuram, Rajesh" <Ra...@turner.com> on 2011/10/03 23:27:14 UTC

Nutch not crawling URLs with spanish accented characters (ñ)

Hi, 

 

I am trying to crawl a website whose links contain Spanish/Latin accented characters in the URL filename, but I can't get Nutch to crawl those pages.

  Link: http://mydomain.com/en Español.aspx or http://mydomain.com/en%20Español.aspx

 

I tried substituting the URL encoding %F1 for the special character ñ (and %20 for the space " "); the full list of encodings is here: <http://www.w3schools.com/TAGS/ref_urlencode.asp>.


  The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the browser
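
As a minimal sketch (Python purely for illustration, since Nutch itself is Java), the two percent-escapes of ñ that appear in this thread can be reproduced as follows. %F1 is what ISO-8859-1 produces and %C3%B1 is what UTF-8 produces; which one a given server accepts depends on the server, and the variable names here are made up for the example.

```python
from urllib.parse import quote

# The filename from the example link, with a literal space and a literal ñ.
path = "en Español.aspx"

# ISO-8859-1 stores ñ as the single byte 0xF1, hence the escape %F1.
latin1_escaped = quote(path, encoding="latin-1")

# UTF-8 stores ñ as the two bytes 0xC3 0xB1, hence the escape %C3%B1.
utf8_escaped = quote(path, encoding="utf-8")

print(latin1_escaped)  # en%20Espa%F1ol.aspx
print(utf8_escaped)    # en%20Espa%C3%B1ol.aspx
```

The two results name different byte sequences, so a URL that works with one escape is not guaranteed to work with the other.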

 

I tried to use the regex URL normalizer to do the substitution in the regex-normalize.xml file, as below (%20 for " " and %F1 for the special character ñ).

<!-- replaces blank space (" ") in URL with escaped "%20" -->
<regex>
  <pattern> </pattern>
  <substitution>%20</substitution>
</regex>

<!-- replaces accented char ("ñ") in URL with escaped "%F1" -->
<regex>
  <pattern>ñ</pattern>
  <substitution>%F1</substitution>
</regex>

 

The former (blank space) substitution works fine, but the latter (ñ) substitution fails: I get a FATAL error (pointing to the location of ñ in the file) at the command prompt and the error below in my hadoop log.

     ERROR regex.RegexURLNormalizer - error parsing conf file: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
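
A parser error of this kind is what one would expect if the file was saved in a single-byte encoding such as ISO-8859-1 while the XML parser assumed UTF-8: the lone byte 0xF1 looks like the start of a 4-byte UTF-8 sequence, and the byte after it is not a valid continuation. The sketch below reproduces that situation with Python's standard XML parser rather than Xerces, so it is an illustration under that assumption, not a trace of what Nutch does; `parses` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

body = "<regex><pattern>ñ</pattern><substitution>%F1</substitution></regex>"

# Saved as ISO-8859-1, ñ becomes the single byte 0xF1.
latin1_bytes = body.encode("latin-1")


def parses(data: bytes) -> bool:
    """Return True if the bytes are well-formed XML for this parser."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# No encoding declaration: the parser assumes UTF-8 and rejects 0xF1.
no_declaration = parses(latin1_bytes)

# Declaring the real encoding of the file makes the same bytes acceptable.
with_declaration = parses(
    b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes
)

# Saving the file as UTF-8 also works without any declaration.
as_utf8 = parses(body.encode("utf-8"))

# A numeric character reference keeps the file pure ASCII.
char_ref = ET.fromstring(b"<pattern>&#xF1;</pattern>").text

print(no_declaration, with_declaration, as_utf8, char_ref)
```

Any of the three alternatives (declaring the encoding, saving as UTF-8, or writing `&#xF1;`) avoids the malformed-byte problem in this sketch.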

  

Then I tried changing the character encoding in the nutch-site.xml file:

<property>
  <name>parser.character.encoding.default</name>
  <value>ISO-8859-1</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>

and the pattern in the regex-normalize.xml file, as below:

<regex>
  <pattern>U+00F1</pattern>
  <substitution>%F1</substitution>
</regex>
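
Read as a regular expression, `U+00F1` means "one or more U, followed by the literal text 00F1", so it never matches ñ. The regex escape for the code point is `\u00F1`. The sketch below shows the difference with Python's `re` module; it assumes the normalizer passes the pattern to a regex engine that supports the same `\uXXXX` escape (Java's `java.util.regex` does), and the variable names are made up for the example.

```python
import re

url = "http://mydomain.com/en%20Español.aspx"

# "U+00F1" is a quantified "U" plus the literal "00F1"; nothing in the URL matches.
literal_attempt = re.sub("U+00F1", "%F1", url)

# "\u00F1" is the regex escape for the code point U+00F1, i.e. ñ.
escaped_attempt = re.sub(r"\u00F1", "%F1", url)

print(literal_attempt)  # unchanged: http://mydomain.com/en%20Español.aspx
print(escaped_attempt)  # http://mydomain.com/en%20Espa%F1ol.aspx
```

Because the first pattern matches nothing, the URL reaches the fetcher with the ñ still unescaped.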

 

Now I don't get any error at the command prompt, but I do get the error below in my hadoop log. It looks like the substitution is happening, but instead of "%F1" it uses "?".

ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
2011-10-03 16:44:02,126 INFO  fetcher.Fetcher - fetch of http://mydomain.com/en%20Espa?ol.aspx failed with: java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
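
The thread does not establish where the "?" comes from. One plausible mechanism, offered only as an assumption, is a lossy charset conversion: when a string containing ñ is encoded to a charset that cannot represent it, the character is commonly replaced by "?". The sketch below shows that effect in Python, and also that a literal "?" changes how the URL is split; the variable names are made up for the example.

```python
from urllib.parse import urlsplit

name = "Español"

# Encoding to ASCII with replacement turns the unmappable ñ into "?".
lossy = name.encode("ascii", errors="replace").decode("ascii")

# In a URL, "?" starts the query string, so the path is cut short.
parts = urlsplit("http://mydomain.com/en%20Espa?ol.aspx")

print(lossy)        # Espa?ol
print(parts.path)   # /en%20Espa
print(parts.query)  # ol.aspx
```

If something like this is happening, the "?" would be a symptom of the ñ never having been escaped rather than the result of the %F1 substitution.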

 

 

Can anyone help me with this issue? Are there any other config changes I need to make to get this to work?

Thanks in advance; any help in resolving this issue is much appreciated.

 

thanks & regards,
Rajesh Ramana

Re: Nutch not crawling URLs with spanish accented characters ( ñ)

Posted by "Ramanathapuram, Rajesh" <Ra...@turner.com>.

Thanks Markus, I'll try it and let you know in the morning.


Rajesh Ramana




On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <ma...@openindex.io> wrote:

> 
>> Looks like you're using protocol-httpclient, try again with the
>> protocol-http plugin instead. We crawled a large part of Wikipedia for
>> test purposes and all global modern character sets worked just fine.
>> 
>> Can you fetch:
>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>> 
>> with parse or index checker? It works fine here.
> 
> try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
> with both protocol-httpclient and protocol-http.
> 

Re: Nutch not crawling URLs with spanish accented characters ( ñ)

Posted by Markus Jelsma <ma...@openindex.io>.

> Looks like you're using protocol-httpclient, try again with the
> protocol-http plugin instead. We crawled a large part of Wikipedia for
> test purposes and all global modern character sets worked just fine.
> 
> Can you fetch:
> http://es.wikipedia.org/wiki/Espa%C3%B1olas
> 
> with parse or index checker? It works fine here.

Try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
with both protocol-httpclient and protocol-http.


Re: Nutch not crawling URLs with spanish accented characters ( ñ)

Posted by Markus Jelsma <ma...@openindex.io>.

Looks like you're using protocol-httpclient; try again with the protocol-http
plugin instead. We crawled a large part of Wikipedia for test purposes, and all
global modern character sets worked just fine.

Can you fetch:
http://es.wikipedia.org/wiki/Espa%C3%B1olas

with parse or index checker? It works fine here.
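
Note that the Wikipedia test URL uses UTF-8 escapes (%C3%B1), whereas the URL in the original question uses the single-byte escape %F1. A minimal Python sketch of decoding both, for illustration only and with made-up variable names:

```python
from urllib.parse import unquote

# The Wikipedia path decodes cleanly as UTF-8 (the default for unquote).
wikipedia = unquote("Espa%C3%B1olas")

# The %F1 escape decodes to ñ only when read as ISO-8859-1.
site_latin1 = unquote("Espa%F1ol", encoding="latin-1")

# Read as UTF-8, the lone byte 0xF1 is invalid and becomes U+FFFD.
site_as_utf8 = unquote("Espa%F1ol")

print(wikipedia)     # Españolas
print(site_latin1)   # Español
print(repr(site_as_utf8))
```

A successful fetch of the Wikipedia URL therefore exercises the UTF-8 case and does not by itself confirm the %F1 case.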



RE: Nutch not crawling URLs with spanish accented characters (ñ)

Posted by "Ramanathapuram, Rajesh" <Ra...@turner.com>.

Oops! Forgot to mention, I am using Nutch 1.2.

thanks & regards,
Rajesh Ramana 
