Posted to user@nutch.apache.org by "Ramanathapuram, Rajesh" <Ra...@turner.com> on 2011/10/03 23:27:14 UTC
Nutch not crawling URLs with spanish accented characters (ñ)
Hi,
I am trying to crawl a website which has link(s) with Spanish/Latin characters in the URL filename. I can't get Nutch to crawl the page(s) with Spanish accented characters in the URL.
Link: http://mydomain.com/en Español.aspx or http://mydomain.com/en%20Español.aspx
I tried substituting the URL-encoded form (%F1) for the special character (ñ), and %20 for the space (" "); the whole list of encodings is here <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the browser
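For reference, %F1 is the percent-encoding of ñ only when the character is first converted to bytes as ISO-8859-1; under UTF-8 the same character becomes %C3%B1. A minimal Java sketch of the difference follows. The class and method names are made up for illustration and are not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PercentEncodingDemo {
    // Percent-encode every byte that is not an RFC 3986 "unreserved" character
    // (letters, digits, '-', '.', '_', '~'), using the given charset for the bytes.
    static String percentEncode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00F1 is LATIN SMALL LETTER N WITH TILDE; written as an escape so the
        // example does not depend on the encoding of this source file.
        String name = "en Espa\u00F1ol.aspx";
        System.out.println(percentEncode(name, StandardCharsets.ISO_8859_1)); // en%20Espa%F1ol.aspx
        System.out.println(percentEncode(name, StandardCharsets.UTF_8));      // en%20Espa%C3%B1ol.aspx
    }
}
```

Which of the two forms a given server accepts depends on how that server decodes request paths, so it is worth checking both against the site being crawled.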
I tried to use the regex URL normalizer to do the substitution in the regex-normalize.xml file, as below (%20 for " " and %F1 for the special character ñ).
<!-- replaces blank space(" ") in URL with escaped "%20" -->
<regex>
<pattern> </pattern>
<substitution>%20</substitution>
</regex>
<!-- replaces accented char("ñ") in URL with escaped "%F1" -->
<regex>
<pattern>ñ</pattern>
<substitution>%F1</substitution>
</regex>
The former (blank space) substitution works fine, but I am having trouble with the latter (ñ) substitution: I get a FATAL error (pointing to the location of ñ in the file) at the command prompt and the error below in my hadoop log.
ERROR regex.RegexURLNormalizer - error parsing conf file: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
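The "Invalid byte 2 of 4-byte UTF-8 sequence" message is consistent with regex-normalize.xml having been saved in a single-byte encoding such as ISO-8859-1 while the XML parser reads it as UTF-8: the byte 0xF1 looks like the lead byte of a 4-byte UTF-8 sequence, and the "<" that follows is not a valid continuation byte. That the file was saved this way is an assumption, not something confirmed in this thread. The sketch below reproduces the situation with the JDK's built-in XML parser rather than Nutch's own config loader; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

final class XmlEncodingDemo {
    // Returns the text of the root element, or null if the bytes cannot be parsed.
    static String rootText(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement().getTextContent();
        } catch (IOException | SAXException | ParserConfigurationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String element = "<pattern>\u00F1</pattern>"; // U+00F1 = n with tilde

        // Saved as ISO-8859-1 with no encoding declaration: read as UTF-8, fails.
        System.out.println(rootText(element.getBytes(StandardCharsets.ISO_8859_1)) == null);

        // Saved as UTF-8: parses.
        System.out.println(rootText(element.getBytes(StandardCharsets.UTF_8)) != null);

        // ISO-8859-1 bytes plus a matching encoding declaration: parses.
        String declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + element;
        System.out.println(rootText(declared.getBytes(StandardCharsets.ISO_8859_1)) != null);

        // A numeric character reference is pure ASCII, so the file encoding no longer matters.
        String reference = "<pattern>&#xF1;</pattern>";
        System.out.println("\u00F1".equals(rootText(reference.getBytes(StandardCharsets.US_ASCII))));
    }
}
```

If the assumption holds, saving the file as UTF-8, adding a matching encoding declaration, or writing the character as &#xF1; inside the pattern element should each get past the parse error.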
Then I tried changing the character encoding in the nutch-site.xml file:
<property>
<name>parser.character.encoding.default</name>
<value>ISO-8859-1</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
And I changed the pattern in the regex-normalize.xml file as below:
<regex>
<pattern>U+00F1</pattern>
<substitution>%F1</substitution>
</regex>
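A note on the pattern above: if the normalizer passes it to java.util.regex (an assumption about the plugin, not verified here), then "U+00F1" does not mean the code point ñ. It is read as "one or more U characters followed by the literal text 00F1", so it would never match the URL. The java.util.regex escape for a code point is a backslash, a lowercase u and four hex digits. A small sketch, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

final class RegexEscapeDemo {
    // Apply one pattern/substitution pair to a URL, in the style of a regex normalizer rule.
    static String normalize(String url, String pattern, String substitution) {
        return Pattern.compile(pattern).matcher(url).replaceAll(substitution);
    }

    public static void main(String[] args) {
        String url = "http://mydomain.com/en%20Espa\u00F1ol.aspx";

        // "U+00F1" is a quantified 'U' plus the literal "00F1": no match, URL unchanged.
        System.out.println(normalize(url, "U+00F1", "%F1").equals(url));

        // The same pattern does match text that literally contains U's and "00F1".
        System.out.println(normalize("UUU00F1", "U+00F1", "%F1")); // %F1

        // The regex escape for the code point (the Java literal "\\u00F1" yields
        // a backslash, 'u' and four hex digits in the pattern) matches the character.
        System.out.println(normalize(url, "\\u00F1", "%F1")); // http://mydomain.com/en%20Espa%F1ol.aspx
    }
}
```

In an XML config file the pattern text would be written with a single backslash, since the doubling above is only Java string-literal escaping.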
Now I don't get any error at the command prompt, but I do get the error below in my hadoop log. It looks like the substitution is happening, but instead of "%F1" it uses "?".
ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of http://mydomain.com/en%20Espa?ol.aspx failed with: java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
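The URI in the log has a literal "?" where ñ should be. In a URI, "?" starts the query component, so that string no longer names the same path. One way to get a fully escaped, ASCII-only URL from its parts is java.net.URI, sketched below. The class and method names are hypothetical and this is not how Nutch itself builds URLs. Note that toASCIIString() always escapes non-ASCII characters as UTF-8, which gives %C3%B1 rather than the %F1 form reported above as working in the browser.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriEscapeDemo {
    // Build a URI from unescaped parts and return it with all illegal and
    // non-ASCII characters percent-encoded (non-ASCII as UTF-8).
    static String toAsciiUrl(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // U+00F1 = n with tilde
        System.out.println(toAsciiUrl("http", "mydomain.com", "/en Espa\u00F1ol.aspx"));
        // http://mydomain.com/en%20Espa%C3%B1ol.aspx

        // The URL from the log: '?' splits it into a path and a query.
        URI logged = URI.create("http://mydomain.com/en%20Espa?ol.aspx");
        System.out.println(logged.getRawPath());  // /en%20Espa
        System.out.println(logged.getRawQuery()); // ol.aspx
    }
}
```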
Can anyone help me with this issue? Are there any other config changes I need to make to get this to work?
Thanks in advance, any help in resolving this issue is much appreciated.
thanks & regards,
Rajesh Ramana
Re: Nutch not crawling URLs with spanish accented characters ( ñ)
Posted by "Ramanathapuram, Rajesh" <Ra...@turner.com>.
Thanks Markus, I'll try it and let you know in the morning.
Rajesh Ramana
On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <ma...@openindex.io> wrote:
>
>> Looks like you're using protocol-httpclient; try again with the
>> protocol-http plugin instead. We crawled a large part of Wikipedia for
>> test purposes and all global modern character sets worked just fine.
>>
>> Can you fetch:
>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>>
>> with parse or index checker? It works fine here.
>
> try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
> with both protocol-httpclient and protocol-http.
Re: Nutch not crawling URLs with spanish accented characters ( ñ)
Posted by Markus Jelsma <ma...@openindex.io>.
> Looks like you're using protocol-httpclient; try again with the
> protocol-http plugin instead. We crawled a large part of Wikipedia for
> test purposes and all global modern character sets worked just fine.
>
> Can you fetch:
> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>
> with parse or index checker? It works fine here.
Try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
with both protocol-httpclient and protocol-http.
Re: Nutch not crawling URLs with spanish accented characters ( ñ)
Posted by Markus Jelsma <ma...@openindex.io>.
Looks like you're using protocol-httpclient; try again with the protocol-http
plugin instead. We crawled a large part of Wikipedia for test purposes and all
global modern character sets worked just fine.
Can you fetch:
http://es.wikipedia.org/wiki/Espa%C3%B1olas
with parse or index checker? It works fine here.
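Note that the Wikipedia URL above carries ñ as %C3%B1, which is its UTF-8 percent-encoding, whereas the %F1 in the original question is the ISO-8859-1 form. A short Java sketch of how the same escapes decode under each charset (the class and method names are hypothetical):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PercentDecodingDemo {
    // Decode percent-escapes, interpreting the resulting bytes in the named charset.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "Espa\u00F1olas"; // U+00F1 = n with tilde

        // %C3%B1 is n-with-tilde when the bytes are read as UTF-8.
        System.out.println(expected.equals(decode("Espa%C3%B1olas", "UTF-8")));      // true

        // %F1 is n-with-tilde when the byte is read as ISO-8859-1.
        System.out.println(expected.equals(decode("Espa%F1olas", "ISO-8859-1")));    // true

        // Reading the lone byte 0xF1 as UTF-8 does not give n-with-tilde.
        System.out.println(expected.equals(decode("Espa%F1olas", "UTF-8")));         // false
    }
}
```

So a test against the Wikipedia URL exercises the UTF-8 form only; the site in the original question may still need the ISO-8859-1 form checked separately.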
RE: Nutch not crawling URLs with spanish accented characters (ñ)
Posted by "Ramanathapuram, Rajesh" <Ra...@turner.com>.
Oops! Forgot to mention, I am using Nutch 1.2.
thanks & regards,
Rajesh Ramana