Posted to dev@nutch.apache.org by "KuroSaka TeruHiko (JIRA)" <ji...@apache.org> on 2005/12/14 00:07:45 UTC

[jira] Created: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

non-Latin-1 characters cannot be submitted for search
-----------------------------------------------------

         Key: NUTCH-138
         URL: http://issues.apache.org/jira/browse/NUTCH-138
     Project: Nutch
        Type: Bug
  Components: web gui  
    Versions: 0.7.1    
 Environment: Windows XP, Tomcat 5.5.12
    Reporter: KuroSaka TeruHiko
    Priority: Minor


search.html currently specifies the GET method for query submission.

Tomcat 5.x only accepts ISO-8859-1 (aka Latin-1) characters submitted over GET, apparently because of some restrictions in the HTML or HTTP spec that they discovered. (If my memory is correct, non-ISO-8859-1 characters were working OK over GET with older versions of Tomcat as long as setCharacterEncoding() was called properly.)
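
To illustrate the symptom, here is a minimal sketch (not Nutch or Tomcat code) of what happens when a query that was percent-encoded as UTF-8 is decoded as ISO-8859-1 on the receiving side. The class name QueryDecodingDemo, the helper roundTrip() and the sample query are made up for this example; only java.net.URLEncoder and java.net.URLDecoder are real APIs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryDecodingDemo {

    // Percent-encode the query as UTF-8 (as a browser typically does for a
    // UTF-8 page), then decode it with the charset the server is assumed to use.
    public static String roundTrip(String query, String serverCharset) {
        try {
            String encoded = URLEncoder.encode(query, "UTF-8");
            return URLDecoder.decode(encoded, serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + serverCharset, e);
        }
    }

    public static void main(String[] args) {
        // Two Japanese characters, written as escapes to keep the source ASCII.
        String query = "\u691C\u7D22";

        String asLatin1 = roundTrip(query, "ISO-8859-1");
        String asUtf8 = roundTrip(query, "UTF-8");

        // Each UTF-8 byte becomes one Latin-1 character, so the query is garbled.
        System.out.println("ISO-8859-1 decoding preserves query: " + query.equals(asLatin1));
        System.out.println("UTF-8 decoding preserves query: " + query.equals(asUtf8));
    }
}
```

The point of the sketch is that the bytes arrive intact either way; what differs is only the charset used to turn them back into characters.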

To allow proper transmission of non-ISO-8859-1 characters, the POST method should be used.  Here's a proposed patch:
*** search-org.html	Tue Dec 13 15:02:07 2005
--- search.html	Tue Dec 13 15:02:15 2005
***************
*** 59,65 ****
  </span><span class="bodytext">
  <center>
  
! <form name="search" action="../search.jsp" method="get"> 
  <input name="query" size="44">&nbsp;<input type="submit" value="Search">
  <a href="help.html">help</a>
  
--- 59,65 ----
  </span><span class="bodytext">
  <center>
  
! <form name="search" action="../search.jsp" method="post"> 
  <input name="query" size="44">&nbsp;<input type="submit" value="Search">
  <a href="help.html">help</a>
  

BTW, I am aware that Nutch and Lucene won't handle non-Western languages well as packaged.



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

Posted by "Piotr Kosiorowski (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ] 

Piotr Kosiorowski commented on NUTCH-138:
-----------------------------------------

I am not sure, but I suspect this is a Tomcat configuration problem. To handle special characters in query URLs, one has to change the default Tomcat configuration, in particular by setting the URIEncoding attribute to UTF-8. See:

http://tomcat.apache.org/faq/connectors.html#utf8

Please check if it helps in your particular case so we can close the issue.
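
For reference, the setting described above might look like the following fragment of Tomcat's conf/server.xml. This is a minimal sketch: the port value is a placeholder, and an existing <Connector> element should keep whatever other attributes it already has. Only URIEncoding is the attribute under discussion here.

```xml
<!-- Fragment of conf/server.xml. The port is a placeholder; keep the
     attributes your existing Connector already defines. -->
<Connector port="8080" URIEncoding="UTF-8" />
```

Tomcat reads server.xml at startup, so a restart is needed for the change to take effect.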




[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

Posted by "Piotr Kosiorowski (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ] 

Piotr Kosiorowski commented on NUTCH-138:
-----------------------------------------

BTW - just create a user for yourself in the Nutch Wiki and you should be able to add a new page with the information without problems. Thanks for checking and documenting it.



[jira] Closed: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

Posted by "Piotr Kosiorowski (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-138?page=all ]
     
Piotr Kosiorowski closed NUTCH-138:
-----------------------------------

    Resolution: Invalid

Setting URIEncoding in the Tomcat config file fixes the problem.




[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

Posted by "KuroSaka TeruHiko (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361546 ] 

KuroSaka TeruHiko commented on NUTCH-138:
-----------------------------------------

You are right.  With this Tomcat config, UTF-8 characters can be passed.
Another option that works is setting:	useBodyEncodingForURI="true"
in the <Connector> tag within $TOMCAT/conf/server.xml
This is documented in:
http://issues.apache.org/bugzilla/show_bug.cgi?id=29900

What I suggest is to add this note to:
http://lucene.apache.org/nutch/i18n.html
(which currently explains the GUI localization issue only, rather than internationalization proper),
or perhaps creating a new page:
http://wiki.apache.org/nutch/GettingNutchRunningUTF8Tomcat5

I am willing to write a draft if someone tells me where to submit it.

Feel free to close this bug.




[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

Posted by "KuroSaka TeruHiko (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361552 ] 

KuroSaka TeruHiko commented on NUTCH-138:
-----------------------------------------

Sorry, my oversight: useBodyEncodingForURI did not work as I expected.  Setting URIEncoding is the only way.  I'll write this up in the Wiki.

