Posted to dev@nutch.apache.org by "crossany (JIRA)" <ji...@apache.org> on 2007/08/09 09:13:43 UTC

[jira] Created: (NUTCH-540) some problem about the Nutch cache

some problem about the Nutch cache
----------------------------------

                 Key: NUTCH-540
                 URL: https://issues.apache.org/jira/browse/NUTCH-540
             Project: Nutch
          Issue Type: Bug
          Components: searcher
    Affects Versions: 0.9.0
         Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
            Reporter: crossany
            Priority: Blocker
             Fix For: 0.9.0


I am Chinese.
I tested searching for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on Linux, with the Tomcat charset set to UTF-8, and used Nutch to crawl a Chinese website whose charset is also UTF-8. When I search for a Chinese word through Nutch on Tomcat, the title and description in the search results display correctly. But when I click the cached link, the cached page is displayed with garbled characters, even though the cached page's charset is also UTF-8. I found another site that uses Nutch, http://www.synoo.com:8080/zh/, and searching for a Chinese word there shows the same error.
When I view the segments with Luke, the Chinese words display correctly, so I think this may be a bug.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-540) some problem about the Nutch cache

Posted by "crossany (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

crossany updated NUTCH-540:
---------------------------

    Attachment: 1.gif



[jira] Updated: (NUTCH-540) some problem about the Nutch cache

Posted by "crossany (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

crossany updated NUTCH-540:
---------------------------

    Attachment:     (was: 1186733525.jpg)



[jira] Updated: (NUTCH-540) some problem about the Nutch cache

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-540:
------------------------------------

    Fix Version/s:     (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9



[jira] Updated: (NUTCH-540) some problem about the Nutch cache

Posted by "crossany (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

crossany updated NUTCH-540:
---------------------------

    Attachment: 1186733525.jpg



[jira] Commented: (NUTCH-540) some problem about the Nutch cache

Posted by "david euler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542153 ] 

david euler commented on NUTCH-540:
-----------------------------------

Hello crossany, I ran into this problem too, and finally fixed it by replacing the following line in cached.jsp:

content = new String(bean.getContent(details));

with:
content = new String(bean.getContent(details), "UTF-8");

The error is caused by new String(byte[]): when a String is constructed from a byte array without specifying a charset, the platform's default charset is used. On Windows XP (Chinese Edition), that default is GBK.

Hope it helps. See the JDK reference:
java.lang.String.String(byte[] bytes)

Constructs a new String by decoding the specified array of bytes using the 
 platform's default charset. The length of the new String is a function of the 
 charset, and hence may not be equal to the length of the byte array. 
The behavior of this constructor when the given bytes are not valid in the 
 default charset is unspecified. The java.nio.charset.CharsetDecoder class 
 should be used when more control over the decoding process is required. 
Parameters:
	bytes the bytes to be decoded into characters
Since:
	JDK1.1
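
The effect described above can be reproduced outside Nutch. The following is a minimal sketch, not the cached.jsp code: the class and method names are hypothetical, it assumes Java 7 or later (for StandardCharsets), and ISO-8859-1 stands in for a mismatched platform default because GBK is not guaranteed to be installed on every runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch.
class CharsetDecodeDemo {

    /** Decodes bytes with an explicitly named charset instead of the platform default. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        String correct = decode(utf8Bytes, StandardCharsets.UTF_8);

        // Decoding with a different charset (a stand-in for a mismatched default) garbles it.
        String garbled = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("round trip intact: " + correct.equals(original));
        System.out.println("mismatch intact:   " + garbled.equals(original));
        System.out.println("platform default:  " + Charset.defaultCharset());
    }
}
```

Printing Charset.defaultCharset() on the server that runs Tomcat shows which charset new String(byte[]) would silently pick there.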



[jira] Commented: (NUTCH-540) some problem about the Nutch cache

Posted by "crossany (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521326 ] 

crossany commented on NUTCH-540:
--------------------------------

I just tested Nutch 0.7.2 + Tomcat + Linux. I searched for a Chinese word: the search results were good, and when I clicked the cached link, the cached page was also good. I think something may be wrong with Nutch's JSP pages.




[jira] Commented: (NUTCH-540) some problem about the Nutch cache

Posted by "david euler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542154 ] 

david euler commented on NUTCH-540:
-----------------------------------

Hi Renaud Richardet, when Nutch gets a null encoding from the metadata:

String encoding = (String) metaData.get("CharEncodingForConversion");

it constructs the content String from the bytes using the platform default charset. When the server's default charset differs from the cached page's charset, wrongly decoded characters are displayed. In fact, in most cases the correct charset of a web page can be found in its meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

But I don't know why the encoding cannot be determined for some pages even when the meta information exists.
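
One way to avoid falling back to the platform default is sketched below. This is an illustration only, not how Nutch resolves encodings: the class, its methods, the 2048-byte sniffing window, and the UTF-8 fallback are all assumptions, and the regular expression matches any "charset=" in the page head rather than parsing the meta tag properly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Nutch.
class CachedContentDecoder {

    // Simplified: finds "charset=NAME" anywhere in the sniffed prefix.
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Picks a charset: declared name first, then the page's own meta tag, then UTF-8. */
    public static Charset resolve(String declared, byte[] content) {
        Charset cs = lookup(declared);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives for sniffing.
        int n = Math.min(content.length, 2048);
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            cs = lookup(m.group(1));
            if (cs != null) {
                return cs;
            }
        }
        // Assumed fallback; never the platform default.
        return StandardCharsets.UTF_8;
    }

    /** Returns the named charset, or null if the name is null, illegal, or unsupported. */
    private static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException for malformed names.
            return null;
        }
    }

    /** Decodes the page bytes with the resolved charset. */
    public static String decode(String declared, byte[] content) {
        return new String(content, resolve(declared, content));
    }
}
```

With this sketch, a null value for the encoding would lead to the meta tag being consulted before any default is applied, for example decode(null, pageBytes).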



[jira] Commented: (NUTCH-540) some problem about the Nutch cache

Posted by "crossany (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518925 ] 

crossany commented on NUTCH-540:
--------------------------------

I just used Luke to view the segments, and they display correctly. I can't find any log entry about this error: there is nothing in Tomcat's catalina.out, and crawl.log looks fine.
I think something is wrong with search.jsp.



[jira] Updated: (NUTCH-540) some problem about the Nutch cache

Posted by "crossany (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

crossany updated NUTCH-540:
---------------------------

    Attachment: 1186733525.jpg

This image shows a search for a Chinese word, before and after.
I think you can see what I mean.



[jira] Updated: (NUTCH-540) some problem about the Nutch cache

Posted by "Renaud Richardet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-540:
-----------------------------------

    Priority: Major  (was: Blocker)

Could you please attach log files and error messages? Thanks.
