Posted to user@nutch.apache.org by crossafire <cr...@gmail.com> on 2007/11/08 09:09:52 UTC

How can I know the Cached Web Charset

I crawled some Chinese websites that declare GB2312 as the charset in their
HTML meta tag. Crawling and searching work fine, but when I open the cached
copy of a page, it is displayed with the wrong encoding.
So I looked at cached.jsp in my Tomcat deployment and tried editing it like this:

    if (encoding != null) {
      try {
        content = new String(bean.getContent(details), encoding);
      }
      catch (UnsupportedEncodingException e) {
        // fallback to windows-1252
        content = new String(bean.getContent(details), "windows-1252");
      }
    }
    else {
      // no encoding recorded for the page: hard-coded to GB2312
      content = new String(bean.getContent(details), "gb2312");
    }
  }

With that change the cached page displays correctly, but it only works for
pages that use GB2312, so it is not a good solution for me.
I want to get the actual encoding of the cached page,
so I tried debugging cached.jsp like this:
String encoding = (String) metaData.get("CharEncodingForConversion");
System.out.print(encoding);
This prints null for the encoding.

Metadata metaData = bean.getParseData(details).getContentMeta();
String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
System.out.print(contentType);

This only prints text/html for the content type.

I hope somebody knows how to get the encoding of the cached page.

Thanks



-- 
View this message in context: http://www.nabble.com/How-can-I-know-the-Cached-Web-Charset-tf4769632.html#a13642889
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: How can I know the Cached Web Charset

Posted by crossafire <cr...@gmail.com>.


Thank you.
But I need to know the HTML charset, because many Chinese websites use
GB2312 for their HTML pages.
I think I will try jchardet. Thank you very much.

-- 
View this message in context: http://www.nabble.com/How-can-I-know-the-Cached-Web-Charset-tf4769632.html#a13660093
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: How can I know the Cached Web Charset

Posted by Chee Wu <ch...@gmail.com>.
Many Chinese HTML pages use UTF-8, so your method might turn the summaries
of those pages into garbage in your search results, which is very
ugly...

The encoding of an HTML page is detected by HtmlParser. First, HtmlParser
tries to find the charset meta information in the page head; if this
information doesn't exist, HtmlParser uses the default encoding, which can
be set in nutch-site.xml. I suggest you don't rely on the default
encoding, and just discard the pages whose encoding can't be determined.
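
To make the lookup order above concrete, here is a minimal sketch in plain
Java. It is not HtmlParser's actual code: the class name MetaCharsetSniffer,
the 2000-byte window and the regular expression are all assumptions made for
illustration. It only shows the idea of reading the charset from the meta tag
first and using a default (or null, meaning "discard") when nothing is declared.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch.
final class MetaCharsetSniffer {
    // Only the start of the page is inspected; 2000 bytes is an arbitrary choice.
    private static final int CHUNK_SIZE = 2000;

    // Matches charset=... inside a <meta> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([a-z0-9_\\-:.]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // stays readable even when the page body is GB2312 or UTF-8.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** Uses the declared charset, else the given default (pass null to discard). */
    static String resolve(byte[] content, String defaultEncoding) {
        String declared = sniff(content);
        return declared != null ? declared : defaultEncoding;
    }
}
```

Calling resolve(bytes, null) follows the "discard" policy: a null result means
the page declared no charset and should be skipped instead of guessed at.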

You can also use "jchardet" to detect the encoding of HTML pages. If the
charset can't be determined from either the charset meta data or jchardet,
just discard the page.
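
A rough sketch of the "discard if undetermined" policy, using only the JDK.
Note that this is not jchardet and does not use its API: as a stand-in it
simply tries a list of candidate charsets with strict decoding and accepts the
first one in which the bytes are valid. The class name StrictDecoder and the
candidate order are assumptions. UTF-8 should be tried before GB2312, because
UTF-8 is the stricter format and GB2312 text is very unlikely to pass as valid
UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration; a simple stand-in for a real detector.
final class StrictDecoder {
    /** Decodes strictly; returns null when the bytes are not valid in that charset. */
    static String tryDecode(byte[] content, String charsetName) {
        if (charsetName == null || !isUsable(charsetName)) {
            return null;
        }
        try {
            return Charset.forName(charsetName)
                .newDecoder()
                // REPORT makes the decoder throw instead of silently
                // substituting replacement characters for bad input.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(content))
                .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    private static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // syntactically illegal charset name
        }
    }

    /** Returns the first candidate that decodes cleanly, or null to signal "discard". */
    static String firstValid(byte[] content, String... candidates) {
        for (String name : candidates) {
            if (tryDecode(content, name) != null) {
                return name;
            }
        }
        return null;
    }
}
```

For example, firstValid(bytes, declaredCharset, "UTF-8", "GB2312") checks the
declared charset first and returns null when nothing fits, in which case the
page can be skipped. A statistical detector such as jchardet can tell apart
encodings that strict decoding alone cannot distinguish.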

