Posted to user@nutch.apache.org by kauu <ba...@gmail.com> on 2006/03/25 03:09:17 UTC
what is it? need help
Hi all,
I have another problem now. After crawling and starting Tomcat (I've
changed nutch-site.xml), I searched for something and got garbled
results that look like this:
*Ñӱߴóѧ±¾¿ÆÉúÕÐÉúÐÅÏ¢Íø-- <http://zsb.ybu.edu.cn/search.php>*
Ñӱߴóѧ±¾¿ÆÉúÕÐÉúÐÅÏ¢Íø-- Ñӱߴóѧ±¾¿ÆÉúÕÐÉúÐÅÏ¢Íø Ìáʾ ÇëÊäÈëËÑË÷¹Ø¼ü×Ö
µã»÷´Ë´¦·µ»ØÉÏÒ»Ò³ ´¦³¤* ... *
http://zsb.ybu.edu.cn/search.php
(cached<http://localhost:8080/cached.jsp?idx=0&id=28>)
(explain <http://localhost:8080/explain.jsp?idx=0&id=28&query=search>) (
anchors <http://localhost:8080/anchors.jsp?idx=0&id=28>)
####### The garbled results should be CHINESE. ########
My OS is WinXP (SP2).
My browser is Firefox (I get the same result in IE).
Everything works well except this.
Can anyone help me? Any reply will be appreciated!
--
www.babatu.com
Re: what is it? need help
Posted by Chun Wei Ho <cw...@gmail.com>.
I couldn't access the site you crawled to check, but it seems to me
that Nutch couldn't determine the correct encoding/charset of the page.
Nutch looks for the encoding in the Content-Type header and in a
meta Content-Type tag in the HEAD section of the page. If the
web server/page provides neither, I think it falls back to
parser.character.encoding.default, which is usually the wrong one for
Chinese pages.
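
The lookup order described above can be sketched roughly as follows. This
is only an illustration in Python, not Nutch's actual code: pick_encoding,
DEFAULT_ENCODING and the regular expressions are hypothetical names, and
windows-1252 is assumed as the fallback value, so check
parser.character.encoding.default in your own nutch-default.xml.

```python
import re

# Hypothetical stand-in for parser.character.encoding.default.
# windows-1252 is an assumption; check nutch-default.xml for the real value.
DEFAULT_ENCODING = "windows-1252"

# Matches "charset=..." in a Content-Type header value.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Matches a charset declared inside a <meta ...> tag in the raw page bytes.
_META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def pick_encoding(content_type_header, html_bytes):
    """Return (encoding, source): header first, then meta tag, then default."""
    if content_type_header:
        match = _CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1).lower(), "header"
    # Only scan the start of the page, where the HEAD section lives.
    match = _META_RE.search(html_bytes[:2048])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return DEFAULT_ENCODING, "default"


if __name__ == "__main__":
    with_meta = (b'<html><head><meta http-equiv="Content-Type" '
                 b'content="text/html; charset=gb2312"></head></html>')
    no_meta = b"<html><head></head></html>"
    print(pick_encoding("text/html; charset=GBK", no_meta))
    print(pick_encoding("text/html", with_meta))
    print(pick_encoding("text/html", no_meta))
```

If a Chinese page ends up in the last case, a single-byte Western default
is applied to multi-byte text, which would explain results like the ones
in the original post.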
As for the characters turning right again in your email, I guess that
when you got them from Nutch they were encoded in Java's Unicode instead
of GB/UTF-8 (which means they show up as shorter squiggles, as you
observed), but after you pasted them into an email, the email was sent
as Unicode, which turns them back into normal characters on receipt.
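
One way to see how the same text can look garbled in one place and normal
in another is the small Python sketch below. It assumes the page bytes were
GBK and were decoded as windows-1252; both are guesses based on the sample
in the original post, not something confirmed in this thread. The function
names are made up for the illustration.

```python
def misdecode(text, real_encoding, assumed_encoding):
    """Encode text with its real charset, then decode with the wrong one."""
    return text.encode(real_encoding).decode(assumed_encoding)


def repair(garbled, assumed_encoding, real_encoding):
    """Undo the bad decode: recover the raw bytes, then decode them correctly."""
    return garbled.encode(assumed_encoding).decode(real_encoding)


if __name__ == "__main__":
    # First four characters of the page title from the search result.
    title = "\u5ef6\u8fb9\u5927\u5b66"
    garbled = misdecode(title, "gbk", "windows-1252")
    # ascii() keeps the output printable on consoles that are not UTF-8.
    print(ascii(garbled))
    # The bad decode loses no bytes here, so it can be reversed.
    print(repair(garbled, "windows-1252", "gbk") == title)
```

Because every byte survives the wrong decode in this case, any program
along the way that reinterprets the bytes with the right charset will show
normal Chinese again.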
On 3/25/06, kauu <ba...@gmail.com> wrote:
> What's going on?
> After sending my mail, I see that the garbled characters have turned
> normal. Why? Can anyone tell me something about it?
> Another thing: after I enter some CHINESE into the query box, it turns
> into garbled characters when I click the query button. Why?
> Any reply will be appreciated!
Re: what is it? need help
Posted by kauu <ba...@gmail.com>.
What's going on?
After sending my mail, I see that the garbled characters have turned
normal. Why? Can anyone tell me something about it?
Another thing: after I enter some CHINESE into the query box, it turns
into garbled characters when I click the query button. Why?
Any reply will be appreciated!
On 3/25/06, kauu <ba...@gmail.com> wrote:
>
> Hi all,
> I have another problem now. After crawling and starting Tomcat (I've
> changed nutch-site.xml), I searched for something and got garbled
> results that look like this:
>
>
> * 延边大学本科生招生信息网-- <http://zsb.ybu.edu.cn/search.php>*
> 延边大学本科生招生信息网-- 延边大学本科生招生信息网 提示 请输入搜索关键字 点击此处返回上一页 处长* ... *
> http://zsb.ybu.edu.cn/search.php (cached<http://localhost:8080/cached.jsp?idx=0&id=28>)
> (explain <http://localhost:8080/explain.jsp?idx=0&id=28&query=search>) (
> anchors <http://localhost:8080/anchors.jsp?idx=0&id=28>)
>
> ####### The garbled results should be CHINESE. ########
> My OS is WinXP (SP2).
> My browser is Firefox (I get the same result in IE).
>
> Everything works well except this.
> Can anyone help me? Any reply will be appreciated!
>
> --
> www.babatu.com
>
--
www.babatu.com