Posted to user@nutch.apache.org by kauu <ba...@gmail.com> on 2006/03/25 03:09:17 UTC

what is it? need help

hi all
  I have another problem now. After crawling and starting up Tomcat (I've
changed nutch-site.xml), I searched for something and got some garbled
results, which look like this:


*Ñӱߴóѧ±¾¿ÆÉúÕÐÉúÐÅÏ¢Íø-- <http://zsb.ybu.edu.cn/search.php>*
Ñӱߴóѧ±¾¿ÆÉúÕÐÉúÐÅÏ¢Íø--    Ñӱߴóѧ±¾¿ÆÉúÕÐÉúÐÅÏ¢Íø Ìáʾ ÇëÊäÈëËÑË÷¹Ø¼ü×Ö
µã»÷´Ë´¦·µ»ØÉÏÒ»Ò³ ´¦³¤* ... *
http://zsb.ybu.edu.cn/search.php
(cached<http://localhost:8080/cached.jsp?idx=0&id=28>)
(explain <http://localhost:8080/explain.jsp?idx=0&id=28&query=search>) (
anchors <http://localhost:8080/anchors.jsp?idx=0&id=28>)

#######         and the garbled results should be CHINESE.      ########
My OS is WinXP (SP2).
My browser is Firefox (I get the same result in IE).

Everything works well except this.
Can anyone help me? Any reply will be appreciated!

--
www.babatu.com

Re: what is it? need help

Posted by Chun Wei Ho <cw...@gmail.com>.
I couldn't access the site you crawled to check, but it seems to me
that Nutch couldn't determine the correct encoding/charset of the page.

Nutch looks for the encoding in the Content-Type HTTP header and in a
meta content-type tag in the HEAD section of the page. If the
webserver/page provides neither, I think it falls back to
parser.character.encoding.default, which is usually the wrong one for
Chinese pages.
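
To illustrate the fallback described above, here is a minimal Java sketch (it is not Nutch code). It assumes the page bytes are GBK and that the fallback charset is windows-1252, which I believe is the value shipped in nutch-default.xml, though that is worth checking in your own copy. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;

class CharsetMismatchDemo {
    // Decode raw page bytes with the named charset, the way a parser
    // would after it has settled on an encoding.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Undo a wrong decode: turn the garbled text back into the original
    // bytes, then decode those bytes with the charset the page really used.
    static String repair(String garbled, String wrongCharset, String rightCharset) {
        byte[] raw = garbled.getBytes(Charset.forName(wrongCharset));
        return new String(raw, Charset.forName(rightCharset));
    }

    public static void main(String[] args) {
        // Part of the page title from the search result (assumed sample input).
        String title = "\u5EF6\u8FB9\u5927\u5B66";
        // Bytes as a server would send them for a GBK-encoded page (assumption).
        byte[] raw = title.getBytes(Charset.forName("GBK"));

        String garbled = decode(raw, "windows-1252");
        System.out.println("decoded as windows-1252: " + garbled);
        System.out.println("decoded as GBK:          " + decode(raw, "GBK"));
        System.out.println("repaired:                " + repair(garbled, "windows-1252", "GBK"));
    }
}
```

If this is what is happening, overriding parser.character.encoding.default in nutch-site.xml with the encoding your pages actually use and then re-parsing the crawl may be enough, but treat that as something to try rather than a confirmed fix.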

As for the characters turning right again in your email, I guess that
when you got them from Nutch they were encoded in Java's Unicode instead
of GB/UTF-8 (which means they show up as a shorter squiggle, as you
observed), but after you pasted them into an email, the email was sent as
Unicode, which turned them back into normal characters on receipt.



Re: what is it? need help

Posted by kauu <ba...@gmail.com>.
What's going on?
 After sending my mail I see that the garbled characters have turned back
to normal. Why? Can anyone tell me something about it?
 Another thing: after I enter some CHINESE into the query box, it turns
into garbled characters when I click the search button. Why?
 Any reply will be appreciated!

On 3/25/06, kauu <ba...@gmail.com> wrote:
>
> hi all
>   I have another problem now. After crawling and starting up Tomcat (I've
> changed nutch-site.xml), I searched for something and got some garbled
> results, which look like this:
>
>
> * 延边大学本科生招生信息网-- <http://zsb.ybu.edu.cn/search.php>*
> 延边大学本科生招生信息网--    延边大学本科生招生信息网 提示 请输入搜索关键字 点击此处返回上一页 处长* ... *
> http://zsb.ybu.edu.cn/search.php (cached<http://localhost:8080/cached.jsp?idx=0&id=28>)
> (explain <http://localhost:8080/explain.jsp?idx=0&id=28&query=search>) (
> anchors <http://localhost:8080/anchors.jsp?idx=0&id=28>)
>
> #######         and the garbled results should be CHINESE.      ########
> My OS is WinXP (SP2).
> My browser is Firefox (I get the same result in IE).
>
> Everything works well except this.
> Can anyone help me? Any reply will be appreciated!
>
> --
> www.babatu.com
>



--
www.babatu.com