Posted to java-user@lucene.apache.org by Jingkang Zhang <zj...@yahoo.com.cn> on 2005/02/18 07:12:40 UTC

The problem of using Cyber Neko HTML Parser parse HTML files

When I use the Cyber Neko HTML Parser to parse HTML
files (created by Microsoft Word) that contain HTML
built-in entity references (for example, &nbsp;), the
node value may contain an unknown character.

For example, the source HTML:
<DIV>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt
18pt"><SPAN lang=EN-US style="mso-bidi-font-size:
10.5pt"><FONT face="Times New Roman"><FONT
size=3>-rw-r--r--<SPAN style="mso-spacerun:
yes">&nbsp;&nbsp;&nbsp; </SPAN>1 root<SPAN
style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;
</SPAN>root<SPAN style="mso-spacerun:
yes">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</SPAN>50 Jan 21 16:12
_1e.f6<o:p></o:p></FONT></FONT></SPAN></P>
</DIV>

Text after parsing the HTML:
-rw-r--r--��?1 root���� root���������� 50 Jan 21 16:12
_1e.f6

How can I avoid it?


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Re: The problem of using Cyber Neko HTML Parser parse HTML files

Posted by Jingkang Zhang <zj...@yahoo.com.cn>.
Thank you. But how can I view the correct output? And
if my HTML files use different encodings (such as
UTF-8, ISO-8859-1, GBK, JIS, etc.), how should I
handle them?
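
One way to approach the encoding question, sketched under assumptions: decode each file's bytes with the charset it was actually written in, then write the result out in a single known charset such as UTF-8. The names below (CharsetSketch, decode, toUtf8) are hypothetical and use only the JDK. The sketch does not detect a file's charset; that has to come from elsewhere, for example the file's meta tag or prior knowledge.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of NekoHTML or Lucene.
final class CharsetSketch {
    private CharsetSketch() {}

    /** Decodes raw bytes using the named charset (e.g. "UTF-8", "ISO-8859-1"). */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Encodes text as UTF-8 so the output has one well-defined encoding. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The single byte 0xA0 is a non-breaking space in ISO-8859-1.
        byte[] latin1 = { (byte) 0xA0 };
        String text = decode(latin1, "ISO-8859-1");
        // Print the code point rather than the character itself.
        System.out.println(Integer.toHexString(text.charAt(0)));
        // The same character takes two bytes (0xC2 0xA0) in UTF-8.
        System.out.println(toUtf8(text).length);
    }
}
```

Viewing text in a terminal or editor that assumes a different charset than the one used to write it is one way garbled characters like those above can appear.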



--- Jason Polites <ja...@tpg.com.au> wrote:
> This is not an unknown character; it is a
> non-breaking space (Unicode value 0x00A0).


Re: The problem of using Cyber Neko HTML Parser parse HTML files

Posted by Jason Polites <ja...@tpg.com.au>.
This is not an unknown character; it is a non-breaking space (Unicode value 
0x00A0).
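
If the non-breaking spaces are not wanted in the extracted text, one option is to replace them with ordinary spaces after parsing. This is a minimal sketch using only the JDK; the class and method names (NbspNormalizer, normalize) are hypothetical and not part of NekoHTML or Lucene.

```java
// Hypothetical helper for illustration; not part of NekoHTML or Lucene.
final class NbspNormalizer {
    private NbspNormalizer() {}

    /** Replaces every non-breaking space (U+00A0) with an ordinary space (U+0020). */
    static String normalize(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // A node value similar to the one in the original post.
        String nodeValue = "-rw-r--r--\u00A0\u00A0\u00A0 1 root";
        System.out.println(normalize(nodeValue));
    }
}
```

The replacement would be applied to each node value before the text is displayed or indexed.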

