Posted to user@nutch.apache.org by "wuwuengr@gmail.com" <wu...@gmail.com> on 2008/10/13 11:08:42 UTC
Fetch/Dump problem: Some Chinese characters incorrect.
I obtained some Chinese language webpages via "nutch fetch". But some
Chinese characters do not come out right after I dumped the segment back to
html pages. For instance:
http://www.dianping.com/shop/501079/
has title portion:
<head><title>
��ɽ��(����)(ͼ)_�Ϻ�_���ڵ�����
</title>
However, I got this after dumping:
<head><title>
��ɽ��1��7(���ׯ1��7)(��1��7)_�Ϻ�_���ڵ�����1��7
</title>
The charset specified in the page is "UTF-8". I also included the following
in "nutch-site.xml":
<name>parser.character.encoding.default</name>
<value>UTF-8</value>
It makes no difference.
What could be the problem?
Re: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
Posted by matinte <mi...@gmail.com>.
For example, some accented characters in Spanish are replaced with "?":
"....una opci�n muy interesante. Divertido y pr�ctico...."
I have read in the Nutch-user archives that UTF-8 encoding is enough for
that issue.
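For reference, here is a minimal Python sketch (not Nutch code; the function names and the sample word are made up for illustration) of two common ways an accented character can end up as a replacement marker. It assumes, without knowing the actual cause in this case, that either the text is written out in a charset that lacks the character, or that single-byte ISO-8859-1 data is decoded as UTF-8.

```python
def encode_lossy(text: str, charset: str = "ascii") -> str:
    """Write text in a charset that cannot hold every character.

    Unencodable characters are replaced by '?'.
    """
    return text.encode(charset, errors="replace").decode(charset)


def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# "opcion" with an accented o, written with an escape so the source stays ASCII.
word = "opci\u00f3n"

# Case 1: the output charset cannot represent the accented letter.
print(encode_lossy(word))

# Case 2: ISO-8859-1 bytes are mistaken for UTF-8.
print(decode_as_utf8(word.encode("latin-1")))
```

If either pattern matches what you see, the declared encoding and the actual bytes probably disagree somewhere between fetch and output.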
Thanks again
--
View this message in context: http://lucene.472066.n3.nabble.com/Fetch-Dump-problem-Some-Chinese-characters-incorrect-tp616293p1643601.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
Posted by matinte <mi...@gmail.com>.
I am actually having the same problem with Spanish characters (with Nutch
configured as UTF-8).
Did you finally solve the problem?
Thanks
--
View this message in context: http://lucene.472066.n3.nabble.com/Fetch-Dump-problem-Some-Chinese-characters-incorrect-tp616293p1643111.html
Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
Posted by "wuwuengr@gmail.com" <wu...@gmail.com>.
And it gets weirder when I use "readseg -get".
The Chinese text in the "parsetext" section is all correct, while the main html
page is totally messed up, and both differ from what I got with "readseg
-dump".
Does anybody have a clue? It seems to be a SegmentReader problem, which for some
reason uses an unreliable encoding/conversion when pulling text from segments?
By the way, all the Chinese characters are in three-byte UTF-8.
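To illustrate the kind of failure suspected here, the following is a rough Python sketch (not Nutch code; the function names and the sample string are hypothetical). It shows that a CJK character takes three bytes in UTF-8 and that decoding those bytes with a different charset garbles the text even though the bytes are intact. ISO-8859-1 is used only as an assumed example of a mismatched charset; which charset is actually involved in the dump is unknown.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def misdecode(text: str, wrong_charset: str = "latin-1") -> str:
    """Encode as UTF-8, then decode with the wrong charset (mojibake)."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(garbled: str, wrong_charset: str = "latin-1") -> str:
    """Undo misdecode() by reversing the wrong decoding step."""
    return garbled.encode(wrong_charset).decode("utf-8")


# Two CJK characters (U+4E0A, U+6D77), written as escapes to keep the source ASCII.
sample = "\u4e0a\u6d77"

print([utf8_len(ch) for ch in sample])   # bytes per character
garbled = misdecode(sample)
print(len(garbled))                      # one garbled character per byte
print(repair(garbled) == sample)         # the bytes themselves were not lost
```

If the dumped output can be repaired this way, the segment data is probably fine and only the final decoding or writing step uses the wrong charset.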
---------- Forwarded message ----------
From: wuwuengr@gmail.com <wu...@gmail.com>
Date: 2008/10/13
Subject: Fetch/Dump problem: Some Chinese characters incorrect.
To: nutch-user@lucene.apache.org
I obtained some Chinese language webpages via "nutch fetch". But some
Chinese characters do not come out right after I dumped the segment back to
html pages. For instance:
http://www.dianping.com/shop/501079/
has title portion:
<head><title>
��ɽ��(����)(ͼ)_�Ϻ�_���ڵ�����
</title>
However, I got this after dumping:
<head><title>
��ɽ��1��7(���ׯ1��7)(��1��7)_�Ϻ�_���ڵ�����1��7
</title>
The charset specified in the page is "UTF-8". I also included the following
in "nutch-site.xml":
<name>parser.character.encoding.default</name>
<value>UTF-8</value>
It makes no difference.
What could be the problem?