
Posted to user@nutch.apache.org by "wuwuengr@gmail.com" <wu...@gmail.com> on 2008/10/13 11:08:42 UTC

Fetch/Dump problem: Some Chinese characters incorrect.

I obtained some Chinese language webpages via "nutch fetch". But some
Chinese characters do not come out right after I dumped the segment back to
html pages. For instance:
http://www.dianping.com/shop/501079/
has title portion:
<head><title>
��ɽ��(����)(ͼ)_�Ϻ�_���ڵ�����
</title>

However, I got this after dumping:
<head><title>
��ɽ��1��7(���ׯ1��7)(��1��7)_�Ϻ�_���ڵ�����1��7
</title>


The charset specified in the page is "UTF-8". I included the following
in "nutch-site.xml":
<name>parser.character.encoding.default</name>
  <value>UTF-8</value>

It makes no difference.

What could be the problem?



Re: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.

Posted by matinte <mi...@gmail.com>.

For example, some accented characters in Spanish are replaced by "?":

"....una opci�n muy interesante. Divertido y pr�ctico...."

I have read in the Nutch-user archives that UTF-8 encoding is enough for
that issue.
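
One way a "?" can appear in place of an accented letter is when text is written out in a charset that cannot represent that letter and the encoder substitutes a placeholder. The sketch below only illustrates that mechanism in Python; it is not Nutch code, the helper name `write_as` is hypothetical, and whether this is what happens in the setup described here is an assumption.

```python
# Illustration only: encoding text into a charset that lacks a character.
# "write_as" is a hypothetical helper, not part of Nutch.

def write_as(text: str, charset: str) -> bytes:
    """Encode text, substituting '?' for characters the charset cannot hold."""
    return text.encode(charset, errors="replace")

word = "opci\u00f3n"  # "opcion" with an acute accent on the second "o"

as_ascii = write_as(word, "ascii")  # the accented letter is lost
as_utf8 = write_as(word, "utf-8")   # the accented letter survives as two bytes

print(as_ascii)
print(as_utf8.decode("utf-8") == word)
```

If the output charset can represent the character (UTF-8 here), the text survives unchanged, so the place to check would be whichever step writes or displays the text, not only the parser default.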

Thanks again
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Fetch-Dump-problem-Some-Chinese-characters-incorrect-tp616293p1643601.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.

Posted by matinte <mi...@gmail.com>.

I am actually having the same problem with Spanish characters (with Nutch
configured for UTF-8).
Did you finally solve the problem?

Thanks
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Fetch-Dump-problem-Some-Chinese-characters-incorrect-tp616293p1643111.html

Fwd: Fetch/Dump problem: Some Chinese characters incorrect.

Posted by "wuwuengr@gmail.com" <wu...@gmail.com>.

And it gets weirder when I use "readseg -get".

The Chinese text in the "parsetext" section is all correct, while the main HTML
page is totally garbled; both differ from what I got with "readseg
-dump".

Does anybody have a clue? It seems to be a SegmentReader problem: for some
reason it may use an unreliable encoding/conversion when pulling text from segments.

By the way, all the Chinese characters are in three-byte UTF-8.
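
To make the suspected failure mode concrete, here is a small Python sketch of what happens when three-byte UTF-8 sequences are decoded with the wrong charset. It is not Nutch or SegmentReader code; the sample string, the helper name `decode_as`, and the choice of "wrong" charsets are all assumptions made for illustration.

```python
# Illustration only: decoding UTF-8 bytes with the wrong charset.
# "decode_as" is a hypothetical helper, not part of Nutch.

def decode_as(data: bytes, charset: str) -> str:
    """Decode bytes, replacing undecodable bytes with U+FFFD."""
    return data.decode(charset, errors="replace")

title = "\u4e0a\u6d77"       # two Chinese characters
raw = title.encode("utf-8")  # each character becomes three bytes

correct = decode_as(raw, "utf-8")      # right charset: text is recovered
as_latin1 = decode_as(raw, "latin-1")  # every byte becomes its own character
as_ascii = decode_as(raw, "ascii")     # every byte becomes U+FFFD

print(len(raw))                         # 6
print(correct == title)                 # True
print(len(as_latin1))                   # 6
print(as_latin1.encode("utf-8") == raw) # False: the bytes are now double-encoded
print(as_ascii.count("\ufffd"))         # 6
```

If the dump step decoded the stored bytes with a platform default charset instead of UTF-8, the output would be damaged in a similar way even though the stored content itself is intact. That would be consistent with "parsetext" looking correct while the raw page does not, but it remains a guess.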

---------- Forwarded message ----------
From: wuwuengr@gmail.com <wu...@gmail.com>
Date: 2008/10/13
Subject: Fetch/Dump problem: Some Chinese characters incorrect.
To: nutch-user@lucene.apache.org

