Posted to user@nutch.apache.org by Ake Tangkananond <ia...@gmail.com> on 2012/08/09 16:05:28 UTC
Nutch 2 encoding
Hi all,
I just wonder if Nutch 2 is working fine with non-English characters in your
deployment? Thai used to work fine for me in Nutch 1.5 but does not in
Nutch 2. Did I miss something? Is there anything I should check?
Sorry for the silly questions, and thank you in advance. ;-)
Regards,
Ake Tangkananond
Re: Nutch 2 encoding
Posted by al...@aim.com.
Hi,
I use hbase-0.92.1 and do not have any problems with UTF-8 chars. What exactly is your problem?
Alex.
Re: Nutch 2 encoding
Posted by Ake Tangkananond <ia...@gmail.com>.
Hi,
I'm debugging.
I inserted some code to print out the encoding in the getParse function of
HtmlParser.java, and it printed utf-8. So I think it might be a data store
problem. What else could be the cause? Could you advise what I should try
next to have my Thai chars stored correctly in HBase? Can I simply go with
the latest version of HBase? (I'm not sure if it is compatible with
Nutch 2.0.)
byte[] contentInOctets = page.getContent().array();
InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
EncodingDetector detector = new EncodingDetector(conf);
detector.autoDetectClues(page, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(page, defaultCharEncoding);
metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
LOG.info("encoding : " + encoding);
input.setEncoding(encoding);
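The snippet above only shows that the parser detects utf-8. It does not show whether the bytes are later decoded with a different charset. A minimal sketch of how one might tell these two cases apart, using only the JDK: the class name Utf8Check and the helper isValidUtf8 are hypothetical and not part of Nutch or HBase, and the two TIS-620 bytes are hand-written sample data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Nutch or HBase.
class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Thai sample text written with escapes so the source file encoding does not matter.
        String thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";
        byte[] utf8 = thai.getBytes(StandardCharsets.UTF_8);

        // Same bytes, decoded with the right and with the wrong charset.
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + thai.equals(roundTrip)); // true
        System.out.println("ISO-8859-1 decode intact: " + thai.equals(garbled));  // false
        System.out.println("bytes are valid UTF-8: " + isValidUtf8(utf8));        // true

        // The first two characters as they would look in TIS-620 (one byte each).
        byte[] tis620 = {(byte) 0xCA, (byte) 0xC7};
        System.out.println("TIS-620 bytes valid UTF-8: " + isValidUtf8(tis620));  // false
    }
}
```

If the stored bytes pass such a check, the data itself is probably fine and the garbling more likely happens where the bytes are turned back into text.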
Regards,
Ake Tangkananond
Re: Nutch 2 encoding
Posted by Ake Tangkananond <ia...@gmail.com>.
Hi,
Sorry for the late reply. I was trying to figure it out myself but had no luck.
I'm on HBase version 0.90.6, r1295128, in local deploy mode, the working
version according to the wiki:
http://wiki.apache.org/nutch/Nutch2Tutorial
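When checking what actually ended up in HBase, note that values viewed in the HBase shell are, as far as I know, shown with non-ASCII bytes escaped as \xNN, so correctly stored Thai text can look broken there. A minimal sketch for turning such escaped output back into text, assuming that display format: the class ShellEscapeDecoder and the sample value are hypothetical, and the code uses only the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Nutch or HBase.
class ShellEscapeDecoder {
    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    // Converts text containing \xNN escapes back into raw bytes.
    // Characters that are not part of an escape are copied as single bytes.
    static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            boolean isEscape = c == '\\'
                    && i + 3 < escaped.length()
                    && escaped.charAt(i + 1) == 'x'
                    && isHex(escaped.charAt(i + 2))
                    && isHex(escaped.charAt(i + 3));
            if (isEscape) {
                out.write(Integer.parseInt(escaped.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical sample value, written the way an escaped display would show it.
        String shown = "\\xE0\\xB8\\x81\\xE0\\xB8\\x82";
        String decoded = new String(unescape(shown), StandardCharsets.UTF_8);
        // Print code points rather than glyphs so the console encoding does not matter.
        for (int i = 0; i < decoded.length(); i++) {
            System.out.printf("U+%04X%n", (int) decoded.charAt(i));
        }
    }
}
```

If the decoded code points fall in the Thai block (U+0E00 to U+0E7F), the bytes were stored as UTF-8 and the problem is more likely on the reading or display side.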
Regards,
Ake Tangkananond
Re: Nutch 2 encoding
Posted by Ferdy Galema <fe...@kalooga.com>.
It depends on the datastore and possibly the server. What store are you
using?