Posted to user@nutch.apache.org by Ake Tangkananond <ia...@gmail.com> on 2012/08/09 16:05:28 UTC

Nutch 2 encoding

Hi all,

I am wondering whether Nutch 2 works fine with non-English characters in your
deployment. Thai used to work fine for me in Nutch 1.5, but not in
Nutch 2. Did I miss something? Is there anything I should check?

Sorry for the silly question, and thank you in advance. ;-)


Regards,
Ake Tangkananond



Re: Nutch 2 encoding

Posted by al...@aim.com.
Hi,

I use hbase-0.92.1 and do not have any problems with UTF-8 characters. What exactly is your problem?

Alex.
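
A minimal sketch, not from the thread, of why the store itself is rarely the culprit: HBase keeps cell values as raw byte arrays, so Thai text should survive as long as the same charset is used when the bytes are written and when they are read back. The class name `Utf8RoundTrip`, the helper `decodeAs`, and the sample string are made up for illustration, and no Nutch or HBase API is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Decode raw bytes with an explicit charset (hypothetical helper).
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A Thai greeting written with Unicode escapes, so the result does
        // not depend on the encoding of this source file.
        String thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";

        // What a byte-oriented store would hold if the writer used UTF-8.
        byte[] stored = thai.getBytes(StandardCharsets.UTF_8);

        // Reading back with the same charset restores the text.
        String sameCharset = decodeAs(stored, StandardCharsets.UTF_8);
        // Reading back with a different charset garbles it.
        String wrongCharset = decodeAs(stored, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + sameCharset.equals(thai));
        System.out.println("ISO-8859-1 decode intact: " + wrongCharset.equals(thai));
    }
}
```

If the first check holds for your data but the output still looks wrong, the place to look is probably wherever a String is turned into bytes (or back) without naming a charset, since the platform default is then used.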



Re: Nutch 2 encoding

Posted by Ake Tangkananond <ia...@gmail.com>.
Hi,

I'm debugging.

I inserted some code to print the encoding in the getParse method of
HtmlParser.java, and it printed utf-8. So I think the problem might be in
the data store. What else could be the cause? Could you advise what I
should try next to get my Thai characters stored correctly in HBase? Can I
simply move to the latest version of HBase? (I'm not sure whether it is
compatible with Nutch 2.0.)


      byte[] contentInOctets = page.getContent().array();
      InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(page, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(page, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      // Added for debugging: log the detected encoding.
      LOG.info("encoding : " + encoding);
      input.setEncoding(encoding);



Regards,
Ake Tangkananond
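
One way to tell whether the bytes themselves are damaged, as opposed to being displayed with the wrong charset, is to decode them strictly and see whether that fails. Below is a minimal sketch using only `java.nio.charset`. The class `Utf8Check` and the helper `isValidUtf8` are hypothetical names, the sample bytes are illustrative, and fetching the content bytes from the store is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // Fail on bad input instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The word "Thai" in Thai script, encoded as UTF-8.
        byte[] utf8 = "\u0E44\u0E17\u0E22".getBytes(StandardCharsets.UTF_8);
        // The same word as it would look in a single-byte Thai encoding
        // (TIS-620 layout assumed); this is not well-formed UTF-8.
        byte[] singleByte = {(byte) 0xE4, (byte) 0xB7, (byte) 0xC2};

        System.out.println("utf8 bytes valid: " + isValidUtf8(utf8));
        System.out.println("single-byte bytes valid: " + isValidUtf8(singleByte));
    }
}
```

If the stored content passes this check, the bytes are probably fine and the problem is more likely in a later step that decodes them with a different charset.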






Re: Nutch 2 encoding

Posted by Ake Tangkananond <ia...@gmail.com>.
Hi,

Sorry for the late reply. I was trying to figure it out myself, but with no luck.

I'm on HBase (local deploy), version 0.90.6, r1295128, the working
version according to the wiki:
http://wiki.apache.org/nutch/Nutch2Tutorial


Regards,
Ake Tangkananond







Re: Nutch 2 encoding

Posted by Ferdy Galema <fe...@kalooga.com>.
It depends on the datastore and possibly the server. Which store are you
using?
