Posted to user@nutch.apache.org by Rushi <ru...@gmail.com> on 2018/01/25 14:32:02 UTC

Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue

Hello Everyone,
I am having an issue while crawling a Spanish website: some of the accented
characters are not converted properly.
Here is an example: Infección (wrong) should be Infección (correct).

Note: This is with the *Bayan Group Extractor plugin*. Is there any change that
I need to make so that the characters are converted correctly?

-- 
Regards
Rushikesh M
.Net Developer

RE: Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Rushikesh,

I don't have any experience with this specific plugin, but I have run into similar problems, which had two possible causes:
1. It is possible that this specific site does not properly declare what encoding it is using, and the browser guesses the correct one.
2. You may have run across https://issues.apache.org/jira/browse/NUTCH-1807. I solved a similar problem by setting the environment variable LC_ALL to en_US.UTF-8 for all Hadoop processes (more specifically, adding `export LC_ALL=en_US.UTF-8` in ~hadoop/.bashrc on all Hadoop machines solved the problem for me).
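
To make the two causes above easier to check, here is a minimal Python sketch. It assumes the garbling comes from UTF-8 bytes being decoded as ISO-8859-1, which is a common pattern but is not confirmed anywhere in this thread. The helper names (garble, repair, locale_looks_utf8) are made up for illustration and are not part of Nutch, Hadoop, or the plugin.

```python
import os


def garble(text: str) -> str:
    """Simulate a reader that decodes UTF-8 bytes as ISO-8859-1 (assumed cause)."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(text: str) -> str:
    """Reverse garble(); only meaningful if that mis-decoding really happened."""
    return text.encode("iso-8859-1").decode("utf-8")


def locale_looks_utf8(env=os.environ) -> bool:
    """Check the locale variables in POSIX precedence order: LC_ALL, LC_CTYPE, LANG.

    The first variable that is set and non-empty decides the result.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # Accept both spellings, e.g. en_US.UTF-8 and en_US.utf8.
            return "UTF8" in value.upper().replace("-", "")
    return False


if __name__ == "__main__":
    broken = garble("Infección")
    print(broken)          # the mis-decoded form
    print(repair(broken))  # back to the original text
    print(locale_looks_utf8())
```

If text in the crawl output matches what garble() produces, the bytes were probably fine and only the decoding step was wrong. Running locale_looks_utf8() as the Hadoop user on each machine gives a rough indication of whether the `export LC_ALL=en_US.UTF-8` line has taken effect there.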

	Yossi.

> -----Original Message-----
> From: Rushi [mailto:rushikeshmodem3@gmail.com]
> Sent: 25 January 2018 16:32
> To: user@nutch.apache.org; Mark Vega <ve...@uci.edu>
> Subject: Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue
> 
> Hello Everyone,
> I am having an issue while crawling a Spanish website: some of the accented
> characters are not converted properly.
> Here is an example: Infección (wrong) should be Infección (correct).
> 
> Note: This is with the *Bayan Group Extractor plugin*. Is there any change that I
> need to make so that the characters are converted correctly?
> 
> --
> Regards
> Rushikesh M
> .Net Developer