Posted to user@nutch.apache.org by Niccolò Becchi <ni...@gmail.com> on 2012/08/08 13:25:04 UTC

Nutch Encoding on AWS

Hi,
I have been using Nutch to fetch English sites (UTF-8 and ISO-8859-1).
Everything works well in local mode or on a single-node Hadoop cluster
installed on my PC.
Recently I moved the crawling system to Amazon AWS, and the Fetcher now has
encoding problems with special characters: they are not recognizable
(they appear as '?').
I have tried both EMR and a cluster launched manually with the
"hadoop-ec2 launch-cluster" command, but neither works correctly.
The same pages that are fetched correctly on my local Hadoop cluster have
these encoding errors when running on AWS (with exactly the same job).

Any idea?
Thanks!
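
A minimal sketch of how this symptom can arise, assuming (the thread does not
confirm this) that text is converted to bytes somewhere with a charset that
cannot represent the character, for example on a JVM whose default charset is
US-ASCII. The class and method names are hypothetical and are not part of Nutch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Nutch code.
public class CharsetRoundTrip {

    // Encode the text to bytes and decode it again with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00f2" is the letter o with a grave accent.
        String sample = "Niccol\u00f2";
        System.out.println("US-ASCII:   " + roundTrip(sample, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + roundTrip(sample, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + roundTrip(sample, StandardCharsets.UTF_8));
    }
}
```

Only the US-ASCII round trip loses the accented character. Whether this is what
happens on the AWS cluster is an assumption that would need checking.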

Re: Nutch Encoding on AWS

Posted by Niccolò Becchi <ni...@gmail.com>.
I am using a Hadoop cluster in the us-east-1 region.
The strange thing is that if I run Nutch only on the hadoop-master instance
with the jar (i.e. not through Hadoop), encoding works fine.
But if I use the job file through Hadoop (on the same master instance,
with a slave of the same type), I start to have this problem with the encoding
of special characters.
And the same job file works well on a single-node Hadoop cluster
installed on my PC.
I really have no idea.
Thanks for everything.
Niccolò
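
Since the same job behaves differently depending on where and how it runs, one
thing that could be compared across the environments is the JVM default
charset. The following is only a diagnostic sketch; the class name is
hypothetical, and a difference in default charset is an assumed cause, not one
confirmed in this thread:

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, not part of Nutch or Hadoop.
public class DefaultCharsetReport {

    // Build a one-line summary of the charset settings of the running JVM.
    static String describe() {
        return "default charset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this on the PC and on the hadoop-master instance shows the value for a
JVM started from the shell. Hadoop task JVMs may be started with a different
environment, so the same line would have to be logged from inside a task to see
what the Fetcher actually gets. If the values differ, a commonly suggested
workaround is to pass -Dfile.encoding=UTF-8 to the task JVMs (for example
through the mapred.child.java.opts property), but that should be verified
against the Hadoop version in use.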

On Wed, Aug 8, 2012 at 9:46 PM, X3C TECH <te...@x3chaos.com> wrote:

> Not sure if it matters, but what data center are you using? Maybe the data
> center region uses different characters if the native language isn't
> english

Re: Nutch Encoding on AWS

Posted by X3C TECH <te...@x3chaos.com>.
Not sure if it matters, but which data center are you using? Maybe the data
center region uses a different character set if the native language isn't
English.
