Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2013/03/14 05:58:25 UTC

Parsed content in form of special characters

Hi,

  For some specific URLs, the fetched content comes out as garbled special
characters. Is this a character encoding issue? Do any settings need to be
changed at the Nutch parsing level?


*url:*
http://service.sony.com.cn/vaio/Announcments/33412.htm

*content extracted is something like this:*
 SONY China
Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
首页   新闻与公告   产�支� 个人电脑�周边产�
VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“� 其他产å“�
æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
按照产�型��索 关键字     选择产�系列 / 型�
选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
其他产å“..................

*title: *
SONY China Service-关于建议使用正宗索尼电�适�器的声明


Thanks - David

Re: Parsed content in form of special characters

Posted by David Philip <da...@gmail.com>.
Hi Kiran,

  I am using Nutch 1.6, with Solr 3.6 for indexing and search.

Thanks -David




Re: Parsed content in form of special characters

Posted by kiran chitturi <ch...@gmail.com>.
Hi David,

Which version of Nutch are you using? If 2.x, which backend are you using?





-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

Re: Parsed content in form of special characters

Posted by David Philip <da...@gmail.com>.
Hi,

   Thank you, Rajani Maski and feng lu. It worked for me. I had already made
the Tomcat setting but had missed the Nutch setting.
Thank you very much.

Thanks - David




Re: Parsed content in form of special characters

Posted by feng lu <am...@gmail.com>.
Hi

The problem occurs in HtmlParser#sniffCharacterEncoding. It reads the
'charset' parameter of the meta tag from the first CHUNK_SIZE bytes only.

In this page, http://service.sony.com.cn/vaio/Announcments/33412.htm, the
meta tag lies past the first 2000 (CHUNK_SIZE) bytes, so the page encoding
is not detected. This CHUNK_SIZE parameter cannot be configured.
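
To illustrate the behaviour described above, here is a minimal Python sketch.
It is not Nutch's actual Java code: the function name sniff_meta_charset, the
regular expression, and the two sample pages are hypothetical, and only the
2000-byte window is taken from the description above.

```python
import re

CHUNK_SIZE = 2000  # size of the sniffing window mentioned above

# Hypothetical, simplified pattern for a charset declared inside a meta tag.
META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def sniff_meta_charset(content: bytes, chunk_size: int = CHUNK_SIZE):
    """Return the charset declared in a meta tag within the first chunk_size bytes, or None."""
    match = META_CHARSET.search(content[:chunk_size])
    return match.group(1).decode("ascii").lower() if match else None


meta = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
# Page whose meta tag sits near the top of the document.
early = b"<html><head>" + meta + b"</head><body></body></html>"
# Page whose meta tag is pushed past the window by 3200 bytes of padding.
late = b"<html><head>" + b"<!-- padding -->" * 200 + meta + b"</head><body></body></html>"

print(sniff_meta_charset(early))  # utf-8
print(sniff_meta_charset(late))   # None
```

With the window widened to the whole document, the second page is detected as
well, which is why a fixed window size matters for pages with long headers.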







-- 
Don't Grow Old, Grow Up... :-)

Re: Parsed content in form of special characters

Posted by feng lu <am...@gmail.com>.
Hi David

The problem is in how parse-html detects the encoding of the HTML it is
parsing. The encoding of the page
http://service.sony.com.cn/vaio/Announcments/33412.htm cannot be detected by
the EncodingDetector class, so it is set to the default character encoding.
As a temporary fix, you can set the property
parser.character.encoding.default to utf-8.

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>

I tested it on my computer, and the output looks like this:

gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch plugin
parse-html org.apache.nutch.parse.html.HtmlParser ~/Downloads/45962.htm
data: Version: 5
Status: success(1,0)
Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知

.....
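
The garbling reported in this thread is consistent with UTF-8 bytes being
decoded with a single-byte Western fallback encoding. The Python sketch below
reproduces that effect. It assumes the fallback in use was windows-1252
(cp1252), which this thread does not state, and the helper name
decode_with_fallback is hypothetical.

```python
def decode_with_fallback(raw: bytes, fallback: str) -> str:
    """Decode raw bytes with the fallback encoding; undecodable bytes become U+FFFD."""
    return raw.decode(fallback, errors="replace")


# Bytes as a UTF-8 encoded page would deliver them.
raw = "产品".encode("utf-8")

wrong = decode_with_fallback(raw, "cp1252")  # assumed wrong fallback
right = decode_with_fallback(raw, "utf-8")   # fallback set to utf-8

print(wrong)  # mojibake similar to the samples in this thread
print(right)  # 产品
```

Once the detector finds no charset, the fallback alone decides how the bytes
are read, so a UTF-8 fallback only helps pages that really are UTF-8.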









-- 
Don't Grow Old, Grow Up... :-)

Re: Parsed content in form of special characters

Posted by Rajani Maski <ra...@gmail.com>.
Hi David,

     Try setting the property parser.character.encoding.default to utf-8
in nutch-site.xml. If you have already done this, make sure that you have
added URIEncoding="UTF-8" to the Tomcat connector before running the
bin/nutch solrindex command to index into Solr.

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>

Tomcat (server.xml):
<Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8445" URIEncoding="UTF-8" />


Thanks & Regards
Rajani Maski




Re: Parsed content in form of special characters

Posted by David Philip <da...@gmail.com>.
Hi,

  I also crawled this URL,
<http://service.sony.com.cn/vaio/Announcments/45962.htm>, and it has the
same issue.

The extracted title looks like this: SONY China
Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�通çŸ

It was supposed to look like this:
<title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>

Only specific URLs like the one above have this special-character problem.
For the rest, the extracted characters are correct. For example, this URL
is parsed properly: <http://service.sony.com.cn/9380.htm>


Thanks David.



Re: Parsed content in form of special characters

Posted by David Philip <da...@gmail.com>.
I am attaching the extracted text file; I am not sure whether you can
receive and view it.

My observation:
When I compared the extracted text with the source of the page
<http://service.sony.com.cn/vaio/Announcments/33412.htm> (using view source),
almost everything looks the same except for the data in the ParseText::
section of the extracted text.


Thanks -David




Re: Parsed content in form of special characters

Posted by David Philip <da...@gmail.com>.
Hi Tejas,

   I used the readseg command:

   bin/nutch readseg -dump \
     test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test \
     -nogenerate -noparse -nofetch -noparsedata

It generated the dump file. I then copied it to a text file with cat:

   /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump > test459.txt

and viewed the content as a text file (in gedit).


Below is a brief excerpt of that text file (test459.txt):

Recno:: 0
URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm

ParseText::
 SONY China
Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
首页   新闻与公告   产�支� 个人电脑�周边产�
VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
�影�产� 家庭影�产� 家庭音�产� 其他产�
æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
按照产�型��索 关键字     选择产�系列 / 型�
选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
其他产� 选择产��类别 选择产�系列
/..........................
(The output is quite large, so I didn't paste everything.)


Content::
Version: -1
url: http://service.sony.com.cn/vaio/Announcments/33412.htm
base: http://service.sony.com.cn/vaio/Announcments/33412.htm
contentType: application/xhtml+xml
metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
Content-Type=text/html Connection=close
Content:


Thanks - David
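
A note on the ParseText above: runs such as "æ•°ç" are what UTF-8 bytes look like when they are decoded as Windows-1252, and the Content-Type header in the dump carries no charset parameter. The sketch below reproduces that pattern in plain Python (it is not Nutch code). The sample string 数码影像 is an assumption about what the page originally contained, picked because its UTF-8 bytes yield the same garbled run; the thread does not establish whether the wrong decoding happens in the parser, in the viewer, or somewhere in between.

```python
# Illustration only: plain Python, not part of Nutch.
# Assumption: the page bytes are UTF-8 and some step decoded them as Windows-1252.

original = "数码影像"            # assumed original text ("digital imaging")
raw = original.encode("utf-8")   # the bytes a UTF-8 page would contain

# Decode with the wrong charset. Bytes that Windows-1252 leaves undefined
# (0x81 and 0x8F here) become U+FFFD, the replacement character.
garbled = raw.decode("cp1252", errors="replace")
print(ascii(garbled))            # ascii() keeps the output terminal-independent

# When no byte was replaced, the damage is reversible.
sample = "产".encode("utf-8").decode("cp1252")       # three Latin-looking characters
recovered = sample.encode("cp1252").decode("utf-8")  # back to the original character
print(ascii(sample), "->", ascii(recovered))
```

Once a U+FFFD has been written, the original byte is gone, so text that already contains replacement characters cannot be repaired this way; the wrong decoding has to be avoided at its source.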






On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <te...@gmail.com>wrote:

> I dont think so. The tool that you are using to view this must have support
> for the desired languages. I had same problem while looking at the pages
> having chinese content over putty. Installing language packs and tweaking
> putty settings made this go away. I don't recall exact steps / details as I
> did that about a year back.

Re: Parsed content in form of special characters

Posted by Tejas Patil <te...@gmail.com>.
I don't think so. The tool that you are using to view this must support the
desired languages. I had the same problem when looking at pages with Chinese
content over PuTTY. Installing language packs and tweaking the PuTTY settings
made it go away. I don't recall the exact steps, as I did that about a year
ago.
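
A display problem and a data problem can be told apart by inspecting the bytes of the dump instead of relying on how a terminal or editor renders them. Below is a minimal Python sketch, assuming the dump is expected to be UTF-8; the function name classify and the Latin-1 round-trip heuristic are illustrative choices, not something Nutch provides.

```python
def classify(raw: bytes) -> str:
    """Heuristically classify bytes that are expected to be UTF-8 text."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are in some other encoding (for example GBK).
        return "not valid UTF-8"
    try:
        # If every character fits in Latin-1 and the resulting bytes are
        # themselves valid UTF-8, the text was probably encoded twice.
        inner = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Real Chinese text fails the Latin-1 step and ends up here.
        return "valid UTF-8"
    return "double-encoded" if inner != text else "valid UTF-8"


good = "产品".encode("utf-8")                                   # correct data
twice = "产品".encode("utf-8").decode("latin-1").encode("utf-8")  # garbled data
other = "产".encode("gbk")                                      # different charset

for label, data in (("good", good), ("twice", twice), ("other", other)):
    print(label, classify(data))
```

If the bytes classify as valid UTF-8 but still look wrong on screen, the viewer is the likelier culprit; if they classify as double-encoded, the text was damaged before it reached the viewer. Applying the check line by line to a file read in binary mode (such as test459.txt) would be more informative than checking the whole file, since a single dump may mix correct and garbled runs.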


On Wed, Mar 13, 2013 at 9:58 PM, David Philip
<da...@gmail.com>wrote:

> Hi,
>
>   For some specific urls, the content fetched is in the form of special
> characters, Is it character encoding issue? any settings need to be done at
> nutch parsing level?