Posted to user@nutch.apache.org by John_C_3 <jo...@verizonwireless.com> on 2009/10/20 22:01:05 UTC

Nutch crawler charset issues utf-16

I'm attempting to crawl pages with charset UTF-16 and send the index to Solr,
where it can be searched. I followed the instructions at
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ and
successfully crawled and searched test content encoded as UTF-8. However, when I
attempt to crawl the UTF-16 content, it gets sent to Solr as Japanese
characters. The pages encoded as UTF-16 contain only English text, no
special characters. Is there any way to force Nutch to crawl the page as
UTF-8 and ignore the UTF-16 setting?
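
For illustration only, here is one way English text could end up looking like
Japanese/CJK characters: UTF-16 bytes read with the wrong byte order. This is a
guess at the cause, not something confirmed for this crawl, and the helper
names below (misdecode, looks_cjk) are made up for the sketch. It uses plain
Python and has nothing to do with Nutch or Solr code.

```python
def misdecode(text: str) -> str:
    """Encode as UTF-16 little-endian, then (wrongly) decode as big-endian."""
    raw = text.encode("utf-16-le")   # "Hi" -> bytes 48 00 69 00
    return raw.decode("utf-16-be")   # read as code units 0x4800, 0x6900


def looks_cjk(ch: str) -> bool:
    """Rough range check: CJK Extension A through CJK Unified Ideographs."""
    return 0x3400 <= ord(ch) <= 0x9FFF


if __name__ == "__main__":
    garbled = misdecode("Hello")
    # Each ASCII letter 0xNN becomes code point 0xNN00, which lands in the
    # CJK blocks for every letter A-Z and a-z.
    print([hex(ord(c)) for c in garbled])
    print(all(looks_cjk(c) for c in garbled))
```

If that is what is happening, the bytes themselves are fine and only the
decoding step is wrong, so decoding with the matching byte order gives the
original English text back.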

Thanks.