Posted to java-user@lucene.apache.org by KK <di...@gmail.com> on 2009/05/23 09:37:12 UTC

How to index pages containing NCR(dec) unicode encodings?

Hi,
I'm trying to index some Unicode pages encoded in UTF-8. For pages that are
actually encoded in UTF-8, this works fine. But for some pages, what I get
when crawling is decimal numeric character references (NCRs), and these are
getting indexed as-is.

What I mean is: say I'm viewing some page abc.com/hello which has non-English
content. When I open the source of that page, I find that the source itself
contains the characters as references, i.e.
  &#3129;&#3142;&#3122;&#3149;&#3122;&#3146;

but when this is displayed in the browser it is rendered properly [in this
case, in the Telugu language]. So what I download as raw text is just the
above decimal NCR codes, and that is what gets posted to Lucene. For all the
other languages I'm getting the content as UTF-8, but that approach is not
able to handle this particular language.
Are these called HTML entities?
It seems that before passing this content to Lucene I have to convert these
references to the characters they stand for. Is this the way to fix it, or
are there other, better ways of doing the same?
I need proper guidance from someone who has faced similar problems earlier.
All are welcome to give their views/ideas on the same.
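
To make the conversion step concrete, here is a minimal sketch of what I have
in mind, assuming the page text is already available as a Java String. The
class name NcrDecoder and the method decode are hypothetical names made up for
this illustration, not part of Lucene. It only handles decimal references of
the form &#NNNN; and leaves everything else (named entities such as &amp;,
hex references, malformed input) untouched:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: replaces decimal numeric
// character references (e.g. "&#3129;") with the characters they denote.
final class NcrDecoder {
    // Up to 7 digits is enough for any code point and always fits in an int.
    private static final Pattern DEC_NCR = Pattern.compile("&#(\\d{1,7});");

    static String decode(String text) {
        Matcher m = DEC_NCR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            // Keep the reference as-is if it is not a usable code point
            // (out of range, or a lone surrogate).
            boolean usable = Character.isValidCodePoint(cp)
                    && (cp < 0xD800 || cp > 0xDFFF);
            String repl = usable ? new String(Character.toChars(cp)) : m.group();
            // quoteReplacement guards against decoded '$' or '\' characters.
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The idea would be to call NcrDecoder.decode(rawText) on the downloaded text
and pass the result to the indexing code, so that the sample above is indexed
as Telugu characters (U+0C39, U+0C46, ...) rather than as literal "&#3129;"
sequences.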

Thanks,
KK