Posted to dev@lucene.apache.org by karl wettin <ka...@snigel.net> on 2005/12/12 16:40:40 UTC
Character encoding per index.
Hello list,
I'm looking for a way to change the character encoding per index. It
feels silly to store Chinese characters in 3 bytes using UTF-8 when
2 bytes would do using UTF-16. With a quick and dirty hack of
IndexInput and IndexOutput I got everything running in UTF-16, but
that is not good enough, since I have other indexes that are better
off encoded in UTF-8.
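To put numbers on the size difference, here is a minimal sketch using only the JDK's standard charsets (it does not touch Lucene's own I/O code). The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizeDemo {
    /** Bytes needed to store the text in standard UTF-8. */
    public static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed in UTF-16 (big-endian, no byte order mark). */
    public static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two CJK characters
        String english = "index";        // five ASCII characters

        // CJK text: 3 bytes per character in UTF-8, 2 in UTF-16.
        System.out.println(utf8Bytes(chinese) + " vs " + utf16Bytes(chinese)); // 6 vs 4

        // ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
        System.out.println(utf8Bytes(english) + " vs " + utf16Bytes(english)); // 5 vs 10
    }
}
```

The same comparison cuts both ways, which is why a single index-wide default does not suit every index.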
The character encoding in Lucene today is quite static. To select an
encoding, it seems to me I would have to do some major refactoring of
the project, passing a character codec from my analyzer (or perhaps
IndexWriter/Reader) all the way down to the IndexInput/Output via
TermVector/Info, etc.
Can someone think of a better way to set character encoding per
index? Or perhaps some other thought?
--
karl
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Character encoding per index.
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 12, 2005, at 10:04 AM, karl wettin wrote:
> On 12 Dec 2005, at 16:40, karl wettin wrote:
>
>> Hello list,
>>
>> I'm looking for a way to change the character encoding per index.
>> It feels silly to store Chinese characters in 3 bytes using UTF-8
>> when 2 bytes would do using UTF-16. With a quick and dirty hack of
>> IndexInput and IndexOutput I got everything running in UTF-16, but
>> that is not good enough, since I have other indexes that are better
>> off encoded in UTF-8.
>>
>> The character encoding in Lucene today is quite static. To select
>> an encoding, it seems to me I would have to do some major
>> refactoring of the project, passing a character codec from my
>> analyzer (or perhaps IndexWriter/Reader) all the way down to the
>> IndexInput/Output via TermVector/Info, etc.
On a side note, this is another issue that I believe can be addressed
by using a bytecount instead of a charcount at the head of Lucene's
Strings.
A byte-based TermBuffer needn't care what encoding the Strings are in.
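A rough sketch of the byte-count idea, under stated assumptions: this is plain JDK stream code, not Lucene's IndexInput/IndexOutput, the class and method names are invented for illustration, and the length prefix is a fixed 4-byte int for simplicity rather than a variable-length one. The point it tries to show is that a reader can copy or skip a string from the byte count alone, and only needs the charset at the moment it actually decodes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ByteCountStrings {
    /** Writes a byte-count prefix followed by the encoded bytes. */
    static void writeString(DataOutputStream out, String s, Charset cs) throws IOException {
        byte[] bytes = s.getBytes(cs);
        out.writeInt(bytes.length); // length in bytes, not in chars
        out.write(bytes);
    }

    /** Reads one string as raw bytes; no knowledge of the encoding is needed. */
    static byte[] readRaw(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return bytes;
    }

    /** Encodes all strings back to back with the given charset. */
    public static byte[] encodeAll(Charset cs, String... strings) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (String s : strings) {
                writeString(out, s, cs);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Skips n strings as opaque bytes, then decodes the next one. */
    public static String readNth(byte[] data, int n, Charset cs) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            for (int i = 0; i < n; i++) {
                readRaw(in); // skipped without decoding
            }
            return new String(readRaw(in), cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a char-count prefix, by contrast, the reader has to decode every string just to find where the next one starts.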
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Character encoding per index.
Posted by karl wettin <ka...@snigel.net>.
On 12 Dec 2005, at 16:40, karl wettin wrote:
> Hello list,
>
> I'm looking for a way to change the character encoding per index. It
> feels silly to store Chinese characters in 3 bytes using UTF-8 when
> 2 bytes would do using UTF-16. With a quick and dirty hack of
> IndexInput and IndexOutput I got everything running in UTF-16, but
> that is not good enough, since I have other indexes that are better
> off encoded in UTF-8.
>
> The character encoding in Lucene today is quite static. To select an
> encoding, it seems to me I would have to do some major refactoring
> of the project, passing a character codec from my analyzer (or
> perhaps IndexWriter/Reader) all the way down to the
> IndexInput/Output via TermVector/Info, etc.
>
> Can someone think of a better way to set character encoding per
> index? Or perhaps some other thought?
My current thought is to extend Directory
(CharacterEncodingAwareDirectory or so) and all implementations of it
to intercept the create/openFile methods and add a character encoding
strategy to the IndexInput/Output.
Is there a reason for writeChars/readChars in IndexOutput/IndexInput
to be final?
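Roughly what I have in mind, as a sketch only. Nothing here is the real Lucene API: SimpleDirectory and RamDirectory are hypothetical stand-ins built on plain JDK streams, and the sketch wraps a directory rather than extending every implementation, so that one class covers all of them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class EncodingAwareSketch {
    /** Hypothetical stand-in for a Directory; not Lucene's interface. */
    interface SimpleDirectory {
        OutputStream createFile(String name);
        InputStream openFile(String name);
    }

    /** In-memory implementation, just enough to run the sketch. */
    static class RamDirectory implements SimpleDirectory {
        private final Map<String, ByteArrayOutputStream> files = new HashMap<>();

        public OutputStream createFile(String name) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            files.put(name, out);
            return out;
        }

        public InputStream openFile(String name) {
            return new ByteArrayInputStream(files.get(name).toByteArray());
        }

        int size(String name) {
            return files.get(name).size();
        }
    }

    /** Intercepts create/open and attaches one charset per directory. */
    static class CharacterEncodingAwareDirectory {
        private final SimpleDirectory delegate;
        private final Charset charset;

        CharacterEncodingAwareDirectory(SimpleDirectory delegate, Charset charset) {
            this.delegate = delegate;
            this.charset = charset;
        }

        Writer createFile(String name) {
            return new OutputStreamWriter(delegate.createFile(name), charset);
        }

        Reader openFile(String name) {
            return new InputStreamReader(delegate.openFile(name), charset);
        }
    }

    /** Writes text through the wrapper; returns the number of bytes stored. */
    public static int storedBytes(String text, Charset charset) {
        RamDirectory ram = new RamDirectory();
        write(new CharacterEncodingAwareDirectory(ram, charset), text);
        return ram.size("terms");
    }

    /** Writes text through the wrapper and reads it back. */
    public static String roundTrip(String text, Charset charset) {
        CharacterEncodingAwareDirectory dir =
                new CharacterEncodingAwareDirectory(new RamDirectory(), charset);
        write(dir, text);
        try (Reader r = dir.openFile("terms")) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void write(CharacterEncodingAwareDirectory dir, String text) {
        try (Writer w = dir.createFile("terms")) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The code above the directory never sees the charset; it is chosen once, when the directory is opened, which is the per-index switch I am after.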
--
karl