Posted to dev@ignite.apache.org by Valentin Kulichenko <va...@gmail.com> on 2017/06/30 23:27:26 UTC

Custom string encoding

Folks,

Currently the binary marshaller always encodes strings in UTF-8. However,
it can sometimes be useful to customize this. For example, if the data
contains a lot of Cyrillic, Chinese or other non-Latin symbols and only a
few Latin ones, memory is used very inefficiently. In this case it would be
great to encode the most frequently used symbols in one byte instead of two
or three.

I propose to introduce a BinaryStringEncoder interface that converts
strings to byte arrays and back, and to make it pluggable via
BinaryConfiguration. This will allow users to plug in any encoding
algorithm based on their requirements.
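
To make the proposal concrete, here is a minimal sketch of what such an interface might look like. Only the name BinaryStringEncoder comes from this message; the method names (encode, decode) and the charset-backed implementation are assumptions for illustration, not an existing Ignite API.

```java
import java.nio.charset.Charset;

/** Hypothetical shape of the proposed interface; the method names are assumed. */
interface BinaryStringEncoder {
    /** Converts a string to its encoded byte representation. */
    byte[] encode(String str);

    /** Restores a string from bytes produced by {@link #encode(String)}. */
    String decode(byte[] bytes);
}

/** Assumed implementation that simply delegates to a JDK charset. */
final class CharsetStringEncoder implements BinaryStringEncoder {
    private final Charset charset;

    CharsetStringEncoder(Charset charset) {
        this.charset = charset;
    }

    @Override public byte[] encode(String str) {
        return str.getBytes(charset);
    }

    @Override public String decode(byte[] bytes) {
        return new String(bytes, charset);
    }
}

final class EncoderDemo {
    public static void main(String[] args) {
        BinaryStringEncoder enc = new CharsetStringEncoder(Charset.forName("windows-1251"));

        // "Privet" (Russian for "hello"), written with Unicode escapes.
        String s = "\u041f\u0440\u0438\u0432\u0435\u0442";

        byte[] bytes = enc.encode(s);
        System.out.println(bytes.length);                // one byte per Cyrillic letter
        System.out.println(enc.decode(bytes).equals(s)); // round trip preserves the string
    }
}
```

A custom algorithm (for example, a user-defined code table) would implement the same two methods.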

Thoughts?

https://issues.apache.org/jira/browse/IGNITE-5655

-Val

Re: Custom string encoding

Posted by Valentin Kulichenko <va...@gmail.com>.

Yes, this needs to be tested and confirmed. I will work on it.

Would be great to get more details about indexes. I'm not sure I understand
the limitation there.

-Val

On Mon, Jul 3, 2017 at 7:21 AM, Dmitriy Setrakyan <ds...@apache.org>
wrote:

> Agree with Valya on the system-wide default. We need to have it.
>
> Also, are we certain that the encoding will provide 1-byte length for UTF-8
> for different languages? Would be nice to test it to confirm, as it has a
> potential to decrease the Ignite storage space by 2x in certain cases.
>
> D.
>
> On Sun, Jul 2, 2017 at 12:26 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
>
> > Vova,
> >
> > That's actually a good point. Probably that would be enough and there is no
> > need to introduce an abstract encoder. However, I still think it makes sense to
> > specify a default encoding in BinaryConfiguration and
> > BinaryTypeConfiguration.
> >
> > -Val
> >
> > On Sun, Jul 2, 2017 at 10:31 AM Vladimir Ozerov <vo...@gridgain.com>
> > wrote:
> >
> > > Yes, this is exactly what non-UTF8 encodings do.
> > >
> > > On Sun, Jul 2, 2017 at 8:08 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:
> > >
> > > > On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <
> vozerov@gridgain.com>
> > > > wrote:
> > > >
> > > > > There is no need for custom encoders, as they are already built-in
> to
> > > > Java.
> > > > >
> > > >
> > > > Will non-ASCII encodings fit into 1 byte? The whole point here is to
> > save
> > > > space.
> > > >
> > > >
> > > > >
> > > > > On Sun, Jul 2, 2017 at 7:16 PM, Dmitriy Setrakyan <dsetrakyan@apache.org> wrote:
> > > > >
> > > > > > Vladimir, how would you plugin custom encoders in your design?
> > > > > >
> > > > > > On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <
> > > vozerov@gridgain.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Valya,
> > > > > > >
> > > > > > > Personally I vote against this feature. BinaryConfiguration is
> > > proven
> > > > > to
> > > > > > be
> > > > > > > inconvenient, since it has to be configured before node start,
> it
> > > > > cannot
> > > > > > be
> > > > > > > changed in runtime, and it requires classes on the server.
> > > Moreover,
> > > > if
> > > > > > you
> > > > > > > decide to change encoding at some point, it would be
> impossible.
> > > > > > >
> > > > > > > I think, we should add this feature on API level instead. If
> > string
> > > > is
> > > > > > > written in non-UTF8 form, we will write in different format:
> > > > > > > [encoding_code][string]
> > > > > > >
> > > > > > > BinaryWriter.writeString(String fieldName, String val);
> > > > > > > BinaryWriter.writeString(String fieldName, String val, *String encoding*);
> > > > > > >
> > > > > > > BinaryReader.readString(String fieldName);
> > > > > > > BinaryReader.readString(String fieldName, *String encoding*);
> > > > > > >
> > > > > > > BinaryObjectBuilder.writeString(String fieldName, String val, *String encoding*);
> > > > > > >
> > > > > > > class MyClass {
> > > > > > >     *@BinaryString(encoding = "Cp1251")*
> > > > > > >     private String myCyrillicString;
> > > > > > > }
> > > > > > >
> > > > > > > Vladimir.
> > > > > > >
> > > > > > > On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <
> > > > > dsetrakyan@apache.org
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <
> > > > > > sergi.vladykin@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > In SQL indexes we may store partial strings and assume them
> > to
> > > be
> > > > > in
> > > > > > > > UTF-8,
> > > > > > > > > I don't think this can be abstracted away. But may be this
> is
> > > > not a
> > > > > > big
> > > > > > > > > deal if in indexes we still will use UTF-8.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Sergi, why does it matter if it is UTF8 or custom encoding?
> Why
> > > > can't
> > > > > > we
> > > > > > > > use our own compact encoding in indexes?
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <
> > > > > dsetrakyan@apache.org
> > > > > > >:
> > > > > > > > >
> > > > > > > > > > Val, do you know how we compare strings in SQL queries?
> > Will
> > > we
> > > > > be
> > > > > > > able
> > > > > > > > > to
> > > > > > > > > > use this encoder?
> > > > > > > > > >
> > > > > > > > > > Additionally, I think that the encoder is a bit too
> > abstract.
> > > > Why
> > > > > > not
> > > > > > > > go
> > > > > > > > > > even further and allow users create their own ASCII table
> > for
> > > > > > > encoding?
> > > > > > > > > >
> > > > > > > > > > D.
> > > > > > > > > >
> > > > > > > > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Andrey,
> > > > > > > > > > >
> > > > > > > > > > > Can you elaborate more on this? What is your concern?
> > > > > > > > > > >
> > > > > > > > > > > -Val
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > > > > > > > andrey.mashenkov@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Val,
> > > > > > > > > > > >
> > > > > > > > > > > > Looks like make sense.
> > > > > > > > > > > >
> > > > > > > > > > > > This will not affect FullText index, as Lucene has
> own
> > > > format
> > > > > > for
> > > > > > > > > > storing
> > > > > > > > > > > > data.
> > > > > > > > > > > >
> > > > > > > > > > > > But.. would it be compatible with H2 indexing ? I
> > doubt.
> > > > > > > > > > > >
> > > > > > > > > > > > On Jul 1, 2017 2:27 AM, "Valentin Kulichenko" <
> > > > > > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Currently binary marshaller always encodes strings
> in
> > > > > UTF-8.
> > > > > > > > > However,
> > > > > > > > > > > > > sometimes it can be useful to customize this. For
> > > > example,
> > > > > if
> > > > > > > > data
> > > > > > > > > > > > contains
> > > > > > > > > > > > > a lot of Cyrillic, Chinese or other symbols, but
> not
> > so
> > > > > many
> > > > > > > > Latin
> > > > > > > > > > > > symbols,
> > > > > > > > > > > > > memory is used very inefficiently. In this case it
> > > would
> > > > be
> > > > > > > great
> > > > > > > > > to
> > > > > > > > > > > > encode
> > > > > > > > > > > > > most frequently used symbols in one byte instead of
> > two
> > > > or
> > > > > > > three.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I propose to introduce BinaryStringEncoder
> interface
> > > that
> > > > > > will
> > > > > > > > > > convert
> > > > > > > > > > > > > strings to byte arrays and back, and make it
> > pluggable
> > > > via
> > > > > > > > > > > > > BinaryConfiguration. This will allow users to plug
> in
> > > any
> > > > > > > > encoding
> > > > > > > > > > > > > algorithms based on their requirements.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Val
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Agree with Valya on the system-wide default. We need to have it.

Also, are we certain that the encoding will take 1 byte per character for
different languages, where UTF-8 takes two or three? It would be nice to
test this to confirm, as it has the potential to decrease Ignite storage
space by 2x in certain cases.
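
One way to check this is to compare encoded lengths directly with the JDK. The sketch below is only an illustration: the sample string and the choice of windows-1251 are assumptions, and the class name is hypothetical. It measures the same Russian text in UTF-8 and in a single-byte Cyrillic code page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizeCheck {
    /** Returns the number of bytes the text occupies in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "Privet, mir" ("Hello, world" in Russian): 9 Cyrillic letters plus ", ".
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440";

        int utf8 = encodedLength(text, StandardCharsets.UTF_8);
        int cp1251 = encodedLength(text, Charset.forName("windows-1251"));

        // Cyrillic letters take 2 bytes each in UTF-8 and 1 byte each in windows-1251;
        // the ASCII comma and space take 1 byte in both.
        System.out.println("UTF-8: " + utf8 + " bytes");          // 9 * 2 + 2 = 20
        System.out.println("windows-1251: " + cp1251 + " bytes"); // 11
    }
}
```

The full 2x saving is reached only for text that is entirely Cyrillic; any ASCII characters mixed in lower the ratio, since they already take one byte in UTF-8.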

D.

On Sun, Jul 2, 2017 at 12:26 PM, Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> Vova,
>
> That's actually a good point. Probably that would be enough and there is no
> need to introduce an abstract encoder. However, I still think it makes sense to
> specify a default encoding in BinaryConfiguration and
> BinaryTypeConfiguration.
>
> -Val
>
> On Sun, Jul 2, 2017 at 10:31 AM Vladimir Ozerov <vo...@gridgain.com>
> wrote:
>
> > Yes, this is exactly what non-UTF8 encodings do.
> >
> > On Sun, Jul 2, 2017 at 8:08 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:
> >
> > > On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <vo...@gridgain.com>
> > > wrote:
> > >
> > > > There is no need for custom encoders, as they are already built-in to
> > > Java.
> > > >
> > >
> > > Will non-ASCII encodings fit into 1 byte? The whole point here is to
> save
> > > space.
> > >
> > >
> > > >
> > > > On Sun, Jul 2, 2017 at 7:16 PM, Dmitriy Setrakyan <dsetrakyan@apache.org> wrote:
> > > >
> > > > > Vladimir, how would you plugin custom encoders in your design?
> > > > >
> > > > > On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <
> > vozerov@gridgain.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Valya,
> > > > > >
> > > > > > Personally I vote against this feature. BinaryConfiguration is
> > proven
> > > > to
> > > > > be
> > > > > > inconvenient, since it has to be configured before node start, it
> > > > cannot
> > > > > be
> > > > > > changed in runtime, and it requires classes on the server.
> > Moreover,
> > > if
> > > > > you
> > > > > > decide to change encoding at some point, it would be impossible.
> > > > > >
> > > > > > I think, we should add this feature on API level instead. If
> string
> > > is
> > > > > > written in non-UTF8 form, we will write in different format:
> > > > > > [encoding_code][string]
> > > > > >
> > > > > > BinaryWriter.writeString(String fieldName, String val);
> > > > > > BinaryWriter.writeString(String fieldName, String val, *String encoding*);
> > > > > >
> > > > > > BinaryReader.readString(String fieldName);
> > > > > > BinaryReader.readString(String fieldName, *String encoding*);
> > > > > >
> > > > > > BinaryObjectBuilder.writeString(String fieldName, String val, *String encoding*);
> > > > > >
> > > > > > class MyClass {
> > > > > >     *@BinaryString(encoding = "Cp1251")*
> > > > > >     private String myCyrillicString;
> > > > > > }
> > > > > >
> > > > > > Vladimir.
> > > > > >
> > > > > > On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <
> > > > dsetrakyan@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <
> > > > > sergi.vladykin@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > In SQL indexes we may store partial strings and assume them
> to
> > be
> > > > in
> > > > > > > UTF-8,
> > > > > > > > I don't think this can be abstracted away. But may be this is
> > > not a
> > > > > big
> > > > > > > > deal if in indexes we still will use UTF-8.
> > > > > > > >
> > > > > > >
> > > > > > > Sergi, why does it matter if it is UTF8 or custom encoding? Why
> > > can't
> > > > > we
> > > > > > > use our own compact encoding in indexes?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <
> > > > dsetrakyan@apache.org
> > > > > >:
> > > > > > > >
> > > > > > > > > Val, do you know how we compare strings in SQL queries?
> Will
> > we
> > > > be
> > > > > > able
> > > > > > > > to
> > > > > > > > > use this encoder?
> > > > > > > > >
> > > > > > > > > Additionally, I think that the encoder is a bit too
> abstract.
> > > Why
> > > > > not
> > > > > > > go
> > > > > > > > > even further and allow users create their own ASCII table
> for
> > > > > > encoding?
> > > > > > > > >
> > > > > > > > > D.
> > > > > > > > >
> > > > > > > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Andrey,
> > > > > > > > > >
> > > > > > > > > > Can you elaborate more on this? What is your concern?
> > > > > > > > > >
> > > > > > > > > > -Val
> > > > > > > > > >
> > > > > > > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > > > > > > andrey.mashenkov@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Val,
> > > > > > > > > > >
> > > > > > > > > > > Looks like make sense.
> > > > > > > > > > >
> > > > > > > > > > > This will not affect FullText index, as Lucene has own
> > > format
> > > > > for
> > > > > > > > > storing
> > > > > > > > > > > data.
> > > > > > > > > > >
> > > > > > > > > > > But.. would it be compatible with H2 indexing ? I
> doubt.
> > > > > > > > > > >
> > > > > > > > > > > On Jul 1, 2017 2:27 AM, "Valentin Kulichenko" <
> > > > > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Folks,
> > > > > > > > > > > >
> > > > > > > > > > > > Currently binary marshaller always encodes strings in
> > > > UTF-8.
> > > > > > > > However,
> > > > > > > > > > > > sometimes it can be useful to customize this. For
> > > example,
> > > > if
> > > > > > > data
> > > > > > > > > > > contains
> > > > > > > > > > > > a lot of Cyrillic, Chinese or other symbols, but not
> so
> > > > many
> > > > > > > Latin
> > > > > > > > > > > symbols,
> > > > > > > > > > > > memory is used very inefficiently. In this case it
> > would
> > > be
> > > > > > great
> > > > > > > > to
> > > > > > > > > > > encode
> > > > > > > > > > > > most frequently used symbols in one byte instead of
> two
> > > or
> > > > > > three.
> > > > > > > > > > > >
> > > > > > > > > > > > I propose to introduce BinaryStringEncoder interface
> > that
> > > > > will
> > > > > > > > > convert
> > > > > > > > > > > > strings to byte arrays and back, and make it
> pluggable
> > > via
> > > > > > > > > > > > BinaryConfiguration. This will allow users to plug in
> > any
> > > > > > > encoding
> > > > > > > > > > > > algorithms based on their requirements.
> > > > > > > > > > > >
> > > > > > > > > > > > Thoughts?
> > > > > > > > > > > >
> > > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > > > > > > >
> > > > > > > > > > > > -Val
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Valentin Kulichenko <va...@gmail.com>.

Vova,

That's actually a good point. Probably that would be enough and there is no
need to introduce an abstract encoder. However, I still think it makes sense to
specify a default encoding in BinaryConfiguration and BinaryTypeConfiguration.
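
A rough sketch of how such defaults could be resolved is shown below. Everything in it is hypothetical: the class EncodingDefaults and its methods are not part of Ignite's BinaryConfiguration or BinaryTypeConfiguration, and the precedence order (explicit per-call encoding, then per-type default, then system-wide default) is an assumption about how the pieces discussed in this thread might fit together.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for encoding defaults; not an Ignite class. */
final class EncodingDefaults {
    /** System-wide default; UTF-8 matches the current marshaller behavior. */
    private Charset systemDefault = StandardCharsets.UTF_8;

    /** Optional per-type overrides, keyed by type name. */
    private final Map<String, Charset> perType = new HashMap<>();

    void setSystemDefault(Charset charset) {
        systemDefault = charset;
    }

    void setTypeDefault(String typeName, Charset charset) {
        perType.put(typeName, charset);
    }

    /**
     * Assumed precedence: an explicit per-call encoding wins, then the
     * per-type default, then the system-wide default.
     */
    Charset resolve(String typeName, Charset explicit) {
        if (explicit != null)
            return explicit;

        return perType.getOrDefault(typeName, systemDefault);
    }

    public static void main(String[] args) {
        EncodingDefaults defaults = new EncodingDefaults();
        defaults.setTypeDefault("Person", Charset.forName("windows-1251"));

        System.out.println(defaults.resolve("Person", null)); // per-type default
        System.out.println(defaults.resolve("Order", null));  // system-wide default
    }
}
```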

-Val

On Sun, Jul 2, 2017 at 10:31 AM Vladimir Ozerov <vo...@gridgain.com>
wrote:

> Yes, this is exactly what non-UTF8 encodings do.
>
> On Sun, Jul 2, 2017 at 8:08 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:
>
> > On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <vo...@gridgain.com>
> > wrote:
> >
> > > There is no need for custom encoders, as they are already built-in to
> > Java.
> > >
> >
> > Will non-ASCII encodings fit into 1 byte? The whole point here is to save
> > space.
> >
> >
> > >
> > > On Sun, Jul 2, 2017 at 7:16 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:
> > >
> > > > Vladimir, how would you plugin custom encoders in your design?
> > > >
> > > > On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <
> vozerov@gridgain.com
> > >
> > > > wrote:
> > > >
> > > > > Valya,
> > > > >
> > > > > Personally I vote against this feature. BinaryConfiguration is
> proven
> > > to
> > > > be
> > > > > inconvenient, since it has to be configured before node start, it
> > > cannot
> > > > be
> > > > > changed in runtime, and it requires classes on the server.
> Moreover,
> > if
> > > > you
> > > > > decide to change encoding at some point, it would be impossible.
> > > > >
> > > > > I think, we should add this feature on API level instead. If string
> > is
> > > > > written in non-UTF8 form, we will write in different format:
> > > > > [encoding_code][string]
> > > > >
> > > > > BinaryWriter.writeString(String fieldName, String val);
> > > > > BinaryWriter.writeString(String fieldName, String val, *String encoding*);
> > > > >
> > > > > BinaryReader.readString(String fieldName);
> > > > > BinaryReader.readString(String fieldName, *String encoding*);
> > > > >
> > > > > BinaryObjectBuilder.writeString(String fieldName, String val, *String encoding*);
> > > > >
> > > > > class MyClass {
> > > > >     *@BinaryString(encoding = "Cp1251")*
> > > > >     private String myCyrillicString;
> > > > > }
> > > > >
> > > > > Vladimir.
> > > > >
> > > > > On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <
> > > dsetrakyan@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <
> > > > sergi.vladykin@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > In SQL indexes we may store partial strings and assume them to
> be
> > > in
> > > > > > UTF-8,
> > > > > > > I don't think this can be abstracted away. But may be this is
> > not a
> > > > big
> > > > > > > deal if in indexes we still will use UTF-8.
> > > > > > >
> > > > > >
> > > > > > Sergi, why does it matter if it is UTF8 or custom encoding? Why
> > can't
> > > > we
> > > > > > use our own compact encoding in indexes?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <
> > > dsetrakyan@apache.org
> > > > >:
> > > > > > >
> > > > > > > > Val, do you know how we compare strings in SQL queries? Will
> we
> > > be
> > > > > able
> > > > > > > to
> > > > > > > > use this encoder?
> > > > > > > >
> > > > > > > > Additionally, I think that the encoder is a bit too abstract.
> > Why
> > > > not
> > > > > > go
> > > > > > > > even further and allow users create their own ASCII table for
> > > > > encoding?
> > > > > > > >
> > > > > > > > D.
> > > > > > > >
> > > > > > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Andrey,
> > > > > > > > >
> > > > > > > > > Can you elaborate more on this? What is your concern?
> > > > > > > > >
> > > > > > > > > -Val
> > > > > > > > >
> > > > > > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > > > > > andrey.mashenkov@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Val,
> > > > > > > > > >
> > > > > > > > > > Looks like make sense.
> > > > > > > > > >
> > > > > > > > > > This will not affect FullText index, as Lucene has own
> > format
> > > > for
> > > > > > > > storing
> > > > > > > > > > data.
> > > > > > > > > >
> > > > > > > > > > But.. would it be compatible with H2 indexing ? I doubt.
> > > > > > > > > >
> > > > > > > > > > On Jul 1, 2017 2:27 AM, "Valentin Kulichenko" <
> > > > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > Currently binary marshaller always encodes strings in
> > > UTF-8.
> > > > > > > However,
> > > > > > > > > > > sometimes it can be useful to customize this. For
> > example,
> > > if
> > > > > > data
> > > > > > > > > > contains
> > > > > > > > > > > a lot of Cyrillic, Chinese or other symbols, but not so
> > > many
> > > > > > Latin
> > > > > > > > > > symbols,
> > > > > > > > > > > memory is used very inefficiently. In this case it
> would
> > be
> > > > > great
> > > > > > > to
> > > > > > > > > > encode
> > > > > > > > > > > most frequently used symbols in one byte instead of two
> > or
> > > > > three.
> > > > > > > > > > >
> > > > > > > > > > > I propose to introduce BinaryStringEncoder interface
> that
> > > > will
> > > > > > > > convert
> > > > > > > > > > > strings to byte arrays and back, and make it pluggable
> > via
> > > > > > > > > > > BinaryConfiguration. This will allow users to plug in
> any
> > > > > > encoding
> > > > > > > > > > > algorithms based on their requirements.
> > > > > > > > > > >
> > > > > > > > > > > Thoughts?
> > > > > > > > > > >
> > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > > > > > >
> > > > > > > > > > > -Val
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Vladimir Ozerov <vo...@gridgain.com>.

Yes, this is exactly what non-UTF8 encodings do.
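
For a given charset this can be verified through the JDK's encoder API. The sketch below is an illustration with an assumed class name; it checks whether a charset never needs more than one byte per character, and whether it can represent a particular string at all. The second check matters because String.getBytes silently substitutes a replacement byte for characters the charset cannot map.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SingleByteCheck {
    /** True if the charset never needs more than one byte per char. */
    static boolean isSingleByte(Charset charset) {
        return charset.newEncoder().maxBytesPerChar() == 1.0f;
    }

    /** True if every character of the text is representable in the charset. */
    static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");

        System.out.println(isSingleByte(cp1251));                 // single-byte code page
        System.out.println(isSingleByte(StandardCharsets.UTF_8)); // variable-length encoding

        // A Cyrillic letter is representable in windows-1251, a Chinese character is not.
        System.out.println(canEncode(cp1251, "\u041f"));
        System.out.println(canEncode(cp1251, "\u4e2d"));
    }
}
```

So the space saving holds only as long as the data stays within the repertoire of the chosen code page.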

On Sun, Jul 2, 2017 at 8:08 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:

> On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <vo...@gridgain.com>
> wrote:
>
> > There is no need for custom encoders, as they are already built-in to
> Java.
> >
>
> Will non-ASCII encodings fit into 1 byte? The whole point here is to save
> space.
>
>
> >
> > On Sun, Jul 2, 2017 at 7:16 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:
> >
> > > Vladimir, how would you plugin custom encoders in your design?
> > >
> > > On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <vozerov@gridgain.com
> >
> > > wrote:
> > >
> > > > Valya,
> > > >
> > > > Personally I vote against this feature. BinaryConfiguration is proven
> > to
> > > be
> > > > inconvenient, since it has to be configured before node start, it
> > cannot
> > > be
> > > > changed in runtime, and it requires classes on the server. Moreover,
> if
> > > you
> > > > decide to change encoding at some point, it would be impossible.
> > > >
> > > > I think, we should add this feature on API level instead. If string
> is
> > > > written in non-UTF8 form, we will write in different format:
> > > > [encoding_code][string]
> > > >
> > > > BinaryWriter.writeString(String fieldName, String val);
> > > > BinaryWriter.writeString(String fieldName, String val, *String encoding*);
> > > >
> > > > BinaryReader.readString(String fieldName);
> > > > BinaryReader.readString(String fieldName, *String encoding*);
> > > >
> > > > BinaryObjectBuilder.writeString(String fieldName, String val, *String encoding*);
> > > >
> > > > class MyClass {
> > > >     *@BinaryString(encoding = "Cp1251")*
> > > >     private String myCyrillicString;
> > > > }
> > > >
> > > > Vladimir.
> > > >
> > > > On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <
> > dsetrakyan@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <
> > > sergi.vladykin@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > In SQL indexes we may store partial strings and assume them to be
> > in
> > > > > UTF-8,
> > > > > > I don't think this can be abstracted away. But may be this is
> not a
> > > big
> > > > > > deal if in indexes we still will use UTF-8.
> > > > > >
> > > > >
> > > > > Sergi, why does it matter if it is UTF8 or custom encoding? Why
> can't
> > > we
> > > > > use our own compact encoding in indexes?
> > > > >
> > > > >
> > > > > >
> > > > > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <
> > dsetrakyan@apache.org
> > > >:
> > > > > >
> > > > > > > Val, do you know how we compare strings in SQL queries? Will we
> > be
> > > > able
> > > > > > to
> > > > > > > use this encoder?
> > > > > > >
> > > > > > > Additionally, I think that the encoder is a bit too abstract.
> Why
> > > not
> > > > > go
> > > > > > > even further and allow users create their own ASCII table for
> > > > encoding?
> > > > > > >
> > > > > > > D.
> > > > > > >
> > > > > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > >
> > > > > > > > Andrey,
> > > > > > > >
> > > > > > > > Can you elaborate more on this? What is your concern?
> > > > > > > >
> > > > > > > > -Val
> > > > > > > >
> > > > > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > > > > andrey.mashenkov@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Val,
> > > > > > > > >
> > > > > > > > > Looks like make sense.
> > > > > > > > >
> > > > > > > > > This will not affect FullText index, as Lucene has own
> format
> > > for
> > > > > > > storing
> > > > > > > > > data.
> > > > > > > > >
> > > > > > > > > But.. would it be compatible with H2 indexing ? I doubt.
> > > > > > > > >
> > > > > > > > > On Jul 1, 2017 2:27 AM, "Valentin Kulichenko" <
> > > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Folks,
> > > > > > > > > >
> > > > > > > > > > Currently binary marshaller always encodes strings in
> > UTF-8.
> > > > > > However,
> > > > > > > > > > sometimes it can be useful to customize this. For
> example,
> > if
> > > > > data
> > > > > > > > > contains
> > > > > > > > > > a lot of Cyrillic, Chinese or other symbols, but not so
> > many
> > > > > Latin
> > > > > > > > > symbols,
> > > > > > > > > > memory is used very inefficiently. In this case it would
> be
> > > > great
> > > > > > to
> > > > > > > > > encode
> > > > > > > > > > most frequently used symbols in one byte instead of two
> or
> > > > three.
> > > > > > > > > >
> > > > > > > > > > I propose to introduce BinaryStringEncoder interface that
> > > will
> > > > > > > convert
> > > > > > > > > > strings to byte arrays and back, and make it pluggable
> via
> > > > > > > > > > BinaryConfiguration. This will allow users to plug in any
> > > > > encoding
> > > > > > > > > > algorithms based on their requirements.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > > > > >
> > > > > > > > > > -Val
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <vo...@gridgain.com>
wrote:

> There is no need for custom encoders, as they are already built-in to Java.
>

Will non-ASCII encodings fit into 1 byte? The whole point here is to save
space.


>
> On Sun, Jul 2, 2017 at 7:16 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:
>
> > Vladimir, how would you plugin custom encoders in your design?
> >
> > On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <vo...@gridgain.com>
> > wrote:
> >
> > > Valya,
> > >
> > > Personally I vote against this feature. BinaryConfiguration is proven
> to
> > be
> > > inconvenient, since it has to be configured before node start, it
> cannot
> > be
> > > changed in runtime, and it requires classes on the server. Moreover, if
> > you
> > > decide to change encoding at some point, it would be impossible.
> > >
> > > I think, we should add this feature on API level instead. If string is
> > > written in non-UTF8 form, we will write in different format:
> > > [encoding_code][string]
> > >
> > > BinaryWriter.writeString(String fieldName, String val);
> > > BinaryWriter.writeString(String fieldName, String val, *String encoding*);
> > >
> > > BinaryReader.readString(String fieldName);
> > > BinaryReader.readString(String fieldName, *String encoding*);
> > >
> > > BinaryObjectBuilder.writeString(String fieldName, String val, *String encoding*);
> > >
> > > class MyClass {
> > >     *@BinaryString(encoding = "Cp1251")*
> > >     private String myCyrillicString;
> > > }
> > >
> > > Vladimir.
> > >
> > > On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <
> dsetrakyan@apache.org
> > >
> > > wrote:
> > >
> > > > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <
> > sergi.vladykin@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > In SQL indexes we may store partial strings and assume them to be
> in
> > > > UTF-8,
> > > > > I don't think this can be abstracted away. But may be this is not a
> > big
> > > > > deal if in indexes we still will use UTF-8.
> > > > >
> > > >
> > > > Sergi, why does it matter if it is UTF8 or custom encoding? Why can't
> > we
> > > > use our own compact encoding in indexes?
> > > >
> > > >
> > > > >
> > > > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <
> dsetrakyan@apache.org
> > >:
> > > > >
> > > > > > Val, do you know how we compare strings in SQL queries? Will we
> be
> > > able
> > > > > to
> > > > > > use this encoder?
> > > > > >
> > > > > > Additionally, I think that the encoder is a bit too abstract. Why
> > not
> > > > go
> > > > > > even further and allow users create their own ASCII table for
> > > encoding?
> > > > > >
> > > > > > D.
> > > > > >
> > > > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > >
> > > > > > > Andrey,
> > > > > > >
> > > > > > > Can you elaborate more on this? What is your concern?
> > > > > > >
> > > > > > > -Val
> > > > > > >
> > > > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > > > andrey.mashenkov@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Val,
> > > > > > > >
> > > > > > > > Looks like make sense.
> > > > > > > >
> > > > > > > > This will not affect FullText index, as Lucene has own format
> > for
> > > > > > storing
> > > > > > > > data.
> > > > > > > >
> > > > > > > > But.. would it be compatible with H2 indexing ? I doubt.
> > > > > > > >
> > > > > > > > On Jul 1, 2017 2:27 AM, "Valentin Kulichenko" <
> > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > Currently binary marshaller always encodes strings in
> UTF-8.
> > > > > However,
> > > > > > > > > sometimes it can be useful to customize this. For example,
> if
> > > > data
> > > > > > > > contains
> > > > > > > > > a lot of Cyrillic, Chinese or other symbols, but not so
> many
> > > > Latin
> > > > > > > > symbols,
> > > > > > > > > memory is used very inefficiently. In this case it would be
> > > great
> > > > > to
> > > > > > > > encode
> > > > > > > > > most frequently used symbols in one byte instead of two or
> > > three.
> > > > > > > > >
> > > > > > > > > I propose to introduce BinaryStringEncoder interface that
> > will
> > > > > > convert
> > > > > > > > > strings to byte arrays and back, and make it pluggable via
> > > > > > > > > BinaryConfiguration. This will allow users to plug in any
> > > > encoding
> > > > > > > > > algorithms based on their requirements.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > > > >
> > > > > > > > > -Val
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Vladimir Ozerov <vo...@gridgain.com>.

There is no need for custom encoders, as they are already built into Java.
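
As an illustration of relying on the JDK's built-in charsets, here is a minimal sketch of the [encoding_code][string] layout from the proposal quoted below. The thread does not specify the details, so the one-byte code, the code table, the 4-byte length prefix and the class name are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TaggedStringFormat {
    /** Assumed code table: the array index is the encoding code written to the stream. */
    private static final Charset[] CHARSETS = {
        StandardCharsets.UTF_8,          // code 0
        Charset.forName("windows-1251")  // code 1
    };

    /** Writes [encoding_code (1 byte)][length (4 bytes)][encoded string]. */
    static byte[] write(String val, int code) {
        byte[] payload = val.getBytes(CHARSETS[code]);

        return ByteBuffer.allocate(1 + 4 + payload.length)
            .put((byte) code)
            .putInt(payload.length)
            .put(payload)
            .array();
    }

    /** Reads the encoding code first, then decodes the payload with that charset. */
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);

        Charset charset = CHARSETS[buf.get()];
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);

        return new String(payload, charset);
    }

    public static void main(String[] args) {
        // "Privet" (Russian for "hello"), written with Unicode escapes.
        String s = "\u041f\u0440\u0438\u0432\u0435\u0442";

        byte[] data = write(s, 1);
        System.out.println(data.length);          // 1 + 4 + 6 bytes
        System.out.println(read(data).equals(s)); // round trip preserves the string
    }
}
```

Because the code travels with each string, the reader does not need any node-level configuration to decode it.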

On Sun, Jul 2, 2017 at 7:16 PM, Dmitriy Setrakyan <ds...@apache.org> wrote:

> Vladimir, how would you plugin custom encoders in your design?
>
> On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <vo...@gridgain.com>
> wrote:
>
> > Valya,
> >
> > Personally I vote against this feature. BinaryConfiguration is proven to
> be
> > inconvenient, since it has to be configured before node start, it cannot
> be
> > changed in runtime, and it requires classes on the server. Moreover, if
> you
> > decide to change encoding at some point, it would be impossible.
> >
> > I think, we should add this feature on API level instead. If string is
> > written in non-UTF8 form, we will write in different format:
> > [encoding_code][string]
> >
> > BinaryWriter.writeString(String fieldName, String val);
> > BinaryWriter.writeString(String fieldName, String val, *String
> encoding*);
> >
> > BinaryReader.readString(String fieldName);
> > BinaryReader.readString(String fieldName, *String encoding*);
> >
> > BinaryObjectBuilder.writeString(String fieldName, String val, *String
> > encoding*);
> >
> > class MyClass {
> >     *@BinaryString(encoding = "Cp1251")*
> >     private String myCyrillicString;
> > }
> >
> > Vladimir.
> >
> > On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <dsetrakyan@apache.org
> >
> > wrote:
> >
> > > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <
> sergi.vladykin@gmail.com
> > >
> > > wrote:
> > >
> > > > In SQL indexes we may store partial strings and assume them to be in
> > > UTF-8,
> > > > I don't think this can be abstracted away. But may be this is not a
> big
> > > > deal if in indexes we still will use UTF-8.
> > > >
> > >
> > > Sergi, why does it matter if it is UTF8 or custom encoding? Why can't
> we
> > > use our own compact encoding in indexes?
> > >
> > >
> > > >
> > > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <dsetrakyan@apache.org
> >:
> > > >
> > > > > Val, do you know how we compare strings in SQL queries? Will we be
> > able
> > > > to
> > > > > use this encoder?
> > > > >
> > > > > Additionally, I think that the encoder is a bit too abstract. Why
> not
> > > go
> > > > > even further and allow users create their own ASCII table for
> > encoding?
> > > > >
> > > > > D.
> > > > >
> > > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > > valentin.kulichenko@gmail.com> wrote:
> > > > >
> > > > > > Andrey,
> > > > > >
> > > > > > Can you elaborate more on this? What is your concern?
> > > > > >
> > > > > > -Val
> > > > > >
> > > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > > andrey.mashenkov@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Val,
> > > > > > >
> > > > > > > > Looks like it makes sense.
> > > > > > > >
> > > > > > > > This will not affect FullText index, as Lucene has own format
> for
> > > > > storing
> > > > > > > > data.
> > > > > > > >
> > > > > > > > But... would it be compatible with H2 indexing? I doubt it.
> > > > > > > >
> > > > > > > > On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
> > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > Currently binary marshaller always encodes strings in UTF-8.
> > > > However,
> > > > > > > > sometimes it can be useful to customize this. For example, if
> > > data
> > > > > > > contains
> > > > > > > > a lot of Cyrillic, Chinese or other symbols, but not so many
> > > Latin
> > > > > > > symbols,
> > > > > > > > memory is used very inefficiently. In this case it would be
> > great
> > > > to
> > > > > > > encode
> > > > > > > > most frequently used symbols in one byte instead of two or
> > three.
> > > > > > > >
> > > > > > > > I propose to introduce BinaryStringEncoder interface that
> will
> > > > > convert
> > > > > > > > strings to byte arrays and back, and make it pluggable via
> > > > > > > > BinaryConfiguration. This will allow users to plug in any
> > > encoding
> > > > > > > > algorithms based on their requirements.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > > >
> > > > > > > > -Val
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Vladimir, how would you plug in custom encoders in your design?

On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <vo...@gridgain.com>
wrote:

> Valya,
>
> Personally I vote against this feature. BinaryConfiguration is proven to be
> inconvenient, since it has to be configured before node start, it cannot be
> changed in runtime, and it requires classes on the server. Moreover, if you
> decide to change encoding at some point, it would be impossible.
>
> I think, we should add this feature on API level instead. If string is
> written in non-UTF8 form, we will write in different format:
> [encoding_code][string]
>
> BinaryWriter.writeString(String fieldName, String val);
> BinaryWriter.writeString(String fieldName, String val, *String encoding*);
>
> BinaryReader.readString(String fieldName);
> BinaryReader.readString(String fieldName, *String encoding*);
>
> BinaryObjectBuilder.writeString(String fieldName, String val, *String
> encoding*);
>
> class MyClass {
>     *@BinaryString(encoding = "Cp1251")*
>     private String myCyrillicString;
> }
>
> Vladimir.
>
> On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <ds...@apache.org>
> wrote:
>
> > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <sergi.vladykin@gmail.com
> >
> > wrote:
> >
> > > In SQL indexes we may store partial strings and assume them to be in
> > UTF-8,
> > > I don't think this can be abstracted away. But may be this is not a big
> > > deal if in indexes we still will use UTF-8.
> > >
> >
> > Sergi, why does it matter if it is UTF8 or custom encoding? Why can't we
> > use our own compact encoding in indexes?
> >
> >
> > >
> > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <ds...@apache.org>:
> > >
> > > > Val, do you know how we compare strings in SQL queries? Will we be
> able
> > > to
> > > > use this encoder?
> > > >
> > > > Additionally, I think that the encoder is a bit too abstract. Why not
> > go
> > > > even further and allow users create their own ASCII table for
> encoding?
> > > >
> > > > D.
> > > >
> > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > valentin.kulichenko@gmail.com> wrote:
> > > >
> > > > > Andrey,
> > > > >
> > > > > Can you elaborate more on this? What is your concern?
> > > > >
> > > > > -Val
> > > > >
> > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > andrey.mashenkov@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Val,
> > > > > >
> > > > > > > Looks like it makes sense.
> > > > > > >
> > > > > > > This will not affect FullText index, as Lucene has own format for
> > > > > storing
> > > > > > > data.
> > > > > > >
> > > > > > > But... would it be compatible with H2 indexing? I doubt it.
> > > > > > >
> > > > > > > On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
> > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > >
> > > > > > > Folks,
> > > > > > >
> > > > > > > Currently binary marshaller always encodes strings in UTF-8.
> > > However,
> > > > > > > sometimes it can be useful to customize this. For example, if
> > data
> > > > > > contains
> > > > > > > a lot of Cyrillic, Chinese or other symbols, but not so many
> > Latin
> > > > > > symbols,
> > > > > > > memory is used very inefficiently. In this case it would be
> great
> > > to
> > > > > > encode
> > > > > > > most frequently used symbols in one byte instead of two or
> three.
> > > > > > >
> > > > > > > I propose to introduce BinaryStringEncoder interface that will
> > > > convert
> > > > > > > strings to byte arrays and back, and make it pluggable via
> > > > > > > BinaryConfiguration. This will allow users to plug in any
> > encoding
> > > > > > > algorithms based on their requirements.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > >
> > > > > > > -Val
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Vladimir Ozerov <vo...@gridgain.com>.

Valya,

Personally, I vote against this feature. BinaryConfiguration has proven to be
inconvenient: it has to be configured before node start, it cannot be
changed at runtime, and it requires classes on the server. Moreover, if you
decide to change the encoding at some point, it would be impossible.

I think we should add this feature at the API level instead. If a string is
written in non-UTF-8 form, we will write it in a different format:
[encoding_code][string]

BinaryWriter.writeString(String fieldName, String val);
BinaryWriter.writeString(String fieldName, String val, *String encoding*);

BinaryReader.readString(String fieldName);
BinaryReader.readString(String fieldName, *String encoding*);

BinaryObjectBuilder.writeString(String fieldName, String val, *String
encoding*);

class MyClass {
    *@BinaryString(encoding = "Cp1251")*
    private String myCyrillicString;
}
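
To make the [encoding_code][string] layout more concrete, here is a minimal standalone sketch. It assumes a one-byte encoding code followed by a four-byte length and the encoded bytes; the thread does not specify the actual layout, and the class, the code table and the method names below are hypothetical and do not exist in Ignite.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an [encoding_code][string] layout, not Ignite code.
public class EncodedStringSketch {
    // Assumed code table: the array index is the encoding code stored in the first byte.
    private static final Charset[] CHARSETS = {
        StandardCharsets.UTF_8,          // code 0
        Charset.forName("windows-1251")  // code 1
    };

    /** Writes [code:1 byte][length:4 bytes][payload] using the charset for the given code. */
    static byte[] write(String val, int code) {
        byte[] payload = val.getBytes(CHARSETS[code]);

        return ByteBuffer.allocate(1 + 4 + payload.length)
            .put((byte) code)
            .putInt(payload.length)
            .put(payload)
            .array();
    }

    /** Reads the encoding code first, then decodes the payload with the matching charset. */
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);

        Charset cs = CHARSETS[buf.get()];
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);

        return new String(payload, cs);
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, given as Unicode escapes.
        String s = "\u041f\u0440\u0438\u0432\u0435\u0442";

        System.out.println("UTF-8 record:        " + write(s, 0).length + " bytes");
        System.out.println("windows-1251 record: " + write(s, 1).length + " bytes");
        System.out.println("Round trip OK:       " + s.equals(read(write(s, 1))));
    }
}
```

Because the code travels with every value, a reader does not need any node-level configuration to decode the string, which is the point of doing this at the API level rather than in BinaryConfiguration.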

Vladimir.

On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <ds...@apache.org>
wrote:

> On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <se...@gmail.com>
> wrote:
>
> > In SQL indexes we may store partial strings and assume them to be in
> UTF-8,
> > I don't think this can be abstracted away. But may be this is not a big
> > deal if in indexes we still will use UTF-8.
> >
>
> Sergi, why does it matter if it is UTF8 or custom encoding? Why can't we
> use our own compact encoding in indexes?
>
>
> >
> > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <ds...@apache.org>:
> >
> > > Val, do you know how we compare strings in SQL queries? Will we be able
> > to
> > > use this encoder?
> > >
> > > Additionally, I think that the encoder is a bit too abstract. Why not
> go
> > > even further and allow users create their own ASCII table for encoding?
> > >
> > > D.
> > >
> > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > valentin.kulichenko@gmail.com> wrote:
> > >
> > > > Andrey,
> > > >
> > > > Can you elaborate more on this? What is your concern?
> > > >
> > > > -Val
> > > >
> > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > andrey.mashenkov@gmail.com>
> > > > wrote:
> > > >
> > > > > Val,
> > > > >
> > > > > > Looks like it makes sense.
> > > > > >
> > > > > > This will not affect FullText index, as Lucene has own format for
> > > > storing
> > > > > > data.
> > > > > >
> > > > > > But... would it be compatible with H2 indexing? I doubt it.
> > > > > >
> > > > > > On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
> > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > Currently binary marshaller always encodes strings in UTF-8.
> > However,
> > > > > > sometimes it can be useful to customize this. For example, if
> data
> > > > > contains
> > > > > > a lot of Cyrillic, Chinese or other symbols, but not so many
> Latin
> > > > > symbols,
> > > > > > memory is used very inefficiently. In this case it would be great
> > to
> > > > > encode
> > > > > > most frequently used symbols in one byte instead of two or three.
> > > > > >
> > > > > > I propose to introduce BinaryStringEncoder interface that will
> > > convert
> > > > > > strings to byte arrays and back, and make it pluggable via
> > > > > > BinaryConfiguration. This will allow users to plug in any
> encoding
> > > > > > algorithms based on their requirements.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > >
> > > > > > -Val
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <se...@gmail.com>
wrote:

> In SQL indexes we may store partial strings and assume them to be in UTF-8,
> I don't think this can be abstracted away. But may be this is not a big
> deal if in indexes we still will use UTF-8.
>

Sergi, why does it matter whether it is UTF-8 or a custom encoding? Why can't
we use our own compact encoding in indexes?


>
> 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <ds...@apache.org>:
>
> > Val, do you know how we compare strings in SQL queries? Will we be able
> to
> > use this encoder?
> >
> > Additionally, I think that the encoder is a bit too abstract. Why not go
> > even further and allow users create their own ASCII table for encoding?
> >
> > D.
> >
> > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > valentin.kulichenko@gmail.com> wrote:
> >
> > > Andrey,
> > >
> > > Can you elaborate more on this? What is your concern?
> > >
> > > -Val
> > >
> > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > andrey.mashenkov@gmail.com>
> > > wrote:
> > >
> > > > Val,
> > > >
> > > > > Looks like it makes sense.
> > > > >
> > > > > This will not affect FullText index, as Lucene has own format for
> > > storing
> > > > > data.
> > > > >
> > > > > But... would it be compatible with H2 indexing? I doubt it.
> > > > >
> > > > > On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
> > > > > valentin.kulichenko@gmail.com> wrote:
> > > >
> > > > > Folks,
> > > > >
> > > > > Currently binary marshaller always encodes strings in UTF-8.
> However,
> > > > > sometimes it can be useful to customize this. For example, if data
> > > > contains
> > > > > a lot of Cyrillic, Chinese or other symbols, but not so many Latin
> > > > symbols,
> > > > > memory is used very inefficiently. In this case it would be great
> to
> > > > encode
> > > > > most frequently used symbols in one byte instead of two or three.
> > > > >
> > > > > I propose to introduce BinaryStringEncoder interface that will
> > convert
> > > > > strings to byte arrays and back, and make it pluggable via
> > > > > BinaryConfiguration. This will allow users to plug in any encoding
> > > > > algorithms based on their requirements.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > >
> > > > > -Val
> > > > >
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Sergi Vladykin <se...@gmail.com>.

In SQL indexes we may store partial strings and assume them to be in UTF-8;
I don't think this can be abstracted away. But maybe this is not a big
deal if we still use UTF-8 in indexes.
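
One way the encoding of stored bytes could matter for an index, shown purely as an illustration: if encoded bytes are compared directly, UTF-8 keeps code point order, while a single-byte charset may not. The sketch below is not a description of how Ignite or H2 indexes actually compare strings, and the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Ignite or H2 code.
public class ByteOrderSketch {
    /** Lexicographic comparison of two byte arrays, treating bytes as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);

        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);

            if (d != 0)
                return d;
        }

        return a.length - b.length;
    }

    /** Sign of the byte-wise comparison of x and y after encoding both in the given charset. */
    static int byteOrder(String x, String y, Charset cs) {
        return Integer.signum(compareUnsigned(x.getBytes(cs), y.getBytes(cs)));
    }

    public static void main(String[] args) {
        String a = "\u0430";  // Cyrillic small letter a
        String yo = "\u0451"; // Cyrillic small letter io

        System.out.println("String.compareTo: " + Integer.signum(a.compareTo(yo)));
        System.out.println("UTF-8 bytes:      " + byteOrder(a, yo, StandardCharsets.UTF_8));
        System.out.println("windows-1251:     " + byteOrder(a, yo, Charset.forName("windows-1251")));
    }
}
```

For this pair of letters the UTF-8 bytes sort the same way as the Java strings, while the windows-1251 bytes sort the other way around, so code that compares stored prefixes byte by byte would need to know which encoding produced them.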

Sergi

2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <ds...@apache.org>:

> Val, do you know how we compare strings in SQL queries? Will we be able to
> use this encoder?
>
> Additionally, I think that the encoder is a bit too abstract. Why not go
> even further and allow users create their own ASCII table for encoding?
>
> D.
>
> On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
>
> > Andrey,
> >
> > Can you elaborate more on this? What is your concern?
> >
> > -Val
> >
> > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > andrey.mashenkov@gmail.com>
> > wrote:
> >
> > > Val,
> > >
> > > > Looks like it makes sense.
> > > >
> > > > This will not affect FullText index, as Lucene has own format for
> storing
> > > > data.
> > > >
> > > > But... would it be compatible with H2 indexing? I doubt it.
> > > >
> > > > On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
> > > > valentin.kulichenko@gmail.com> wrote:
> > >
> > > > Folks,
> > > >
> > > > Currently binary marshaller always encodes strings in UTF-8. However,
> > > > sometimes it can be useful to customize this. For example, if data
> > > contains
> > > > a lot of Cyrillic, Chinese or other symbols, but not so many Latin
> > > symbols,
> > > > memory is used very inefficiently. In this case it would be great to
> > > encode
> > > > most frequently used symbols in one byte instead of two or three.
> > > >
> > > > I propose to introduce BinaryStringEncoder interface that will
> convert
> > > > strings to byte arrays and back, and make it pluggable via
> > > > BinaryConfiguration. This will allow users to plug in any encoding
> > > > algorithms based on their requirements.
> > > >
> > > > Thoughts?
> > > >
> > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > >
> > > > -Val
> > > >
> > >
> >
>

Re: Custom string encoding

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Val, do you know how we compare strings in SQL queries? Will we be able to
use this encoder?

Additionally, I think that the encoder is a bit too abstract. Why not go
even further and allow users to create their own ASCII table for encoding?

D.

On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> Andrey,
>
> Can you elaborate more on this? What is your concern?
>
> -Val
>
> On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> andrey.mashenkov@gmail.com>
> wrote:
>
> > Val,
> >
> > > Looks like it makes sense.
> > >
> > > This will not affect FullText index, as Lucene has own format for storing
> > > data.
> > >
> > > But... would it be compatible with H2 indexing? I doubt it.
> > >
> > > On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
> > > valentin.kulichenko@gmail.com> wrote:
> >
> > > Folks,
> > >
> > > Currently binary marshaller always encodes strings in UTF-8. However,
> > > sometimes it can be useful to customize this. For example, if data
> > contains
> > > a lot of Cyrillic, Chinese or other symbols, but not so many Latin
> > symbols,
> > > memory is used very inefficiently. In this case it would be great to
> > encode
> > > most frequently used symbols in one byte instead of two or three.
> > >
> > > I propose to introduce BinaryStringEncoder interface that will convert
> > > strings to byte arrays and back, and make it pluggable via
> > > BinaryConfiguration. This will allow users to plug in any encoding
> > > algorithms based on their requirements.
> > >
> > > Thoughts?
> > >
> > > https://issues.apache.org/jira/browse/IGNITE-5655
> > >
> > > -Val
> > >
> >
>

Re: Custom string encoding

Posted by Valentin Kulichenko <va...@gmail.com>.

Andrey,

Can you elaborate more on this? What is your concern?

-Val

On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <an...@gmail.com>
wrote:

> Val,
>
> Looks like it makes sense.
>
> This will not affect FullText index, as Lucene has own format for storing
> data.
>
> But... would it be compatible with H2 indexing? I doubt it.
>
> On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
> valentin.kulichenko@gmail.com> wrote:
>
> > Folks,
> >
> > Currently binary marshaller always encodes strings in UTF-8. However,
> > sometimes it can be useful to customize this. For example, if data
> contains
> > a lot of Cyrillic, Chinese or other symbols, but not so many Latin
> symbols,
> > memory is used very inefficiently. In this case it would be great to
> encode
> > most frequently used symbols in one byte instead of two or three.
> >
> > I propose to introduce BinaryStringEncoder interface that will convert
> > strings to byte arrays and back, and make it pluggable via
> > BinaryConfiguration. This will allow users to plug in any encoding
> > algorithms based on their requirements.
> >
> > Thoughts?
> >
> > https://issues.apache.org/jira/browse/IGNITE-5655
> >
> > -Val
> >
>

Re: Custom string encoding

Posted by Andrey Mashenkov <an...@gmail.com>.

Val,

Looks like it makes sense.

This will not affect the full-text index, as Lucene has its own format for
storing data.

But... would it be compatible with H2 indexing? I doubt it.

On Jul 1, 2017 at 2:27, "Valentin Kulichenko" <
valentin.kulichenko@gmail.com> wrote:

> Folks,
>
> Currently binary marshaller always encodes strings in UTF-8. However,
> sometimes it can be useful to customize this. For example, if data contains
> a lot of Cyrillic, Chinese or other symbols, but not so many Latin symbols,
> memory is used very inefficiently. In this case it would be great to encode
> most frequently used symbols in one byte instead of two or three.
>
> I propose to introduce BinaryStringEncoder interface that will convert
> strings to byte arrays and back, and make it pluggable via
> BinaryConfiguration. This will allow users to plug in any encoding
> algorithms based on their requirements.
>
> Thoughts?
>
> https://issues.apache.org/jira/browse/IGNITE-5655
>
> -Val
>