Posted to dev@ignite.apache.org by Vladimir Ozerov <vo...@gridgain.com> on 2017/08/01 09:23:24 UTC

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Managing encoding at the per-cache level is not that complex.
Essentially, when any cache message is prepared on the initiating node, we
perform an Object -> BinaryObject transition. These places have a reference to
the cache context ([1], [2]). This is where we should pick the proper string
encoding: either the global encoding or the cache-specific one.
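
A minimal sketch of that lookup, for illustration only. EncodingResolver and
its methods are hypothetical names and not part of the Ignite API; the sketch
also assumes the JDK in use ships the windows-1251 charset. It only shows the
"cache-specific encoding if configured, otherwise global" rule:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not an Ignite class): picks the string encoding for a cache. */
final class EncodingResolver {
    /** Global default, e.g. UTF-8. */
    private final Charset globalEncoding;

    /** Optional per-cache overrides, keyed by cache name. */
    private final Map<String, Charset> perCacheEncoding = new HashMap<>();

    EncodingResolver(Charset globalEncoding) {
        this.globalEncoding = globalEncoding;
    }

    /** Registers a cache-specific encoding; this can happen at runtime. */
    void setCacheEncoding(String cacheName, Charset encoding) {
        perCacheEncoding.put(cacheName, encoding);
    }

    /** Cache-specific encoding if one is configured, otherwise the global one. */
    Charset resolve(String cacheName) {
        return perCacheEncoding.getOrDefault(cacheName, globalEncoding);
    }

    /** Encodes a string the way it would be stored for the given cache. */
    byte[] encode(String cacheName, String value) {
        return value.getBytes(resolve(cacheName));
    }

    public static void main(String[] args) {
        EncodingResolver resolver = new EncodingResolver(StandardCharsets.UTF_8);
        resolver.setCacheEncoding("ruPersons", Charset.forName("windows-1251"));

        String name = "\u0418\u0432\u0430\u043d"; // a four-letter Cyrillic name
        System.out.println(resolver.encode("usPersons", name).length); // global UTF-8
        System.out.println(resolver.encode("ruPersons", name).length); // per-cache override
    }
}
```

In a real implementation the override would presumably live in the cache
configuration and be consulted where the cache context performs the
conversion; the map above is just a stand-in.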

As for per-column encoding, let's put this fine-grained case aside for a
while. It is not as widely used as the global or per-cache/per-table scenario.

[1] org.apache.ignite.internal.processors.cache.GridCacheContext#toCacheKeyObject(java.lang.Object)
[2] org.apache.ignite.internal.processors.cache.GridCacheContext#toCacheObject

On Fri, Jul 28, 2017 at 8:08 PM, Artem Schitow <ar...@gmail.com>
wrote:

> > String encoding is a concept similar to "collation" in RDBMS. You can
> > define it either globally, or on per-table basis.
>
> Or on per-column (per-field) basis. Though Oracle does not have per-column
> charset, some other databases provide this option.
>
> MySQL:
> - https://dev.mysql.com/doc/refman/5.7/en/create-table.html
> | CHAR[(length)] [BINARY]
> [CHARACTER SET charset_name] [COLLATE collation_name]
>
> | VARCHAR(length) [BINARY]
> [CHARACTER SET charset_name] [COLLATE collation_name]
>
> | TEXT [BINARY]
> [CHARACTER SET charset_name] [COLLATE collation_name]
>
> SQL Server:
> - https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
> <column_definition> ::=
> column_name <data_type>
>     [ FILESTREAM ]
>     [ COLLATE collation_name ]
>
> Postgres:
> - https://www.postgresql.org/docs/9.6/static/sql-createtable.html
> CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF
> NOT EXISTS ] table_name
>  ( [
>   {
> column_name data_type [ COLLATE collation ]
>
> > 1) I have a class Person with field "name". I have two caches/tables - one
> > for US persons, where name is in Latin, another for RU persons with
> > Cyrillic names. How can I achieve optimal encoding formats for both tables?
>
> You have to have two classes in this case, maybe with a common parent. Or
> you have to select a common denominator and settle on one encoding for
> both of them, like Java did with UTF-16 in java.lang.String.
>
> —
> Artem Schitow
> artem.schitow@gmail.com
>
>
>
>
> > On 28 Jul 2017, at 14:45, Vladimir Ozerov <vo...@gridgain.com> wrote:
> >
> > String encoding is a concept similar to "collation" in RDBMS. You can
> > define it either globally, or on per-table basis. The same should be done
> > for Ignite. We do not define behavior of a type. We define behavior of a
> > *storage*.
> >
> > Two cases where the proposed per-type and per-type-field approach
> > doesn't work:
> > 1) I have a class Person with field "name". I have two caches/tables - one
> > for US persons, where name is in Latin, another for RU persons with
> > Cyrillic names. How can I achieve optimal encoding formats for both tables?
> > 2) I have an empty grid. Now I want to create a cache/table with custom
> > encoding. How can I do that without a cluster restart? There is no way,
> > because BinaryTypeConfiguration is configured statically, while
> > caches/tables can be created at runtime.
> >
> > On Fri, Jul 28, 2017 at 2:38 PM, Pavel Tupitsyn <pt...@apache.org>
> > wrote:
> >
> >>> As Pavel mentioned, Marshaller should not be tied to cache
> >>> should be added to per-cache level
> >> Not sure if I follow.
> >> Marshalling and caching are two separate mechanisms.
> >> Defining binary format in CacheConfiguration violates separation of
> >> concerns.
> >>
> >>> Encoding *must not* be added to per-class or per-field level, this is
> >>> wrong
> >> What is wrong with this? BinaryTypeConfiguration looks like the right place
> >> for such a setting.
> >> Are we talking from SQL standpoint here, so you want this to be defined
> >> somehow via DDL in future?
> >>
> >> On Fri, Jul 28, 2017 at 2:30 PM, Vladimir Ozerov <vo...@gridgain.com>
> >> wrote:
> >>
> >>> Encoding *must not* be added to per-class or per-field level, this is
> >>> wrong.
> >>>
> >>> It should be added to per-cache level, and to per-cache-column level in
> >>> future.
> >>>
> >>> On Fri, Jul 28, 2017 at 14:27, Andrey Kuznetsov <st...@gmail.com> wrote:
> >>>
> >>>> We discussed this with Pavel and Anton just a moment ago. Summary
> >>>> follows.
> >>>>
> >>>> - New byte "flag" is to be added (ENCODED_STRING)
> >>>> - 'Encoding' property is to be added at
> >>>>  -- global level (BinaryConfiguration)
> >>>>  -- per-class level (BinaryTypeConfiguration)
> >>>>  -- per-field level (BinaryTypeConfiguration)
> >>>>
> >>>> 2017-07-28 14:15 GMT+03:00 Vladimir Ozerov [via Apache Ignite
> >>>> Developers] <ml+s2346864n20159h78@n4.nabble.com>:
> >>>>
> >>>>> As Pavel mentioned, Marshaller should not be tied to cache, BinaryObject
> >>>>> should be self-explanatory, i.e. containing all information necessary for
> >>>>> unmarshalling. This is an absolute requirement.
> >>>>>
> >>>>> We will have one extra byte in serialized form, meaning that the advantage
> >>>>> of custom encoding will become evident for all strings with length >= 1,
> >>>>> which is perfectly fine. I do not quite understand what we are arguing
> >>>>> about.
> >>>>>
> >>>>> As far as configuration, we can do it as follows:
> >>>>>
> >>>>> 1) Add global encoding, UTF8 by default.
> >>>>> 2) Add per-cache encoding.
> >>>>> 3) Add encoding to JDBC and ODBC driver properties.
> >>>>>
> >>>>> This should be enough.
> >>>>>
> >>>>>
> >>>> --
> >>>> Best regards,
> >>>>  Andrey Kuznetsov.
> >>>>
> >>>
> >>
>
>

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Posted by Pavel Tupitsyn <pt...@apache.org>.
Vladimir, what about binary mode (IgniteCache.withKeepBinary)?
Two caches may have different encoding settings:

BinaryObject obj = cache1.get(key);   // Got fields in utf8
cache2.put(key, obj);  // Fields are expected to be in Windows-1251

What do we do here? Re-build the binary object?
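
At the level of a single string field, "re-building" would amount to decoding
the bytes with the source cache's charset and encoding them again with the
target cache's charset. A rough sketch of just that step follows;
StringReencoder is a hypothetical helper, not an Ignite API, and the sketch
assumes the windows-1251 charset is available in the JDK:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not an Ignite class): converts encoded string bytes between charsets. */
final class StringReencoder {
    /**
     * Decodes {@code src} using {@code from} and encodes the result using {@code to}.
     * Characters that {@code to} cannot represent are replaced by the charset's
     * default replacement byte sequence, so the conversion may be lossy.
     */
    static byte[] reencode(byte[] src, Charset from, Charset to) {
        if (from.equals(to))
            return src; // Nothing to do, the field can be copied as-is.

        return new String(src, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");

        // Field bytes as read from cache1 (UTF-8), converted for cache2 (Windows-1251).
        byte[] utf8 = "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);
        byte[] converted = reencode(utf8, StandardCharsets.UTF_8, cp1251);

        System.out.println(utf8.length + " -> " + converted.length);
    }
}
```

Doing this for every string field on put is extra work, which is part of the
question here.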

Also, what about BinaryRawWriter - do we need encoding support there?

Pavel

