Posted to user@kudu.apache.org by Grant Henke <gh...@cloudera.com> on 2017/11/16 22:19:27 UTC

INT128 Column Support Interest

Hi all,

As a part of adding DECIMAL support to Kudu it was necessary to add
internal support for 128-bit integers. Taking that one step further and
supporting public columns and APIs for 128-bit integers would not be too
much additional work. However, I wanted to gauge the interest from the
community.

My initial thoughts are that having an INT128 column type could be useful
for things like UUIDs, IPv6 addresses, MD5 hashes and other similar types
of data.
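
For example, an MD5 digest is exactly 128 bits and maps directly onto a
signed 128-bit integer; a quick Java sketch (using BigInteger, since no
public INT128 API exists yet):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5AsInt128 {
  public static void main(String[] args) throws Exception {
    // An MD5 digest is exactly 16 bytes (128 bits).
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest("example-row-key".getBytes(StandardCharsets.UTF_8));
    // Interpret the bytes as a signed, big-endian 128-bit integer.
    System.out.println(new BigInteger(digest));
  }
}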

Is there any interest in, or uses for, an INT128 column type? Is anyone
currently using a STRING or BINARY column for 128-bit data?

Thank you,
Grant
-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: INT128 Column Support Interest

Posted by Grant Henke <gh...@cloudera.com>.
>
> I'm somewhat against such a configuration. This being a server-side
> configuration results in Kudu deployments in different environments having
> different sets of available types, which seems very difficult for
> downstream users to deal with.


Yeah I agree. I am not super into the idea.

Even though "least common denominator" kind
> of sucks, it's also not a bad policy for software that aims to be part of a
> pretty diverse ecosystem.


I think because Kudu is generally the "bottom" layer, it would be best to
build new features/types from the bottom up where possible, as opposed to
always playing catch-up with the ecosystem. That said, I think that's only
true given there is interest or demand for the feature or data type. It
doesn't look like that demand exists in this case though.

> I think without clear user demand for >28 digits it's just not worth the
> complexity.


Agreed. Not much response here so we should drop this for now.

> That's a good point. However, I'm guessing that users are more likely to
> intuitively know that "9 digits is enough" more easily than they will know
> that "64 bits is enough". In my experience people underestimate the range
> of 64-bit integers and might choose INT128 if available even if they have
> no need for anywhere near that range.


That makes sense. Instead of supporting INT128 for larger ranges,
if there is demand for more digits we could add support for decimal
precisions 39 to 77 with internal INT256 support (or VarInt).
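
For reference, a signed INT128 tops out at 2^127 - 1 (39 decimal digits)
and a signed INT256 at 2^255 - 1 (77 digits), hence the 39 to 77 range
above. A quick check with BigInteger:

import java.math.BigInteger;

public class MaxDigits {
  public static void main(String[] args) {
    // Largest signed n-bit value is 2^(n-1) - 1.
    BigInteger int128Max = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);
    BigInteger int256Max = BigInteger.ONE.shiftLeft(255).subtract(BigInteger.ONE);
    System.out.println(int128Max.toString().length());  // 39
    System.out.println(int256Max.toString().length());  // 77
  }
}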


On Mon, Nov 20, 2017 at 6:51 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Mon, Nov 20, 2017 at 1:12 PM, Grant Henke <gh...@cloudera.com> wrote:
>
> > Thank you for the feedback. Below are some responses.
> >
> > Do we have a compatible SQL type to map this to in Spark SQL, Impala,
> > > Presto, etc? What type would we map to in Java?
> >
> >
> > In Java we would Map to a BigInteger. Their isn't a perfectly natural
> > mapping for SQL that I know of. It has been mentioned in the past that we
> > could have server side flags to disable/enable the ability to create
> > columns of certain types to prevent users from creating tables that are
> not
> > readable by certain integrations. This problem exists today with the
> BINARY
> > column type.
> >
>
> I'm somewhat against such a configuration. This being a server-side
> configuration results in Kudu deployments in different environments having
> different sets of available types, which seems very difficult for
> downstream users to deal with. Even though "least common denominator" kind
> of sucks, it's also not a bad policy for software that aims to be part of a
> pretty diverse ecosystem.
>
>
>
> >
> > > Why not just _not_ expose it and only expose decimal.
> >
> >
> > Technically decimal only supports 28 9's where INT128 can support
> slightly
> > larger numbers. Their may also be more overhead dealing with a decimal
> > type. Though I am not positive about that.
> >
>
> I think without clear user demand for >28 digits it's just not worth the
> complexity.
>
>
> >
> > Encoders: like Dan mentioned, it seems like we might not be able to do a
> > > very efficient job of encoding these very large integers. Stuff like
> > > bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> > > values. So, I'm a little afraid that we'll end up only with PLAIN and
> > > people will be upset with the storage overhead and performance.
> >
> >
> >  Aren't we going to need efficient encodings in order to make decimal
> work
> > > well, anyway?
> >
> >
> > We will need to ensure performant encoding exists for INT128 to make
> > decimals with a precisions >= 18 work well anyway. We should likely have
> > parity
> > with the other integer types to reduce any confusion about differing
> > precisions having different encoding considerations. Although Presto
> > documents that precision >= 18 are slower than the others. We could do
> > something similar and follow on with improvements.
> >
> > In the current int128 internal patch I know that the RLE doesn't work for
> > int128. I don't have a lot of background on Kudu's encoding details, so
> > investigating encodings further is one of my next steps.
> >
>
> That's a good point. However, I'm guessing that users are more likely to
> intuitively know that "9 digits is enough" more easily than they will know
> that "64 bits is enough". In my experience people underestimate the range
> of 64-bit integers and might choose INT128 if available even if they have
> no need for anywhere near that range.
>
> -Todd
>
>
> >
> > On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <da...@apache.org>
> > wrote:
> >
> > > Aren't we going to need efficient encodings in order to make decimal
> work
> > > well, anyway?
> > >
> > > - Dan
> > >
> > > On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <to...@cloudera.com>
> wrote:
> > >
> > >> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <da...@apache.org>
> > >> wrote:
> > >>
> > >> > I think it would be useful.  As far as I've seen the main costs in
> > >> > carrying data types are in writing performant encoders, and updating
> > >> > integrations to work with them.  I'm guessing with 128 bit integers
> > >> there
> > >> > would be some integrations that can't or won't support it, which
> might
> > >> be a
> > >> > cause for confusion.  Overall, though, I think the upsides of
> > efficiency
> > >> > and decreased storage space are compelling.   Do you have a sense
> yet
> > of
> > >> > what encodings are going to be supported down the road (will we get
> to
> > >> full
> > >> > parity with 32/64)?
> > >> >
> > >>
> > >> Yea, my concerns are:
> > >>
> > >> 1) Integrations: do we have a compatible SQL type to map this to in
> > Spark
> > >> SQL, Impala, Presto, etc? What type would we map to in Java? It seems
> > like
> > >> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So,
> if
> > >> we're going to map it the same as decimal anyway, why not just _not_
> > >> expose
> > >> it and only expose decimal? If someone wants to store a 128-bit hash
> as
> > a
> > >> DECIMAL(39) they are free to, of course. Postgres's built-in int types
> > >> only
> > >> go up to 64-bit (bigint)
> > >>
> > >> In addition to the choice of DECIMAL, for things like fixed-length
> > binary
> > >> maybe we are better off later adding a fixed-length BINARY type, like
> > >> BINARY(16) which could be used for storing large hashes? There is
> > >> precedent
> > >> for fixed-length CHAR(n) in SQL, but no such precedent for int128.
> > >>
> > >>
> > >> 2) Encoders: like Dan mentioned, it seems like we might not be able to
> > do
> > >> a
> > >> very efficient job of encoding these very large integers. Stuff like
> > >> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> > >> values. So, I'm a little afraid that we'll end up only with PLAIN and
> > >> people will be upset with the storage overhead and performance.
> > >>
> > >> -Todd
> > >>
> > >> >
> > >> > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <gh...@cloudera.com>
> > >> wrote:
> > >> >
> > >> >> Hi all,
> > >> >>
> > >> >> As a part of adding DECIMAL support to Kudu it was necessary to add
> > >> >> internal support for 128 bit integers. Taking that one step further
> > and
> > >> >> supporting public columns and APIs for 128 bit integers would not
> be
> > >> too
> > >> >> much additional work. However, I wanted to gauge the interest from
> > the
> > >> >> community.
> > >> >>
> > >> >> My initial thoughts are that having an INT128 column type could be
> > >> useful
> > >> >> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar
> > >> types
> > >> >> of data.
> > >> >>
> > >> >> Is there any interest or uses for a INT128 column type? Is anyone
> > >> >> currently using a STRING or BINARY column for 128 bit data?
> > >> >>
> > >> >> Thank you,
> > >> >> Grant
> > >> >> --
> > >> >> Grant Henke
> > >> >> Software Engineer | Cloudera
> > >> >> grant@cloudera.com | twitter.com/gchenke |
> > linkedin.com/in/granthenke
> > >> >>
> > >> >
> > >> >
> > >>
> > >>
> > >> --
> > >> Todd Lipcon
> > >> Software Engineer, Cloudera
> > >>
> > >
> > >
> >
> >
> > --
> > Grant Henke
> > Software Engineer | Cloudera
> > grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: INT128 Column Support Interest

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Nov 20, 2017 at 1:12 PM, Grant Henke <gh...@cloudera.com> wrote:

> Thank you for the feedback. Below are some responses.
>
> > Do we have a compatible SQL type to map this to in Spark SQL, Impala,
> > Presto, etc? What type would we map to in Java?
>
>
> In Java we would map to a BigInteger. There isn't a perfectly natural
> mapping for SQL that I know of. It has been mentioned in the past that we
> could have server-side flags to enable/disable the ability to create
> columns of certain types, to prevent users from creating tables that are not
> readable by certain integrations. This problem exists today with the BINARY
> column type.
>

I'm somewhat against such a configuration. This being a server-side
configuration results in Kudu deployments in different environments having
different sets of available types, which seems very difficult for
downstream users to deal with. Even though "least common denominator" kind
of sucks, it's also not a bad policy for software that aims to be part of a
pretty diverse ecosystem.



>
> > Why not just _not_ expose it and only expose decimal.
>
>
> Technically decimal only supports 28 9's, where INT128 can support slightly
> larger numbers. There may also be more overhead dealing with a decimal
> type, though I am not positive about that.
>

I think without clear user demand for >28 digits it's just not worth the
complexity.


>
> > Encoders: like Dan mentioned, it seems like we might not be able to do a
> > very efficient job of encoding these very large integers. Stuff like
> > bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> > values. So, I'm a little afraid that we'll end up only with PLAIN and
> > people will be upset with the storage overhead and performance.
>
>
> > Aren't we going to need efficient encodings in order to make decimal work
> > well, anyway?
>
>
> We will need to ensure performant encodings exist for INT128 to make
> decimals with precision >= 18 work well anyway. We should likely have
> parity with the other integer types, to reduce any confusion about
> differing precisions having different encoding considerations, although
> Presto documents that precisions >= 18 are slower than the others. We
> could do something similar and follow up with improvements.
>
> In the current int128 internal patch I know that the RLE doesn't work for
> int128. I don't have a lot of background on Kudu's encoding details, so
> investigating encodings further is one of my next steps.
>

That's a good point. However, I'm guessing that users are more likely to
intuitively know that "9 digits is enough" more easily than they will know
that "64 bits is enough". In my experience people underestimate the range
of 64-bit integers and might choose INT128 if available even if they have
no need for anywhere near that range.
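
(Case in point, a signed 64-bit integer already holds 19 decimal digits:)

public class Int64Range {
  public static void main(String[] args) {
    // ~9.2 * 10^18, i.e. 19 digits; far more than most workloads need.
    System.out.println(Long.MAX_VALUE);  // 9223372036854775807
  }
}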

-Todd


>
> On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <da...@apache.org>
> wrote:
>
> > Aren't we going to need efficient encodings in order to make decimal work
> > well, anyway?
> >
> > - Dan
> >
> > On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <to...@cloudera.com> wrote:
> >
> >> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <da...@apache.org>
> >> wrote:
> >>
> >> > I think it would be useful.  As far as I've seen the main costs in
> >> > carrying data types are in writing performant encoders, and updating
> >> > integrations to work with them.  I'm guessing with 128 bit integers
> >> there
> >> > would be some integrations that can't or won't support it, which might
> >> be a
> >> > cause for confusion.  Overall, though, I think the upsides of
> efficiency
> >> > and decreased storage space are compelling.   Do you have a sense yet
> of
> >> > what encodings are going to be supported down the road (will we get to
> >> full
> >> > parity with 32/64)?
> >> >
> >>
> >> Yea, my concerns are:
> >>
> >> 1) Integrations: do we have a compatible SQL type to map this to in
> Spark
> >> SQL, Impala, Presto, etc? What type would we map to in Java? It seems
> like
> >> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
> >> we're going to map it the same as decimal anyway, why not just _not_
> >> expose
> >> it and only expose decimal? If someone wants to store a 128-bit hash as
> a
> >> DECIMAL(39) they are free to, of course. Postgres's built-in int types
> >> only
> >> go up to 64-bit (bigint)
> >>
> >> In addition to the choice of DECIMAL, for things like fixed-length
> binary
> >> maybe we are better off later adding a fixed-length BINARY type, like
> >> BINARY(16) which could be used for storing large hashes? There is
> >> precedent
> >> for fixed-length CHAR(n) in SQL, but no such precedent for int128.
> >>
> >>
> >> 2) Encoders: like Dan mentioned, it seems like we might not be able to
> do
> >> a
> >> very efficient job of encoding these very large integers. Stuff like
> >> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> >> values. So, I'm a little afraid that we'll end up only with PLAIN and
> >> people will be upset with the storage overhead and performance.
> >>
> >> -Todd
> >>
> >> >
> >> > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <gh...@cloudera.com>
> >> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> As a part of adding DECIMAL support to Kudu it was necessary to add
> >> >> internal support for 128 bit integers. Taking that one step further
> and
> >> >> supporting public columns and APIs for 128 bit integers would not be
> >> too
> >> >> much additional work. However, I wanted to gauge the interest from
> the
> >> >> community.
> >> >>
> >> >> My initial thoughts are that having an INT128 column type could be
> >> useful
> >> >> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar
> >> types
> >> >> of data.
> >> >>
> >> >> Is there any interest or uses for a INT128 column type? Is anyone
> >> >> currently using a STRING or BINARY column for 128 bit data?
> >> >>
> >> >> Thank you,
> >> >> Grant
> >> >> --
> >> >> Grant Henke
> >> >> Software Engineer | Cloudera
> >> >> grant@cloudera.com | twitter.com/gchenke |
> linkedin.com/in/granthenke
> >> >>
> >> >
> >> >
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
> >
>
>
> --
> Grant Henke
> Software Engineer | Cloudera
> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: INT128 Column Support Interest

Posted by Grant Henke <gh...@cloudera.com>.
Thank you for the feedback. Below are some responses.

> Do we have a compatible SQL type to map this to in Spark SQL, Impala,
> Presto, etc? What type would we map to in Java?


In Java we would map to a BigInteger. There isn't a perfectly natural
mapping for SQL that I know of. It has been mentioned in the past that we
could have server-side flags to enable/disable the ability to create
columns of certain types, to prevent users from creating tables that are not
readable by certain integrations. This problem exists today with the BINARY
column type.
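
To make that concrete, the kind of round trip the Java client would need
(a hypothetical helper, not an existing Kudu API) is just a BigInteger
padded to exactly 16 bytes of big-endian two's complement:

import java.math.BigInteger;
import java.util.Arrays;

public class Int128Bytes {
  // Sign-extend a BigInteger into exactly 16 bytes (big-endian two's complement).
  static byte[] toInt128Bytes(BigInteger v) {
    byte[] raw = v.toByteArray();  // minimal two's-complement encoding
    if (raw.length > 16) {
      throw new ArithmeticException("value does not fit in 128 bits");
    }
    byte[] out = new byte[16];
    // Fill the leading bytes with the sign byte, then copy the value.
    Arrays.fill(out, 0, 16 - raw.length, (byte) (v.signum() < 0 ? 0xFF : 0x00));
    System.arraycopy(raw, 0, out, 16 - raw.length, raw.length);
    return out;
  }

  static BigInteger fromInt128Bytes(byte[] b) {
    return new BigInteger(b);  // interprets as signed big-endian
  }
}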

> Why not just _not_ expose it and only expose decimal?


Technically decimal only supports 28 9's, where INT128 can support slightly
larger numbers. There may also be more overhead dealing with a decimal
type, though I am not positive about that.

> Encoders: like Dan mentioned, it seems like we might not be able to do a
> very efficient job of encoding these very large integers. Stuff like
> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> values. So, I'm a little afraid that we'll end up only with PLAIN and
> people will be upset with the storage overhead and performance.


> Aren't we going to need efficient encodings in order to make decimal work
> well, anyway?


We will need to ensure performant encodings exist for INT128 to make
decimals with precision >= 18 work well anyway. We should likely have parity
with the other integer types, to reduce any confusion about differing
precisions having different encoding considerations, although Presto
documents that precisions >= 18 are slower than the others. We could do
something similar and follow up with improvements.

In the current int128 internal patch I know that the RLE doesn't work for
int128. I don't have a lot of background on Kudu's encoding details, so
investigating encodings further is one of my next steps.
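
One avenue I plan to evaluate (just a sketch of the idea, not something the
current patch does) is splitting each 128-bit value into its high and low
64-bit halves, so the existing 64-bit encoders could be applied to the two
streams:

import java.math.BigInteger;

public class Int128Split {
  static final BigInteger MASK64 =
      BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

  // Split into (hi, lo) halves; each stream could then go through an
  // existing 64-bit encoder (bitshuffle, bit-packing, ...).
  static long[] split(BigInteger v) {
    long lo = v.and(MASK64).longValue();     // low 64 bits
    long hi = v.shiftRight(64).longValue();  // high 64 bits (carries the sign)
    return new long[] { hi, lo };
  }

  static BigInteger join(long hi, long lo) {
    return BigInteger.valueOf(hi).shiftLeft(64)
        .or(BigInteger.valueOf(lo).and(MASK64));  // treat lo as unsigned
  }
}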

Thank you,
Grant





On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <da...@apache.org> wrote:

> Aren't we going to need efficient encodings in order to make decimal work
> well, anyway?
>
> - Dan
>
> On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <da...@apache.org>
>> wrote:
>>
>> > I think it would be useful.  As far as I've seen the main costs in
>> > carrying data types are in writing performant encoders, and updating
>> > integrations to work with them.  I'm guessing with 128 bit integers
>> there
>> > would be some integrations that can't or won't support it, which might
>> be a
>> > cause for confusion.  Overall, though, I think the upsides of efficiency
>> > and decreased storage space are compelling.   Do you have a sense yet of
>> > what encodings are going to be supported down the road (will we get to
>> full
>> > parity with 32/64)?
>> >
>>
>> Yea, my concerns are:
>>
>> 1) Integrations: do we have a compatible SQL type to map this to in Spark
>> SQL, Impala, Presto, etc? What type would we map to in Java? It seems like
>> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
>> we're going to map it the same as decimal anyway, why not just _not_
>> expose
>> it and only expose decimal? If someone wants to store a 128-bit hash as a
>> DECIMAL(39) they are free to, of course. Postgres's built-in int types
>> only
>> go up to 64-bit (bigint)
>>
>> In addition to the choice of DECIMAL, for things like fixed-length binary
>> maybe we are better off later adding a fixed-length BINARY type, like
>> BINARY(16) which could be used for storing large hashes? There is
>> precedent
>> for fixed-length CHAR(n) in SQL, but no such precedent for int128.
>>
>>
>> 2) Encoders: like Dan mentioned, it seems like we might not be able to do
>> a
>> very efficient job of encoding these very large integers. Stuff like
>> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
>> values. So, I'm a little afraid that we'll end up only with PLAIN and
>> people will be upset with the storage overhead and performance.
>>
>> -Todd
>>
>> >
>> > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <gh...@cloudera.com>
>> wrote:
>> >
>> >> Hi all,
>> >>
>> >> As a part of adding DECIMAL support to Kudu it was necessary to add
>> >> internal support for 128 bit integers. Taking that one step further and
>> >> supporting public columns and APIs for 128 bit integers would not be
>> too
>> >> much additional work. However, I wanted to gauge the interest from the
>> >> community.
>> >>
>> >> My initial thoughts are that having an INT128 column type could be
>> useful
>> >> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar
>> types
>> >> of data.
>> >>
>> >> Is there any interest or uses for a INT128 column type? Is anyone
>> >> currently using a STRING or BINARY column for 128 bit data?
>> >>
>> >> Thank you,
>> >> Grant
>> >> --
>> >> Grant Henke
>> >> Software Engineer | Cloudera
>> >> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>> >>
>> >
>> >
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: INT128 Column Support Interest

Posted by Dan Burkert <da...@apache.org>.
Aren't we going to need efficient encodings in order to make decimal work
well, anyway?

- Dan

On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <da...@apache.org>
> wrote:
>
> > I think it would be useful.  As far as I've seen the main costs in
> > carrying data types are in writing performant encoders, and updating
> > integrations to work with them.  I'm guessing with 128 bit integers there
> > would be some integrations that can't or won't support it, which might
> be a
> > cause for confusion.  Overall, though, I think the upsides of efficiency
> > and decreased storage space are compelling.   Do you have a sense yet of
> > what encodings are going to be supported down the road (will we get to
> full
> > parity with 32/64)?
> >
>
> Yea, my concerns are:
>
> 1) Integrations: do we have a compatible SQL type to map this to in Spark
> SQL, Impala, Presto, etc? What type would we map to in Java? It seems like
> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
> we're going to map it the same as decimal anyway, why not just _not_ expose
> it and only expose decimal? If someone wants to store a 128-bit hash as a
> DECIMAL(39) they are free to, of course. Postgres's built-in int types only
> go up to 64-bit (bigint)
>
> In addition to the choice of DECIMAL, for things like fixed-length binary
> maybe we are better off later adding a fixed-length BINARY type, like
> BINARY(16) which could be used for storing large hashes? There is precedent
> for fixed-length CHAR(n) in SQL, but no such precedent for int128.
>
>
> 2) Encoders: like Dan mentioned, it seems like we might not be able to do a
> very efficient job of encoding these very large integers. Stuff like
> bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> values. So, I'm a little afraid that we'll end up only with PLAIN and
> people will be upset with the storage overhead and performance.
>
> -Todd
>
> >
> > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <gh...@cloudera.com>
> wrote:
> >
> >> Hi all,
> >>
> >> As a part of adding DECIMAL support to Kudu it was necessary to add
> >> internal support for 128 bit integers. Taking that one step further and
> >> supporting public columns and APIs for 128 bit integers would not be too
> >> much additional work. However, I wanted to gauge the interest from the
> >> community.
> >>
> >> My initial thoughts are that having an INT128 column type could be
> useful
> >> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar
> types
> >> of data.
> >>
> >> Is there any interest or uses for a INT128 column type? Is anyone
> >> currently using a STRING or BINARY column for 128 bit data?
> >>
> >> Thank you,
> >> Grant
> >> --
> >> Grant Henke
> >> Software Engineer | Cloudera
> >> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >>
> >
> >
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: INT128 Column Support Interest

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <da...@apache.org> wrote:

> I think it would be useful.  As far as I've seen the main costs in
> carrying data types are in writing performant encoders, and updating
> integrations to work with them.  I'm guessing with 128 bit integers there
> would be some integrations that can't or won't support it, which might be a
> cause for confusion.  Overall, though, I think the upsides of efficiency
> and decreased storage space are compelling.   Do you have a sense yet of
> what encodings are going to be supported down the road (will we get to full
> parity with 32/64)?
>

Yea, my concerns are:

1) Integrations: do we have a compatible SQL type to map this to in Spark
SQL, Impala, Presto, etc? What type would we map to in Java? It seems like
the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
we're going to map it the same as decimal anyway, why not just _not_ expose
it and only expose decimal? If someone wants to store a 128-bit hash as a
DECIMAL(39) they are free to, of course. Postgres's built-in int types only
go up to 64-bit (bigint).

In addition to the choice of DECIMAL, for things like fixed-length binary
maybe we are better off later adding a fixed-length BINARY type, like
BINARY(16), which could be used for storing large hashes? There is precedent
for fixed-length CHAR(n) in SQL, but no such precedent for int128.
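
(For instance, a UUID already round-trips through exactly 16 bytes, so a
hypothetical BINARY(16) column would hold it with zero overhead:)

import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidAsBinary16 {
  // A UUID is two 64-bit halves, i.e. exactly a 16-byte payload.
  static byte[] toBytes(UUID u) {
    return ByteBuffer.allocate(16)
        .putLong(u.getMostSignificantBits())
        .putLong(u.getLeastSignificantBits())
        .array();
  }

  static UUID fromBytes(byte[] b) {
    ByteBuffer buf = ByteBuffer.wrap(b);
    return new UUID(buf.getLong(), buf.getLong());
  }
}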


2) Encoders: like Dan mentioned, it seems like we might not be able to do a
very efficient job of encoding these very large integers. Stuff like
bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
values. So, I'm a little afraid that we'll end up only with PLAIN and
people will be upset with the storage overhead and performance.

-Todd

>
> On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <gh...@cloudera.com> wrote:
>
>> Hi all,
>>
>> As a part of adding DECIMAL support to Kudu it was necessary to add
>> internal support for 128 bit integers. Taking that one step further and
>> supporting public columns and APIs for 128 bit integers would not be too
>> much additional work. However, I wanted to gauge the interest from the
>> community.
>>
>> My initial thoughts are that having an INT128 column type could be useful
>> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar types
>> of data.
>>
>> Is there any interest or uses for a INT128 column type? Is anyone
>> currently using a STRING or BINARY column for 128 bit data?
>>
>> Thank you,
>> Grant
>> --
>> Grant Henke
>> Software Engineer | Cloudera
>> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: INT128 Column Support Interest

Posted by Dan Burkert <da...@apache.org>.
I think it would be useful.  As far as I've seen the main costs in carrying
data types are in writing performant encoders, and updating integrations to
work with them.  I'm guessing with 128 bit integers there would be some
integrations that can't or won't support it, which might be a cause for
confusion.  Overall, though, I think the upsides of efficiency and
decreased storage space are compelling.   Do you have a sense yet of what
encodings are going to be supported down the road (will we get to full
parity with 32/64)?

- Dan

On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <gh...@cloudera.com> wrote:

> Hi all,
>
> As a part of adding DECIMAL support to Kudu it was necessary to add
> internal support for 128 bit integers. Taking that one step further and
> supporting public columns and APIs for 128 bit integers would not be too
> much additional work. However, I wanted to gauge the interest from the
> community.
>
> My initial thoughts are that having an INT128 column type could be useful
> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar types
> of data.
>
> Is there any interest or uses for a INT128 column type? Is anyone
> currently using a STRING or BINARY column for 128 bit data?
>
> Thank you,
> Grant
> --
> Grant Henke
> Software Engineer | Cloudera
> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>
