Posted to dev@parquet.apache.org by Roman Karlstetter <ro...@gmail.com> on 2018/12/18 12:10:49 UTC
Proposal for new floating point encodings

Hi,

Another alternative would be not to introduce a new type, but instead to define additional encodings that are usable with float and double. That is closer to what such QuantizedFloats would actually be: another (in this case lossy) data encoding algorithm.
Speaking of encoding: there are actually some other floating point encoding algorithms (lossy, lossless) that could be implemented for that, e.g., zfp: https://github.com/LLNL/zfp. 
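
As a rough illustration of what a lossy encoding of this kind could look like, here is a minimal sketch of the linear min/max mapping discussed further down in this thread. It is not part of any Parquet specification: the function names `quantize` and `dequantize`, the choice to reserve the highest code for NaN, and the clipping of out-of-range values are all assumptions made for illustration.

```python
import math

def quantize(value, lo, hi, bits):
    """Map a float in [lo, hi] to an unsigned integer code of `bits` bits.

    Hypothetical helper: the highest code is reserved for NaN, and
    out-of-range values (including +-inf) are clipped to lo/hi.
    """
    nan_code = (1 << bits) - 1
    if math.isnan(value):
        return nan_code
    max_code = nan_code - 1              # largest code that carries a value
    clipped = min(max(value, lo), hi)    # also clips +-inf
    return round((clipped - lo) / (hi - lo) * max_code)

def dequantize(code, lo, hi, bits):
    """Inverse mapping: integer code back to an approximate float."""
    nan_code = (1 << bits) - 1
    if code == nan_code:
        return math.nan
    return lo + code / (nan_code - 1) * (hi - lo)

# Example with the temperature range from the original proposal,
# stored in 12 bits instead of a 32-bit float.
LO, HI, BITS = -55.0, 125.0, 12
code = quantize(21.3, LO, HI, BITS)
approx = dequantize(code, LO, HI, BITS)
```

With this layout the round-trip error for in-range values is bounded by half a quantization step, i.e. (hi - lo) / (2^bits - 2) / 2. Using more or fewer bits trades precision against storage space.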

Again, my question would be: how would something like that be integrated into parquet? Do I start by creating pull-requests to the parquet-format project or should I first try to implement a proof-of-concept for such an encoding for one of the implementations of parquet?

Regards,
Roman

From: Zoltan Ivanfi
Sent: Tuesday, 20 November 2018 13:25
To: dev@parquet.apache.org
Cc: rblue@netflix.com
Subject: Re: Proposal for new LogicalType: QuantizedFloat

Hi,

If we introduced such a type, personally I would prefer restricting its
range to regular numbers. I would leave -0, ±inf and the various NaNs to
the real float and double types. NULL will always be a possibility of
course, which already provides some flexibility.

Br,

Zoltan

On Tue, Nov 20, 2018 at 9:19 AM Roman Karlstetter <
roman.karlstetter@gmail.com> wrote:

> Hi,
>
> thanks for your response.
> I already thought about using half-precision. I think that it might be a
> good alternative for use-cases where the values span a very wide range.
> However, when we deal with things like temperature sensor measurements, we
> "waste" precision for high absolute values (that occur only very rarely or
> never at all) and lose precision for small values (which occur frequently).
> In addition to that, half precision is centered at zero (like float and
> double), and that might not be the case for all types of measurement
> values. But I think it makes sense to add support for half precision to
> parquet anyway.
>
> One possible mapping from the encoded representation to the actual value
> is, e.g., to linearly map from given min and max measurement values to a
> range of integers with a given bit-width.
> This makes it easy to trade precision for storage space by using more or
> fewer bits.
>
> Concerning the definition for special values: there are for sure things
> that need special treatment, like handling NaNs, or handling values that
> fall outside the representable range.
> Possible alternatives are:
>  - clipping too large/small values to max/min values. That would include
> +-inf.
>  - NaNs: use one of the encoded values, e.g., 0 or all-bits-1 for NaN
>  - denormal/subnormal or +-zero values: these could just be rounded to the
> closest value that is representable with the chosen encoding.
>
> Now that I think about it again, the name “QuantizedFloat” is probably
> also not ideal, because an IEEE float or double is of course also quantized.
> It’s just that what I have in mind is more regularly quantized in the
> supported interval.
>
> Any further opinions on that?
>
> Roman
>
>
> From: Ryan Blue
> Sent: Friday, 16 November 2018 18:47
> To: Parquet Dev
> Subject: Re: Proposal for new LogicalType: QuantizedFloat
>
> I like this idea because we don't really have any good encoding for
> floating point values other than dictionary encoding. The most effective
> recommendation I have for our users is to know when to use float instead of
> double, which is along the same lines.
>
> I think the next thing to do is to make sure we have a solid definition for
> quantized float. Is it just dropping bits from the significand? What about
> limiting the exponent? How does it work for denormal values?
>
> It may make sense to add support for half-precision (16-bit) floats
> instead. Have you considered that option?
>
> rb
>
> On Thu, Nov 8, 2018 at 1:39 PM Roman Karlstetter <
> roman.karlstetter@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I want to propose a new LogicalType for parquet-format.
> >
> > First, I want to provide some motivation for that type.
> > In a lot of cases for sensor measurement data, the value read from the
> > sensor (ADC) is provided in an integer format, in many cases with a
> > precision of 8 to 16 bit (and almost never 32 bit).
> > However, the raw value is (almost) always converted in some way to a
> > physical unit which is then further processed by applications.
> > A simple example might be a temperature sensor that has a measurement
> > range of -55°C to +125°C and has a precision of 0.0625°C (-> requires 12
> > bit).
> >
> > Applications want to process such data with (single precision) floating
> > point logic.
> > Currently, for that reason, we would store such sensor measurement data as
> > well as analysis results (statistics, ...) as floating point values in the
> > parquet format.
> > However, that is of course not optimal, as we're blowing up the 12 bit from
> > the sensor to 32 bit of floating point data. Moreover, the floating point
> > representation cannot be compressed/encoded as easily as an integer
> > representation, especially with the currently supported encodings for
> > floating point values.
> > The DECIMAL logical type cannot represent all such cases, as it is centered
> > around 0 and does not support precisions like in the example above.
> >
> > Now to my actual request:
> > I suggest introducing a new LogicalType QuantizedFloat (name to be
> > discussed), which makes it possible to represent such sensor data
> > efficiently in the parquet format in integer representation, but which is
> > transformed to floating point values when read in the application.
> > That would require some kind of specification for the mapping of stored
> > values to floating point representation, in the simplest case a linear
> > mapping to a complete range of bits (for the example above: min:-128°C,
> > max:127.9375°C mapped to signed 12 bit integer - the same bits might also
> > be interpreted as Kelvin or even Fahrenheit, and only the min/max range
> > would have to be changed).
> > The uses for such a type would be manifold: it would be capable of storing
> > floating point data which is known to cover only a certain absolute range
> > with a limited number of bits. This is of course a lossy representation of
> > values, but in many scientific or engineering applications, this is
> > acceptable, especially when saving storage space.
> >
> > What is the process of adding something like that and what needs to be
> > implemented?
> >
> > Kind Regards,
> > Roman
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>