Posted to dev@avro.apache.org by "sarutak (via GitHub)" <gi...@apache.org> on 2023/08/23 16:05:29 UTC

[GitHub] [avro] sarutak opened a new pull request, #2463: AVRO-3841: [Spec] Align the specification of the way to encode NaN to the actual implementations

sarutak opened a new pull request, #2463:
URL: https://github.com/apache/avro/pull/2463

   AVRO-3841
   
   ## What is the purpose of the change
   This PR proposes a small change to the specification of how float/double values are encoded.
   
   The specification currently describes the encoding of float/double as follows.
   ```
   a float is written as 4 bytes. The float is converted into a 32-bit integer using a method equivalent to Java’s floatToIntBits and then encoded in little-endian format.
   a double is written as 8 bytes. The double is converted into a 64-bit integer using a method equivalent to Java’s doubleToLongBits and then encoded in little-endian format.
   ```
   
   However, the actual Java implementation uses `floatToRawIntBits`/`doubleToRawLongBits` rather than `floatToIntBits`/`doubleToLongBits`.
   
   They differ in how they encode `NaN`.
   `floatToIntBits`/`doubleToLongBits` collapse every NaN to a single canonical bit pattern, so they don't distinguish between `NaN` and `-NaN`, whereas `floatToRawIntBits`/`doubleToRawLongBits` preserve the original bits and do.
   
   I confirmed that all the implementations distinguish between `NaN` and `-NaN`.
   So I think it's better to modify the specification to match them.
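   
   To make the difference concrete, here is a minimal Java sketch. It is not taken from the Avro code base: the class name `NanBitsDemo`, its helper methods, and the sample bit pattern `0xffc00000` (a quiet NaN with the sign bit set) are chosen only for illustration. It passes the same NaN through both conversions.
   
   ```java
   class NanBitsDemo {
       // Hypothetical sample value: a quiet NaN whose sign bit is set ("-NaN").
       static final int NEGATIVE_NAN_BITS = 0xffc00000;
   
       // Converts the bits to a float and back using the canonicalizing method.
       static int canonicalBits(int rawBits) {
           return Float.floatToIntBits(Float.intBitsToFloat(rawBits));
       }
   
       // Converts the bits to a float and back using the bit-preserving method.
       static int preservedBits(int rawBits) {
           return Float.floatToRawIntBits(Float.intBitsToFloat(rawBits));
       }
   
       public static void main(String[] args) {
           System.out.printf("floatToIntBits:    0x%08x%n", canonicalBits(NEGATIVE_NAN_BITS));
           System.out.printf("floatToRawIntBits: 0x%08x%n", preservedBits(NEGATIVE_NAN_BITS));
       }
   }
   ```
   
   `Float.floatToIntBits` is documented to return `0x7fc00000` for every NaN, while on common hardware `Float.floatToRawIntBits` is expected to hand back `0xffc00000` unchanged, so only the raw variant keeps the sign bit that marks `-NaN`.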
   
   ## Verifying this change
   All the implementations distinguish between `NaN` and `-NaN`, as the following excerpts show.
   
   * Java
   ```
     public static int encodeFloat(float f, byte[] buf, int pos) {
       final int bits = Float.floatToRawIntBits(f);
       buf[pos + 3] = (byte) (bits >>> 24);
       buf[pos + 2] = (byte) (bits >>> 16);
       buf[pos + 1] = (byte) (bits >>> 8);
       buf[pos] = (byte) (bits);
       return 4;
     }
   
     public static int encodeDouble(double d, byte[] buf, int pos) {
       final long bits = Double.doubleToRawLongBits(d);
       int first = (int) (bits & 0xFFFFFFFF);
       int second = (int) ((bits >>> 32) & 0xFFFFFFFF);
       // the compiler seems to execute this order the best, likely due to
       // register allocation -- the lifetime of constants is minimized.
       buf[pos] = (byte) (first);
       buf[pos + 4] = (byte) (second);
       buf[pos + 5] = (byte) (second >>> 8);
       buf[pos + 1] = (byte) (first >>> 8);
       buf[pos + 2] = (byte) (first >>> 16);
       buf[pos + 6] = (byte) (second >>> 16);
       buf[pos + 7] = (byte) (second >>> 24);
       buf[pos + 3] = (byte) (first >>> 24);
       return 8;
     }
   ```
   
   * Rust
   ```
   Value::Float(x) => buffer.extend_from_slice(&x.to_le_bytes()),
   Value::Double(x) => buffer.extend_from_slice(&x.to_le_bytes()),
   ```
   
   * Python
   ```
       def write_float(self, datum: float) -> None:
           """
           A float is written as 4 bytes.
           The float is converted into a 32-bit integer using a method equivalent to
           Java's floatToIntBits and then encoded in little-endian format.
           """
           self.write(STRUCT_FLOAT.pack(datum))
   
       def write_double(self, datum: float) -> None:
           """
           A double is written as 8 bytes.
           The double is converted into a 64-bit integer using a method equivalent to
           Java's doubleToLongBits and then encoded in little-endian format.
           """
           self.write(STRUCT_DOUBLE.pack(datum))
   ```
   
   * C
   ```
   static int write_float(avro_writer_t writer, const float f)
   {
   #if AVRO_PLATFORM_IS_BIG_ENDIAN
           uint8_t buf[4];
   #endif
           union {
                   float f;
                   int32_t i;
           } v;
   
           v.f = f;
   #if AVRO_PLATFORM_IS_BIG_ENDIAN
           buf[0] = (uint8_t) (v.i >> 0);
           buf[1] = (uint8_t) (v.i >> 8);
           buf[2] = (uint8_t) (v.i >> 16);
           buf[3] = (uint8_t) (v.i >> 24);
           AVRO_WRITE(writer, buf, 4);
   #else
           AVRO_WRITE(writer, (void *)&v.i, 4);
   #endif
           return 0;
   }
   
   static int write_double(avro_writer_t writer, const double d)
   {
   #if AVRO_PLATFORM_IS_BIG_ENDIAN
           uint8_t buf[8];
   #endif
           union {
                   double d;
                   int64_t l;
           } v;
   
           v.d = d;
   #if AVRO_PLATFORM_IS_BIG_ENDIAN
           buf[0] = (uint8_t) (v.l >> 0);
           buf[1] = (uint8_t) (v.l >> 8);
           buf[2] = (uint8_t) (v.l >> 16);
           buf[3] = (uint8_t) (v.l >> 24);
           buf[4] = (uint8_t) (v.l >> 32);
           buf[5] = (uint8_t) (v.l >> 40);
           buf[6] = (uint8_t) (v.l >> 48);
           buf[7] = (uint8_t) (v.l >> 56);
           AVRO_WRITE(writer, buf, 8);
   #else
           AVRO_WRITE(writer, (void *)&v.l, 8);
   #endif
           return 0;
   }
   ```
   
   * C++
   ```
   void BinaryEncoder::encodeFloat(float f) {
       const auto *p = reinterpret_cast<const uint8_t *>(&f);
       out_.writeBytes(p, sizeof(float));
   }
   
   void BinaryEncoder::encodeDouble(double d) {
       const auto *p = reinterpret_cast<const uint8_t *>(&d);
       out_.writeBytes(p, sizeof(double));
   }
   ```
   
   * C#
   ```
           public void WriteFloat(float value)
           {
               byte[] buffer = BitConverter.GetBytes(value);
               if (!BitConverter.IsLittleEndian) Array.Reverse(buffer);
               writeBytes(buffer);
           }
   
           public void WriteDouble(double value)
           {
               long bits = BitConverter.DoubleToInt64Bits(value);
   
               writeByte((byte)(bits & 0xFF));
               writeByte((byte)((bits >> 8) & 0xFF));
               writeByte((byte)((bits >> 16) & 0xFF));
               writeByte((byte)((bits >> 24) & 0xFF));
               writeByte((byte)((bits >> 32) & 0xFF));
               writeByte((byte)((bits >> 40) & 0xFF));
               writeByte((byte)((bits >> 48) & 0xFF));
               writeByte((byte)((bits >> 56) & 0xFF));
   
           }
   ```
   
   * Ruby
   ```
         def read_float
           # A float is written as 4 bytes.
           # The float is converted into a 32-bit integer using a method
           # equivalent to Java's floatToRawIntBits and then encoded in
           # little-endian format.
           read_and_unpack(4, 'e')
         end
   
         def read_double
           #  A double is written as 8 bytes.
           # The double is converted into a 64-bit integer using a method
           # equivalent to Java's doubleToRawLongBits and then encoded in
           # little-endian format.
           read_and_unpack(8, 'E')
         end
   ```
   
   * Perl
   ```
   sub encode_float {
       my $class = shift;
       my ($schema, $data, $cb) = @_;
       my $enc = pack "f<", $data;
       $cb->(\$enc);
   }
   
   sub encode_double {
       my $class = shift;
       my ($schema, $data, $cb) = @_;
       my $enc = pack "d<", $data;
       $cb->(\$enc);
   }
   ```
   
   * PHP
   ```
       public static function floatToIntBits($float)
       {
           return pack('g', (float) $float);
       }
   
       public static function doubleToLongBits($double)
       {
           return pack('e', (double) $double);
       }
   ```
   
   * JavaScript
   ```
   Tap.prototype.writeFloat = function (f) {
     var buf = this.buf;
     var pos = this.pos;
     this.pos += 4;
     if (this.pos > buf.length) {
       return;
     }
     return this.buf.writeFloatLE(f, pos);
   };
   
   Tap.prototype.writeDouble = function (d) {
     var buf = this.buf;
     var pos = this.pos;
     this.pos += 8;
     if (this.pos > buf.length) {
       return;
     }
     return this.buf.writeDoubleLE(d, pos);
   };
   ```
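   
   As a rough illustration of what this means on the wire, the sketch below encodes the same `-NaN` float as 4 little-endian bytes, once from its raw bits and once from its canonicalized bits. This is only a sketch under stated assumptions: `NanWireBytes`, `encodeRaw`, `encodeCanonical`, and `hex` are hypothetical names, not part of any Avro API, and the sample bit pattern `0xffc00000` is picked for illustration.
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   
   class NanWireBytes {
       // Encodes from the raw bit pattern, which keeps the NaN sign bit.
       static byte[] encodeRaw(float f) {
           return littleEndian(Float.floatToRawIntBits(f));
       }
   
       // Encodes from the canonicalized bit pattern named in the quoted spec text.
       static byte[] encodeCanonical(float f) {
           return littleEndian(Float.floatToIntBits(f));
       }
   
       // Writes a 32-bit integer as 4 bytes, least significant byte first.
       private static byte[] littleEndian(int bits) {
           return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(bits).array();
       }
   
       // Renders bytes as lowercase hex for easy comparison.
       static String hex(byte[] bytes) {
           StringBuilder sb = new StringBuilder();
           for (byte b : bytes) {
               sb.append(String.format("%02x", b));
           }
           return sb.toString();
       }
   
       public static void main(String[] args) {
           // Hypothetical sample value: a quiet NaN with the sign bit set.
           float negativeNan = Float.intBitsToFloat(0xffc00000);
           System.out.println("raw:       " + hex(encodeRaw(negativeNan)));
           System.out.println("canonical: " + hex(encodeCanonical(negativeNan)));
       }
   }
   ```
   
   The two encodings differ only in the last byte, which carries the sign bit, so a reader decoding the raw form can still tell `-NaN` from `NaN`.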
   
   ## Documentation
   
   This change includes document modification.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] AVRO-3841: [Spec] Align the specification of the way to encode NaN to the actual implementations [avro]

Posted by "opwvhk (via GitHub)" <gi...@apache.org>.

opwvhk merged PR #2463:
URL: https://github.com/apache/avro/pull/2463

