Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/01 02:47:17 UTC

[GitHub] [arrow] kiszk commented on a change in pull request #7789: PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

kiszk commented on a change in pull request #7789:
URL: https://github.com/apache/arrow/pull/7789#discussion_r463911559



##########
File path: cpp/src/arrow/util/compression_lz4.cc
##########
@@ -349,6 +350,90 @@ class Lz4Codec : public Codec {
   const char* name() const override { return "lz4_raw"; }
 };
 
+// ----------------------------------------------------------------------
+// Lz4 Hadoop "raw" codec implementation
+
+class Lz4HadoopCodec : public Lz4Codec {
+ public:
+  Result<int64_t> Decompress(int64_t input_len, const uint8_t* input,
+                             int64_t output_buffer_len, uint8_t* output_buffer) override {
+    // The following variables only make sense if the parquet file being read was
+    // compressed using the Hadoop Lz4Codec.
+    //
+    // Parquet files written with the Hadoop Lz4Codec contain at the beginning
+    // of the input buffer two uint32_t's representing (in this order) expected
+    // decompressed size in bytes and expected compressed size in bytes.
+    const uint32_t* input_as_uint32 = reinterpret_cast<const uint32_t*>(input);
+    uint32_t expected_decompressed_size = ARROW_BYTE_SWAP32(input_as_uint32[0]);
+    uint32_t expected_compressed_size = ARROW_BYTE_SWAP32(input_as_uint32[1]);

Review comment:
   Yes, using `ARROW_BYTE_SWAP..` is not endian-independent: the swap is unconditional, so the values come out right only on a little-endian host.
   It would also help to reference the header format, or the Hadoop LZ4 codec source that writes the header, for ease of review.
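
   For illustration, a minimal sketch of reading the two header fields without depending on host byte order. It assumes the fields are stored big-endian, which the byte swap in the diff implies but which should be confirmed against the Hadoop source. `LoadBigEndian32`, `HadoopLz4Header` and `ParseHadoopLz4Header` are hypothetical names, not existing Arrow APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Arrow API): assemble a 32-bit value from four
// bytes stored most-significant byte first. Because it works on individual
// bytes, the result is the same on little- and big-endian hosts, and it
// places no alignment requirement on `p`.
inline uint32_t LoadBigEndian32(const uint8_t* p) {
  assert(p != nullptr);
  return (static_cast<uint32_t>(p[0]) << 24) |
         (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) |
         static_cast<uint32_t>(p[3]);
}

// Hypothetical struct mirroring the header described in the diff comment:
// expected decompressed size first, then expected compressed size.
struct HadoopLz4Header {
  uint32_t decompressed_size;
  uint32_t compressed_size;
};

// Returns false if the buffer is too short to hold the 8-byte header.
// Assumption: both fields are big-endian on disk.
inline bool ParseHadoopLz4Header(const uint8_t* input, int64_t input_len,
                                 HadoopLz4Header* out) {
  constexpr int64_t kHeaderSize = 2 * sizeof(uint32_t);
  if (input == nullptr || out == nullptr || input_len < kHeaderSize) {
    return false;
  }
  out->decompressed_size = LoadBigEndian32(input);
  out->compressed_size = LoadBigEndian32(input + sizeof(uint32_t));
  return true;
}
```

   Reading byte by byte also avoids the `reinterpret_cast<const uint32_t*>(input)` in the diff, which assumes `input` is suitably aligned for `uint32_t`.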




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org