You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/03/09 18:24:46 UTC

[GitHub] [parquet-format] pitrou opened a new pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

pitrou opened a new pull request #168:
URL: https://github.com/apache/parquet-format/pull/168


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on a change in pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#discussion_r590644296



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -481,9 +481,10 @@ enum CompressionCodec {
   SNAPPY = 1;
   GZIP = 2;
   LZO = 3;
-  BROTLI = 4; // Added in 2.4
-  LZ4 = 5;    // Added in 2.4
-  ZSTD = 6;   // Added in 2.4
+  BROTLI = 4;  // Added in 2.4
+  LZ4 = 5;     // DEPRECATED (Added in 2.4)
+  ZSTD = 6;    // Added in 2.4
+  LZ4_RAW = 7; // Added in 2.9

Review comment:
       This is proposing the LZ4 _block_ format, not the LZ4 _frame_ format which the dependent blocks error seems to be stemming from.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] emkornfield commented on a change in pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#discussion_r590639574



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -481,9 +481,10 @@ enum CompressionCodec {
   SNAPPY = 1;
   GZIP = 2;
   LZO = 3;
-  BROTLI = 4; // Added in 2.4
-  LZ4 = 5;    // Added in 2.4
-  ZSTD = 6;   // Added in 2.4
+  BROTLI = 4;  // Added in 2.4
+  LZ4 = 5;     // DEPRECATED (Added in 2.4)
+  ZSTD = 6;    // Added in 2.4
+  LZ4_RAW = 7; // Added in 2.9

Review comment:
       I don't think we should be too hasty to add LZ4 back into the spec, or at least we should narrow the scope.  The integration of LZ4 with arrow has exposed problematic java implementations of LZ4: https://github.com/apache/arrow/pull/8949#issuecomment-794124675




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] gszadovszky commented on a change in pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#discussion_r591231620



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -481,9 +481,10 @@ enum CompressionCodec {
   SNAPPY = 1;
   GZIP = 2;
   LZO = 3;
-  BROTLI = 4; // Added in 2.4
-  LZ4 = 5;    // Added in 2.4
-  ZSTD = 6;   // Added in 2.4
+  BROTLI = 4;  // Added in 2.4

Review comment:
       Since the thrift file (and the generated code) can be a starting point to the parquet spec it would be a good idea to add a reference to the new compressions spec in the comment.

##########
File path: Compression.md
##########
@@ -0,0 +1,86 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet compression definitions
+
+This document contains the specification of all supported compression codecs.
+
+## Overview
+
+Parquet allows the data block inside dictionary pages and data pages to
+be compressed for better space efficiency.  The Parquet specification allows
+several compression codecs, most of which are optional to implement.  Only the

Review comment:
       I am not sure it is right place to say if a codec is optional to implement or not. Are they all optional BTW? I think we should leave it to the core features document to list if a codec is required or optional.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou merged pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou merged pull request #168:
URL: https://github.com/apache/parquet-format/pull/168


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-795136194


   > What do you think about adding links to the codec specifications directly instead of to the C/C++ implementations?
   
   I was not aware that those even existed. I'll add them as far as I can find them.
   
   > Is there a reason behind you consequently use double spaces between sentences in the .md file?
   
   Ah. This is just muscle memory at this point, since this is the convention in the CPython docs (though that seems to have been [relaxed](https://devguide.python.org/documenting/#use-of-whitespace)) :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-796877818


   It does have a couple [other things](https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md), including a magic number and an optional checksum (computed using xxHash32 rather than CRC32). None of them seem interesting to us, though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-795853899


   Thanks for the reference. I'll wait a bit in case I get some feedback from the LZ4 mailing-list.
   
   In any case, what is the procedure for merging an update in the parquet-format repo?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-797009554


   I've opened https://issues.apache.org/jira/browse/PARQUET-1998 for the C++ implementation. @ggershinsky Can I let you do the same for the Java implementation?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] emkornfield commented on a change in pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#discussion_r590647828



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -481,9 +481,10 @@ enum CompressionCodec {
   SNAPPY = 1;
   GZIP = 2;
   LZO = 3;
-  BROTLI = 4; // Added in 2.4
-  LZ4 = 5;    // Added in 2.4
-  ZSTD = 6;   // Added in 2.4
+  BROTLI = 4;  // Added in 2.4
+  LZ4 = 5;     // DEPRECATED (Added in 2.4)
+  ZSTD = 6;    // Added in 2.4
+  LZ4_RAW = 7; // Added in 2.9

Review comment:
       Hmm, OK, as long as we confirm that consumption can happen bidirectionally we the java library I'm happy :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] gszadovszky commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-796889548


   Thanks for the link. I agree we are good to use the raw format. I cannot see any potential benefit for using the framed one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on a change in pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#discussion_r590649223



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -481,9 +481,10 @@ enum CompressionCodec {
   SNAPPY = 1;
   GZIP = 2;
   LZO = 3;
-  BROTLI = 4; // Added in 2.4
-  LZ4 = 5;    // Added in 2.4
-  ZSTD = 6;   // Added in 2.4
+  BROTLI = 4;  // Added in 2.4
+  LZ4 = 5;     // DEPRECATED (Added in 2.4)
+  ZSTD = 6;    // Added in 2.4
+  LZ4_RAW = 7; // Added in 2.9

Review comment:
       It certainly can theoretically happen. It's not the task of this PR to write an implementation, though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on a change in pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#discussion_r591254094



##########
File path: Compression.md
##########
@@ -0,0 +1,86 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet compression definitions
+
+This document contains the specification of all supported compression codecs.
+
+## Overview
+
+Parquet allows the data block inside dictionary pages and data pages to
+be compressed for better space efficiency.  The Parquet specification allows
+several compression codecs, most of which are optional to implement.  Only the

Review comment:
       Fair enough. The core features document is indeed a good place for such considerations.

##########
File path: src/main/thrift/parquet.thrift
##########
@@ -481,9 +481,10 @@ enum CompressionCodec {
   SNAPPY = 1;
   GZIP = 2;
   LZO = 3;
-  BROTLI = 4; // Added in 2.4
-  LZ4 = 5;    // Added in 2.4
-  ZSTD = 6;   // Added in 2.4
+  BROTLI = 4;  // Added in 2.4

Review comment:
       Will do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-796687561


   Ok, distilling here the feedback from Yann Collet (the author of LZ4): the frame format is beneficial as it provides a standard encoding for the compressed and uncompressed size of the data. This allows various tools to consume LZ4-compressed files without needing any sideband metadata about data sizes.
   
   My take: Parquet already encodes the compressed and uncompressed size separately, and the data blocks are embedded deep inside the Parquet format, so the benefits of the frame format wouldn't apply to us.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] pitrou commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-795390108


   I've also sent [a message](https://groups.google.com/g/lz4c/c/d6V12JKr5Fw) to the LZ4 mailing-list to gather their feedback on this proposal.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-format] gszadovszky commented on pull request #168: PARQUET-1996: [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #168:
URL: https://github.com/apache/parquet-format/pull/168#issuecomment-796815780


   If the framed format only contains metadata that we already encode in the parquet footer I agree to not use it and keep the spec as is with the raw LZ4 format.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org