You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by Antoine Pitrou <an...@python.org> on 2021/03/06 15:39:32 UTC

Re: Request deprecation / removal of LZ4 compression

Hello,

We should move this forward.
What should be the procedure for updating parquet-format?

Best regards

Antoine.



On Tue, 16 Feb 2021 17:25:49 +0100
Gabor Szadovszky <ga...@apache.org> wrote:
> Thank you for the detailed summary of the LZ4 situation, Antoine!
> 
> The Parquet file format should be properly specified for every
> implementation. It was the mistake of the parquet-mr developers that we
> thought the Hadoop implementation of LZ4 is according to the LZ4
> specification and the fault of the Parquet community that we extended the
> parquet-format with the LZ4 option without checking that there are 2
> options to choose (not talking about the 3rd Hadoop one).
> 
> I agree that option 4 is the only way we can step forward from this
> situation. We shall deprecate LZ4 in parquet-format and in the mean time we
> should agree on which officially specified LZ4 format do we want to
> introduce.
> 
> Of course, we may try to improve compatibility with the existing LZ4 files
> but we should not encourage our users to use this compression for now.
> 
> Because we already have parquet-mr releases that uses the under-specified
> Haddop LZ4 codec I don't feel it is that urgent to block the current
> parquet-mr release because of this. We shall update parquet-format to make
> the LZ4 situation clear then create a release. Then, we can start working
> on deprecating/blocking the write path of the current implementation and
> implement the properly specified LZ4 support in all the implementations.
> 
> Regards,
> Gabor
> 
> On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <an...@python.org> wrote:
> 
> >
> > Hello,
> >
> > This is a proposal to deprecate and/or remove LZ4 compression from the
> > Parquet specification.
> >
> > Abstract
> > --------
> >
> > Despite several attempts by the parquet-cpp developers, we were not
> > able to reach the point where LZ4-compressed Parquet files are
> > bidirectionally compatible between parquet-cpp and parquet-mr. Other
> > implementations are having, or have had, similar issues.  My conclusion
> > is that the Parquet spec doesn't allow independent reimplementation of
> > the LZ4 compression format required by parquet-mr. Therefore, LZ4
> > compression should be removed from the spec (possibly replaced with
> > another enum value for a properly-specified, interoperable, LZ4-backed
> > compression scheme).
> >
> > About LZ4
> > ---------
> >
> > LZ4 is an extremely fast compression format and library.  Decompression
> > speeds of 4GB/s can routinely be achieved in pure software, with
> > compression speeds around 800 MB/s. Compression ratios are close to
> > those achieved by Snappy and LZO, but at vastly higher speeds.
> >
> > Two formats are officially defined by the LZ4 project: the block format
> > and the frame format.  The frame format, as the name suggests, is
> > higher-level and contains all the information required to decompress
> > arbitrary data buffers.  The block format is to be used when such
> > information is made available separately (for example as part of
> > ancillary metadata, protocol headers, etc.).
> >
> > Core issue
> > ----------
> >
> > LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
> > seems to be the point of reference to look to when implementing the
> > Parquet format) uses a home-baked framing around LZ4 block compression
> > that's not specified anywhere. The only way to get information about it
> > is to read the Hadoop source code.
> >
> > Being unspecified, it also doesn't say if there are additional
> > constraints on the parameters (e.g. frame size).  Such constraints will
> > be implementation-defined, undocumented, and only discoverable through
> > tedious testing and iteration, with no guarantee of ever achieving
> > 100% compatibility.
> >
> > Note that LZ4 compression itself officially comes in two formats: a
> > low-level block format, a higher-level framed format.  But parquet-mr
> > uses a third, custom framing format that's not part of the LZ4 format.
> >
> > History of compatibility issues
> > -------------------------------
> >
> > 1)
> >
> > parquet-cpp originally (naively?) assumed that "LZ4 compression" in the
> > Parquet spec meant the LZ4 block format.  After all, information
> > about data size is already available in the Parquet metadata, so no
> > specific framing is theoretically required around the compressed data.
> > However, it quickly occurred that this interpretation was incompatible
> > with files produced by parquet-mr (and vice-versa: files produced by
> > parquet-cpp could not be read with parquet-mr).
> >
> > A first issue was posted to suggest switching the Parquet spec (and
> > parquet-mr) to the LZ4 framed format:
> > https://issues.apache.org/jira/browse/PARQUET-1241
> >
> > However, this didn't come to a resolution, because people didn't want
> > to break compatibility with previous versions of parquet-mr (note that
> > this concern would necessarily switch the burden of compatibility
> > breakage onto other implempentations).
> >
> > Relatedly, there is an issue open for Hadoop, which also didn't get a
> > resolution:
> > https://issues.apache.org/jira/browse/HADOOP-12990
> >
> > 2)
> >
> > To avoid making things worse, parquet-cpp developers then decided to
> > (temporarily?) disable writing LZ4 files from C++:
> > https://issues.apache.org/jira/browse/ARROW-9424
> >
> > (note that this is parquet-cpp deliberately crippling its own feature
> > set in order to workaround an issue created by an undocumented format
> > implemented in parquet-mr)
> >
> > At this point, though, the LZ4 *reader* in parquet-cpp could still not
> > read files produced by parquet-mr.  So, besides being frustrating for
> > C++ users (and users of bindings to the C++ library, e.g. Python,
> > Ruby...), this decision did also not make interoperability better in
> > the Java -> C++ direction.
> >
> > 3)
> >
> > parquet-cpp developers decided to implement a format detection so as to
> > read LZ4 files produced by parquet-mr, but also LZ4 files produced by
> > previous versions of parquet-cpp.
> > https://issues.apache.org/jira/browse/PARQUET-1878
> >
> > In addition, the write path was reenabled, this time producing files
> > that (should!) conform to the parquet-mr implementation of LZ4
> > compression.
> >
> > This seemed to work fine.
> >
> > 4)
> >
> > While PARQUET-1878 (see above) seemed to work fine in basic tests, it
> > was later reported that some files did not read properly.
> >
> > Indeed, our (parquet-cpp) compatibility code assumed that a single
> > "parquet-mr LZ4 frame" was ever written out for a single data page.
> > However, it turns out that parquet-mr can produce several such frames
> > for larger data pages (it seems to frame at around 128 kiB boundaries).
> >
> > While this seems of course reasonable, the format being undocumented
> > prevented us from foreseeing this situation.  Once the issue diagnosed,
> > we (parquet-cpp developers) pushed a fix for it:
> > https://issues.apache.org/jira/browse/ARROW-11301
> >
> > 5)
> >
> > We were happy after this later fix... only for a short time.  Now the
> > reverse problem happened: some LZ4 files produced by parquet-cpp cannot
> > be read by parquet-mr:
> >
> > https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168
> >
> > I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
> > the fact that parquet-mr uses an unspecified framing format with
> > unspecified constraints and limits prevents other implementations from
> > being compatible.
> >
> > So I asked the reporter to open a parquet-mr issue instead, and then
> > took the liberty of marking the issue as blocker:
> > https://issues.apache.org/jira/browse/PARQUET-1974
> >
> > 6)
> >
> > Unsurprisingly, other Parquet implementations are also suffering from
> > this.
> >
> > * fastparquet (a Parquet implementation written partly in Python) had
> > initially used the LZ4 block format, then switched to the LZ4 frame
> > format to achieve compatibility with parquet-cpp (since both are used
> > in the Python ecosystem):
> > https://github.com/dask/fastparquet/issues/314
> >
> > Of course, fastparquet probably won't be able to read or write files
> > compatible with parquet-mr...
> >
> > * The Rust Parquet implementation that's part of the Rust Arrow
> >   implementation are also having compatibility issues due to making the
> >   wrong guess:
> > https://issues.apache.org/jira/browse/ARROW-11381
> >
> > * Impala seems to have had to change their LZ4 implementation to also
> >   match the undocumented parquet-mr framing format
> > https://issues.apache.org/jira/browse/IMPALA-8617
> >
> > (I have no idea whether this solved all LZ4 problems for Impala, and
> > whether it broke compatibility with previous Impala versions)
> >
> > Possible solutions
> > ------------------
> >
> > 1) Do nothing.  The Parquet ecosystem is forever fragmented, even
> > though the official Parquet documentation doesn't say so.
> >
> > 2) parquet-cpp finally achieves compatibility through additional
> > reverse-engineering efforts.  While I have no authority to prevent
> > this, I also have zero personal motivation to achieve it anymore.  I
> > suspect other parquet-cpp developers are not significantly more
> > motivated than I am (but I may be wrong).
> >
> > 3) parquet-mr takes its share of the effort by striving to be
> > compatible with files produced by (the current iteration of)
> > parquet-cpp. I have no idea whether this is easily doable, and whether
> > there is any motivation or workforce to do it. Therefore, I will
> > conservatively rate this as "unlikely in the short term".
> >
> > 4) Admitting that LZ4 compatibility currently is not an achievable
> > goal, the LZ4 compression option is deprecated and removed from the
> > Parquet spec.  Implementations display a warning message when it is
> > being read.  Writing LZ4 files is disabled.
> >
> > Discussion
> > ----------
> >
> > Solution #1 seems evidently undesirable to me (though it also seems
> > that nobody on the Java side was tremendously impatient to solve this
> > issue).
> >
> > As I said, solution #2 is rather unlikely.
> >
> > Java developers would have to evaluate how likely solution #3 is.  We
> > should avoid depending on promises that nobody is really willing to
> > hold, though.  Please only say "we're going to solve this" if you're
> > really willing to push it through.
> >
> > In any case, even if solution #2 or #3 were to be implemented, it would
> > not necessarily help future implementations, as long as the format is
> > still unspecified.
> >
> > In the end, while a bit frustrating for everyone, solution #4 looks like
> > the only reasonable one for the time being.
> >
> > Future LZ4 format
> > -----------------
> >
> > While we should deprecate the current LZ4 format, this doesn't preclude
> > to add another, this time well-specified, LZ4 format in the future.  It
> > would simply have to use a new `CompressionCodec` enum value in the
> > Thrift definition.
> >
> > Most certainly, it should be either of the two formats officially
> > defined by the LZ4 project (the block format or the frame format).
> >
> >
> > Regards
> >
> > Antoine.
> >
> >
> >  
>

Re: Request deprecation / removal of LZ4 compression

Posted by Antoine Pitrou <an...@python.org>.

Hi Gabor,

I've filed https://issues.apache.org/jira/browse/PARQUET-1996 and
submitted https://github.com/apache/parquet-format/pull/168 for the
format updates.

Hopefully this will get the ball rolling.

Best regards

Antoine.


On Mon, 8 Mar 2021 10:56:32 +0100
Gabor Szadovszky <ga...@apache.org> wrote:
> Hi Antoine,
> 
> We might do anything in parquet-format, without releasing and adapting it,
> it would mean nothing. Also, creating a new parquet-format release only for
> deprecating LZ4 makes little sense to me. So, I would suggest adding a
> simple comment to the LZ4 enum type about its deprecation and a reference
> to the newly supported type for LZ4. We also need the new LZ4 type and a
> proper specification for it. It might be a good idea to start a new doc in
> the parquet-format repo about the compression codec specifications. We then
> reference the related doc from the thrift file comments. After this is
> ready we can initiate a parquet-format release and if it's done we can
> start adapting it.
> 
> My idea about adapting this in parquet-mr is to keep the old LZ4 available
> by configuration which default is to use the new one. (Of course, we also
> need to implement/get the proper implementation of the new LZ4 codec.)
> 
> What do you think?
> 
> Cheers,
> Gabor
> 
> On Sat, Mar 6, 2021 at 4:40 PM Antoine Pitrou <an...@python.org> wrote:
> 
> >
> > Hello,
> >
> > We should move this forward.
> > What should be the procedure for updating parquet-format?
> >
> > Best regards
> >
> > Antoine.
> >
> >
> >
> > On Tue, 16 Feb 2021 17:25:49 +0100
> > Gabor Szadovszky <ga...@apache.org> wrote:  
> > > Thank you for the detailed summary of the LZ4 situation, Antoine!
> > >
> > > The Parquet file format should be properly specified for every
> > > implementation. It was the mistake of the parquet-mr developers that we
> > > thought the Hadoop implementation of LZ4 is according to the LZ4
> > > specification and the fault of the Parquet community that we extended the
> > > parquet-format with the LZ4 option without checking that there are 2
> > > options to choose (not talking about the 3rd Hadoop one).
> > >
> > > I agree that option 4 is the only way we can step forward from this
> > > situation. We shall deprecate LZ4 in parquet-format and in the mean time  
> > we  
> > > should agree on which officially specified LZ4 format do we want to
> > > introduce.
> > >
> > > Of course, we may try to improve compatibility with the existing LZ4  
> > files  
> > > but we should not encourage our users to use this compression for now.
> > >
> > > Because we already have parquet-mr releases that uses the under-specified
> > > Haddop LZ4 codec I don't feel it is that urgent to block the current
> > > parquet-mr release because of this. We shall update parquet-format to  
> > make  
> > > the LZ4 situation clear then create a release. Then, we can start working
> > > on deprecating/blocking the write path of the current implementation and
> > > implement the properly specified LZ4 support in all the implementations.
> > >
> > > Regards,
> > > Gabor
> > >
> > > On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <an...@python.org>  
> > wrote:  
> > >  
> > > >
> > > > Hello,
> > > >
> > > > This is a proposal to deprecate and/or remove LZ4 compression from the
> > > > Parquet specification.
> > > >
> > > > Abstract
> > > > --------
> > > >
> > > > Despite several attempts by the parquet-cpp developers, we were not
> > > > able to reach the point where LZ4-compressed Parquet files are
> > > > bidirectionally compatible between parquet-cpp and parquet-mr. Other
> > > > implementations are having, or have had, similar issues.  My conclusion
> > > > is that the Parquet spec doesn't allow independent reimplementation of
> > > > the LZ4 compression format required by parquet-mr. Therefore, LZ4
> > > > compression should be removed from the spec (possibly replaced with
> > > > another enum value for a properly-specified, interoperable, LZ4-backed
> > > > compression scheme).
> > > >
> > > > About LZ4
> > > > ---------
> > > >
> > > > LZ4 is an extremely fast compression format and library.  Decompression
> > > > speeds of 4GB/s can routinely be achieved in pure software, with
> > > > compression speeds around 800 MB/s. Compression ratios are close to
> > > > those achieved by Snappy and LZO, but at vastly higher speeds.
> > > >
> > > > Two formats are officially defined by the LZ4 project: the block format
> > > > and the frame format.  The frame format, as the name suggests, is
> > > > higher-level and contains all the information required to decompress
> > > > arbitrary data buffers.  The block format is to be used when such
> > > > information is made available separately (for example as part of
> > > > ancillary metadata, protocol headers, etc.).
> > > >
> > > > Core issue
> > > > ----------
> > > >
> > > > LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
> > > > seems to be the point of reference to look to when implementing the
> > > > Parquet format) uses a home-baked framing around LZ4 block compression
> > > > that's not specified anywhere. The only way to get information about it
> > > > is to read the Hadoop source code.
> > > >
> > > > Being unspecified, it also doesn't say if there are additional
> > > > constraints on the parameters (e.g. frame size).  Such constraints will
> > > > be implementation-defined, undocumented, and only discoverable through
> > > > tedious testing and iteration, with no guarantee of ever achieving
> > > > 100% compatibility.
> > > >
> > > > Note that LZ4 compression itself officially comes in two formats: a
> > > > low-level block format, a higher-level framed format.  But parquet-mr
> > > > uses a third, custom framing format that's not part of the LZ4 format.
> > > >
> > > > History of compatibility issues
> > > > -------------------------------
> > > >
> > > > 1)
> > > >
> > > > parquet-cpp originally (naively?) assumed that "LZ4 compression" in the
> > > > Parquet spec meant the LZ4 block format.  After all, information
> > > > about data size is already available in the Parquet metadata, so no
> > > > specific framing is theoretically required around the compressed data.
> > > > However, it quickly occurred that this interpretation was incompatible
> > > > with files produced by parquet-mr (and vice-versa: files produced by
> > > > parquet-cpp could not be read with parquet-mr).
> > > >
> > > > A first issue was posted to suggest switching the Parquet spec (and
> > > > parquet-mr) to the LZ4 framed format:
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > >
> > > > However, this didn't come to a resolution, because people didn't want
> > > > to break compatibility with previous versions of parquet-mr (note that
> > > > this concern would necessarily switch the burden of compatibility
> > > > breakage onto other implempentations).
> > > >
> > > > Relatedly, there is an issue open for Hadoop, which also didn't get a
> > > > resolution:
> > > > https://issues.apache.org/jira/browse/HADOOP-12990
> > > >
> > > > 2)
> > > >
> > > > To avoid making things worse, parquet-cpp developers then decided to
> > > > (temporarily?) disable writing LZ4 files from C++:
> > > > https://issues.apache.org/jira/browse/ARROW-9424
> > > >
> > > > (note that this is parquet-cpp deliberately crippling its own feature
> > > > set in order to workaround an issue created by an undocumented format
> > > > implemented in parquet-mr)
> > > >
> > > > At this point, though, the LZ4 *reader* in parquet-cpp could still not
> > > > read files produced by parquet-mr.  So, besides being frustrating for
> > > > C++ users (and users of bindings to the C++ library, e.g. Python,
> > > > Ruby...), this decision did also not make interoperability better in
> > > > the Java -> C++ direction.
> > > >
> > > > 3)
> > > >
> > > > parquet-cpp developers decided to implement a format detection so as to
> > > > read LZ4 files produced by parquet-mr, but also LZ4 files produced by
> > > > previous versions of parquet-cpp.
> > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > > >
> > > > In addition, the write path was reenabled, this time producing files
> > > > that (should!) conform to the parquet-mr implementation of LZ4
> > > > compression.
> > > >
> > > > This seemed to work fine.
> > > >
> > > > 4)
> > > >
> > > > While PARQUET-1878 (see above) seemed to work fine in basic tests, it
> > > > was later reported that some files did not read properly.
> > > >
> > > > Indeed, our (parquet-cpp) compatibility code assumed that a single
> > > > "parquet-mr LZ4 frame" was ever written out for a single data page.
> > > > However, it turns out that parquet-mr can produce several such frames
> > > > for larger data pages (it seems to frame at around 128 kiB boundaries).
> > > >
> > > > While this seems of course reasonable, the format being undocumented
> > > > prevented us from foreseeing this situation.  Once the issue diagnosed,
> > > > we (parquet-cpp developers) pushed a fix for it:
> > > > https://issues.apache.org/jira/browse/ARROW-11301
> > > >
> > > > 5)
> > > >
> > > > We were happy after this later fix... only for a short time.  Now the
> > > > reverse problem happened: some LZ4 files produced by parquet-cpp cannot
> > > > be read by parquet-mr:
> > > >
> > > >  
> > https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168  
> > > >
> > > > I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
> > > > the fact that parquet-mr uses an unspecified framing format with
> > > > unspecified constraints and limits prevents other implementations from
> > > > being compatible.
> > > >
> > > > So I asked the reporter to open a parquet-mr issue instead, and then
> > > > took the liberty of marking the issue as blocker:
> > > > https://issues.apache.org/jira/browse/PARQUET-1974
> > > >
> > > > 6)
> > > >
> > > > Unsurprisingly, other Parquet implementations are also suffering from
> > > > this.
> > > >
> > > > * fastparquet (a Parquet implementation written partly in Python) had
> > > > initially used the LZ4 block format, then switched to the LZ4 frame
> > > > format to achieve compatibility with parquet-cpp (since both are used
> > > > in the Python ecosystem):
> > > > https://github.com/dask/fastparquet/issues/314
> > > >
> > > > Of course, fastparquet probably won't be able to read or write files
> > > > compatible with parquet-mr...
> > > >
> > > > * The Rust Parquet implementation that's part of the Rust Arrow
> > > >   implementation are also having compatibility issues due to making the
> > > >   wrong guess:
> > > > https://issues.apache.org/jira/browse/ARROW-11381
> > > >
> > > > * Impala seems to have had to change their LZ4 implementation to also
> > > >   match the undocumented parquet-mr framing format
> > > > https://issues.apache.org/jira/browse/IMPALA-8617
> > > >
> > > > (I have no idea whether this solved all LZ4 problems for Impala, and
> > > > whether it broke compatibility with previous Impala versions)
> > > >
> > > > Possible solutions
> > > > ------------------
> > > >
> > > > 1) Do nothing.  The Parquet ecosystem is forever fragmented, even
> > > > though the official Parquet documentation doesn't say so.
> > > >
> > > > 2) parquet-cpp finally achieves compatibility through additional
> > > > reverse-engineering efforts.  While I have no authority to prevent
> > > > this, I also have zero personal motivation to achieve it anymore.  I
> > > > suspect other parquet-cpp developers are not significantly more
> > > > motivated than I am (but I may be wrong).
> > > >
> > > > 3) parquet-mr takes its share of the effort by striving to be
> > > > compatible with files produced by (the current iteration of)
> > > > parquet-cpp. I have no idea whether this is easily doable, and whether
> > > > there is any motivation or workforce to do it. Therefore, I will
> > > > conservatively rate this as "unlikely in the short term".
> > > >
> > > > 4) Admitting that LZ4 compatibility currently is not an achievable
> > > > goal, the LZ4 compression option is deprecated and removed from the
> > > > Parquet spec.  Implementations display a warning message when it is
> > > > being read.  Writing LZ4 files is disabled.
> > > >
> > > > Discussion
> > > > ----------
> > > >
> > > > Solution #1 seems evidently undesirable to me (though it also seems
> > > > that nobody on the Java side was tremendously impatient to solve this
> > > > issue).
> > > >
> > > > As I said, solution #2 is rather unlikely.
> > > >
> > > > Java developers would have to evaluate how likely solution #3 is.  We
> > > > should avoid depending on promises that nobody is really willing to
> > > > hold, though.  Please only say "we're going to solve this" if you're
> > > > really willing to push it through.
> > > >
> > > > In any case, even if solution #2 or #3 were to be implemented, it would
> > > > not necessarily help future implementations, as long as the format is
> > > > still unspecified.
> > > >
> > > > In the end, while a bit frustrating for everyone, solution #4 looks  
> > like  
> > > > the only reasonable one for the time being.
> > > >
> > > > Future LZ4 format
> > > > -----------------
> > > >
> > > > While we should deprecate the current LZ4 format, this doesn't preclude
> > > > to add another, this time well-specified, LZ4 format in the future.  It
> > > > would simply have to use a new `CompressionCodec` enum value in the
> > > > Thrift definition.
> > > >
> > > > Most certainly, it should be either of the two formats officially
> > > > defined by the LZ4 project (the block format or the frame format).
> > > >
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >  
> > >  
> >
> >
> >
> >  
>

Re: Request deprecation / removal of LZ4 compression

Posted by Gabor Szadovszky <ga...@apache.org>.

Hi Antoine,

We might do anything in parquet-format, without releasing and adapting it,
it would mean nothing. Also, creating a new parquet-format release only for
deprecating LZ4 makes little sense to me. So, I would suggest adding a
simple comment to the LZ4 enum type about its deprecation and a reference
to the newly supported type for LZ4. We also need the new LZ4 type and a
proper specification for it. It might be a good idea to start a new doc in
the parquet-format repo about the compression codec specifications. We then
reference the related doc from the thrift file comments. After this is
ready we can initiate a parquet-format release and if it's done we can
start adapting it.

My idea about adapting this in parquet-mr is to keep the old LZ4 available
by configuration which default is to use the new one. (Of course, we also
need to implement/get the proper implementation of the new LZ4 codec.)

What do you think?

Cheers,
Gabor

On Sat, Mar 6, 2021 at 4:40 PM Antoine Pitrou <an...@python.org> wrote:

>
> Hello,
>
> We should move this forward.
> What should be the procedure for updating parquet-format?
>
> Best regards
>
> Antoine.
>
>
>
> On Tue, 16 Feb 2021 17:25:49 +0100
> Gabor Szadovszky <ga...@apache.org> wrote:
> > Thank you for the detailed summary of the LZ4 situation, Antoine!
> >
> > The Parquet file format should be properly specified for every
> > implementation. It was the mistake of the parquet-mr developers that we
> > thought the Hadoop implementation of LZ4 is according to the LZ4
> > specification and the fault of the Parquet community that we extended the
> > parquet-format with the LZ4 option without checking that there are 2
> > options to choose (not talking about the 3rd Hadoop one).
> >
> > I agree that option 4 is the only way we can step forward from this
> > situation. We shall deprecate LZ4 in parquet-format and in the mean time
> we
> > should agree on which officially specified LZ4 format do we want to
> > introduce.
> >
> > Of course, we may try to improve compatibility with the existing LZ4
> files
> > but we should not encourage our users to use this compression for now.
> >
> > Because we already have parquet-mr releases that uses the under-specified
> > Haddop LZ4 codec I don't feel it is that urgent to block the current
> > parquet-mr release because of this. We shall update parquet-format to
> make
> > the LZ4 situation clear then create a release. Then, we can start working
> > on deprecating/blocking the write path of the current implementation and
> > implement the properly specified LZ4 support in all the implementations.
> >
> > Regards,
> > Gabor
> >
> > On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <an...@python.org>
> wrote:
> >
> > >
> > > Hello,
> > >
> > > This is a proposal to deprecate and/or remove LZ4 compression from the
> > > Parquet specification.
> > >
> > > Abstract
> > > --------
> > >
> > > Despite several attempts by the parquet-cpp developers, we were not
> > > able to reach the point where LZ4-compressed Parquet files are
> > > bidirectionally compatible between parquet-cpp and parquet-mr. Other
> > > implementations are having, or have had, similar issues.  My conclusion
> > > is that the Parquet spec doesn't allow independent reimplementation of
> > > the LZ4 compression format required by parquet-mr. Therefore, LZ4
> > > compression should be removed from the spec (possibly replaced with
> > > another enum value for a properly-specified, interoperable, LZ4-backed
> > > compression scheme).
> > >
> > > About LZ4
> > > ---------
> > >
> > > LZ4 is an extremely fast compression format and library.  Decompression
> > > speeds of 4GB/s can routinely be achieved in pure software, with
> > > compression speeds around 800 MB/s. Compression ratios are close to
> > > those achieved by Snappy and LZO, but at vastly higher speeds.
> > >
> > > Two formats are officially defined by the LZ4 project: the block format
> > > and the frame format.  The frame format, as the name suggests, is
> > > higher-level and contains all the information required to decompress
> > > arbitrary data buffers.  The block format is to be used when such
> > > information is made available separately (for example as part of
> > > ancillary metadata, protocol headers, etc.).
> > >
> > > Core issue
> > > ----------
> > >
> > > LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
> > > seems to be the point of reference to look to when implementing the
> > > Parquet format) uses a home-baked framing around LZ4 block compression
> > > that's not specified anywhere. The only way to get information about it
> > > is to read the Hadoop source code.
> > >
> > > Being unspecified, it also doesn't say if there are additional
> > > constraints on the parameters (e.g. frame size).  Such constraints will
> > > be implementation-defined, undocumented, and only discoverable through
> > > tedious testing and iteration, with no guarantee of ever achieving
> > > 100% compatibility.
> > >
> > > Note that LZ4 compression itself officially comes in two formats: a
> > > low-level block format, a higher-level framed format.  But parquet-mr
> > > uses a third, custom framing format that's not part of the LZ4 format.
> > >
> > > History of compatibility issues
> > > -------------------------------
> > >
> > > 1)
> > >
> > > parquet-cpp originally (naively?) assumed that "LZ4 compression" in the
> > > Parquet spec meant the LZ4 block format.  After all, information
> > > about data size is already available in the Parquet metadata, so no
> > > specific framing is theoretically required around the compressed data.
> > > However, it quickly occurred that this interpretation was incompatible
> > > with files produced by parquet-mr (and vice-versa: files produced by
> > > parquet-cpp could not be read with parquet-mr).
> > >
> > > A first issue was posted to suggest switching the Parquet spec (and
> > > parquet-mr) to the LZ4 framed format:
> > > https://issues.apache.org/jira/browse/PARQUET-1241
> > >
> > > However, this didn't come to a resolution, because people didn't want
> > > to break compatibility with previous versions of parquet-mr (note that
> > > this concern would necessarily switch the burden of compatibility
> > > breakage onto other implempentations).
> > >
> > > Relatedly, there is an issue open for Hadoop, which also didn't get a
> > > resolution:
> > > https://issues.apache.org/jira/browse/HADOOP-12990
> > >
> > > 2)
> > >
> > > To avoid making things worse, parquet-cpp developers then decided to
> > > (temporarily?) disable writing LZ4 files from C++:
> > > https://issues.apache.org/jira/browse/ARROW-9424
> > >
> > > (note that this is parquet-cpp deliberately crippling its own feature
> > > set in order to workaround an issue created by an undocumented format
> > > implemented in parquet-mr)
> > >
> > > At this point, though, the LZ4 *reader* in parquet-cpp could still not
> > > read files produced by parquet-mr.  So, besides being frustrating for
> > > C++ users (and users of bindings to the C++ library, e.g. Python,
> > > Ruby...), this decision did also not make interoperability better in
> > > the Java -> C++ direction.
> > >
> > > 3)
> > >
> > > parquet-cpp developers decided to implement a format detection so as to
> > > read LZ4 files produced by parquet-mr, but also LZ4 files produced by
> > > previous versions of parquet-cpp.
> > > https://issues.apache.org/jira/browse/PARQUET-1878
> > >
> > > In addition, the write path was reenabled, this time producing files
> > > that (should!) conform to the parquet-mr implementation of LZ4
> > > compression.
> > >
> > > This seemed to work fine.
> > >
> > > 4)
> > >
> > > While PARQUET-1878 (see above) seemed to work fine in basic tests, it
> > > was later reported that some files did not read properly.
> > >
> > > Indeed, our (parquet-cpp) compatibility code assumed that a single
> > > "parquet-mr LZ4 frame" was ever written out for a single data page.
> > > However, it turns out that parquet-mr can produce several such frames
> > > for larger data pages (it seems to frame at around 128 kiB boundaries).
> > >
> > > While this seems of course reasonable, the format being undocumented
> > > prevented us from foreseeing this situation.  Once the issue diagnosed,
> > > we (parquet-cpp developers) pushed a fix for it:
> > > https://issues.apache.org/jira/browse/ARROW-11301
> > >
> > > 5)
> > >
> > > We were happy after this later fix... only for a short time.  Now the
> > > reverse problem happened: some LZ4 files produced by parquet-cpp cannot
> > > be read by parquet-mr:
> > >
> > >
> https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168
> > >
> > > I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
> > > the fact that parquet-mr uses an unspecified framing format with
> > > unspecified constraints and limits prevents other implementations from
> > > being compatible.
> > >
> > > So I asked the reporter to open a parquet-mr issue instead, and then
> > > took the liberty of marking the issue as blocker:
> > > https://issues.apache.org/jira/browse/PARQUET-1974
> > >
> > > 6)
> > >
> > > Unsurprisingly, other Parquet implementations are also suffering from
> > > this.
> > >
> > > * fastparquet (a Parquet implementation written partly in Python) had
> > > initially used the LZ4 block format, then switched to the LZ4 frame
> > > format to achieve compatibility with parquet-cpp (since both are used
> > > in the Python ecosystem):
> > > https://github.com/dask/fastparquet/issues/314
> > >
> > > Of course, fastparquet probably won't be able to read or write files
> > > compatible with parquet-mr...
> > >
> > > * The Rust Parquet implementation that's part of the Rust Arrow
> > >   implementation are also having compatibility issues due to making the
> > >   wrong guess:
> > > https://issues.apache.org/jira/browse/ARROW-11381
> > >
> > > * Impala seems to have had to change their LZ4 implementation to also
> > >   match the undocumented parquet-mr framing format
> > > https://issues.apache.org/jira/browse/IMPALA-8617
> > >
> > > (I have no idea whether this solved all LZ4 problems for Impala, and
> > > whether it broke compatibility with previous Impala versions)
> > >
> > > Possible solutions
> > > ------------------
> > >
> > > 1) Do nothing.  The Parquet ecosystem is forever fragmented, even
> > > though the official Parquet documentation doesn't say so.
> > >
> > > 2) parquet-cpp finally achieves compatibility through additional
> > > reverse-engineering efforts.  While I have no authority to prevent
> > > this, I also have zero personal motivation to achieve it anymore.  I
> > > suspect other parquet-cpp developers are not significantly more
> > > motivated than I am (but I may be wrong).
> > >
> > > 3) parquet-mr takes its share of the effort by striving to be
> > > compatible with files produced by (the current iteration of)
> > > parquet-cpp. I have no idea whether this is easily doable, and whether
> > > there is any motivation or workforce to do it. Therefore, I will
> > > conservatively rate this as "unlikely in the short term".
> > >
> > > 4) Admitting that LZ4 compatibility currently is not an achievable
> > > goal, the LZ4 compression option is deprecated and removed from the
> > > Parquet spec.  Implementations display a warning message when it is
> > > being read.  Writing LZ4 files is disabled.
> > >
> > > Discussion
> > > ----------
> > >
> > > Solution #1 seems evidently undesirable to me (though it also seems
> > > that nobody on the Java side was tremendously impatient to solve this
> > > issue).
> > >
> > > As I said, solution #2 is rather unlikely.
> > >
> > > Java developers would have to evaluate how likely solution #3 is.  We
> > > should avoid depending on promises that nobody is really willing to
> > > hold, though.  Please only say "we're going to solve this" if you're
> > > really willing to push it through.
> > >
> > > In any case, even if solution #2 or #3 were to be implemented, it would
> > > not necessarily help future implementations, as long as the format is
> > > still unspecified.
> > >
> > > In the end, while a bit frustrating for everyone, solution #4 looks
> like
> > > the only reasonable one for the time being.
> > >
> > > Future LZ4 format
> > > -----------------
> > >
> > > While we should deprecate the current LZ4 format, this doesn't preclude
> > > to add another, this time well-specified, LZ4 format in the future.  It
> > > would simply have to use a new `CompressionCodec` enum value in the
> > > Thrift definition.
> > >
> > > Most certainly, it should be either of the two formats officially
> > > defined by the LZ4 project (the block format or the frame format).
> > >
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
>
>
>