Posted to dev@parquet.apache.org by Antoine Pitrou <an...@python.org> on 2021/02/16 14:14:01 UTC

Request deprecation / removal of LZ4 compression

Hello,

This is a proposal to deprecate and/or remove LZ4 compression from the
Parquet specification.

Abstract
--------

Despite several attempts by the parquet-cpp developers, we were not
able to reach the point where LZ4-compressed Parquet files are
bidirectionally compatible between parquet-cpp and parquet-mr. Other
implementations are having, or have had, similar issues.  My conclusion
is that the Parquet spec doesn't allow independent reimplementation of
the LZ4 compression format required by parquet-mr. Therefore, LZ4
compression should be removed from the spec (possibly replaced with
another enum value for a properly-specified, interoperable, LZ4-backed
compression scheme).

About LZ4
---------

LZ4 is an extremely fast compression format and library.  Decompression
speeds of 4 GB/s can routinely be achieved in pure software, with
compression speeds around 800 MB/s. Compression ratios are close to
those achieved by Snappy and LZO, but at vastly higher speeds.

Two formats are officially defined by the LZ4 project: the block format
and the frame format.  The frame format, as the name suggests, is
higher-level and contains all the information required to decompress
arbitrary data buffers.  The block format is to be used when such
information is made available separately (for example as part of
ancillary metadata, protocol headers, etc.).
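
To make the distinction concrete, here is a minimal sketch using the
Python lz4 bindings (the package and API here are an assumption for
illustration; any LZ4 implementation exposes the same two formats).
The frame format round-trips with no side information, while the block
format needs the uncompressed size supplied out of band:

    import lz4.block
    import lz4.frame

    data = b"parquet" * 1000

    # Frame format: self-describing, decompresses with no side information.
    framed = lz4.frame.compress(data)
    assert lz4.frame.decompress(framed) == data

    # Block format: raw compressed bytes only; the uncompressed size must
    # travel separately (in Parquet's case, in the page metadata).
    block = lz4.block.compress(data, store_size=False)
    assert lz4.block.decompress(block, uncompressed_size=len(data)) == data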

Core issue
----------

LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
seems to be the point of reference to look to when implementing the
Parquet format) uses a home-baked framing around LZ4 block compression
that's not specified anywhere. The only way to get information about it
is to read the Hadoop source code.

Since the format is unspecified, there is also no telling whether there
are additional constraints on the parameters (e.g. frame size).  Such
constraints will
be implementation-defined, undocumented, and only discoverable through
tedious testing and iteration, with no guarantee of ever achieving
100% compatibility.

Note that LZ4 compression itself officially comes in two formats: a
low-level block format and a higher-level framed format.  But parquet-mr
uses a third, custom framing format that's not defined by the LZ4 project.
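
For the record, here is a best-guess sketch of that framing as
reverse-engineered in parquet-cpp (see PARQUET-1878 and ARROW-11301
below): a sequence of records, each holding a big-endian uncompressed
length, a big-endian compressed length, and a raw LZ4 block.  This is a
reconstruction from observed behavior, not a specification, which is
exactly the problem:

    import struct
    import lz4.block

    def decompress_hadoop_lz4(buf: bytes, uncompressed_size: int) -> bytes:
        # Decode repeated [uncompressed length: u32 BE][compressed length:
        # u32 BE][raw LZ4 block] records until the expected size is reached.
        out = bytearray()
        pos = 0
        while pos < len(buf) and len(out) < uncompressed_size:
            raw_len, comp_len = struct.unpack_from(">II", buf, pos)
            pos += 8
            out += lz4.block.decompress(buf[pos:pos + comp_len],
                                        uncompressed_size=raw_len)
            pos += comp_len
        return bytes(out)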

History of compatibility issues
-------------------------------

1)

parquet-cpp originally (naively?) assumed that "LZ4 compression" in the
Parquet spec meant the LZ4 block format.  After all, information
about data size is already available in the Parquet metadata, so no
specific framing is theoretically required around the compressed data.
However, it quickly became apparent that this interpretation was incompatible
with files produced by parquet-mr (and vice-versa: files produced by
parquet-cpp could not be read with parquet-mr).

A first issue was posted to suggest switching the Parquet spec (and
parquet-mr) to the LZ4 framed format:
https://issues.apache.org/jira/browse/PARQUET-1241

However, this didn't come to a resolution, because people didn't want
to break compatibility with previous versions of parquet-mr (note that
this concern necessarily shifts the burden of compatibility
breakage onto other implementations).

Relatedly, there is an issue open for Hadoop, which also didn't get a
resolution:
https://issues.apache.org/jira/browse/HADOOP-12990

2)

To avoid making things worse, parquet-cpp developers then decided to
(temporarily?) disable writing LZ4 files from C++:
https://issues.apache.org/jira/browse/ARROW-9424

(note that this is parquet-cpp deliberately crippling its own feature
set in order to work around an issue created by an undocumented format
implemented in parquet-mr)

At this point, though, the LZ4 *reader* in parquet-cpp could still not
read files produced by parquet-mr.  So, besides being frustrating for
C++ users (and users of bindings to the C++ library, e.g. Python,
Ruby...), this decision also did not improve interoperability in
the Java -> C++ direction.

3)

parquet-cpp developers decided to implement format detection so as to
read both LZ4 files produced by parquet-mr and LZ4 files produced by
previous versions of parquet-cpp:
https://issues.apache.org/jira/browse/PARQUET-1878

In addition, the write path was reenabled, this time producing files
that (should!) conform to the parquet-mr implementation of LZ4
compression.

This seemed to work fine.

4)

While PARQUET-1878 (see above) seemed to work fine in basic tests, it
was later reported that some files did not read properly.

Indeed, our (parquet-cpp) compatibility code assumed that only a single
"parquet-mr LZ4 frame" was ever written out for a given data page.
However, it turns out that parquet-mr can produce several such frames
for larger data pages (it seems to frame at around 128 kiB boundaries).

While this is of course reasonable, the format being undocumented
prevented us from foreseeing this situation.  Once the issue was
diagnosed, we (parquet-cpp developers) pushed a fix for it:
https://issues.apache.org/jira/browse/ARROW-11301

5)

We were happy after this latter fix... but only for a short time.  Then
the reverse problem appeared: some LZ4 files produced by parquet-cpp
cannot be read by parquet-mr:
https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168

I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
the fact that parquet-mr uses an unspecified framing format with
unspecified constraints and limits prevents other implementations from
being compatible.

So I asked the reporter to open a parquet-mr issue instead, and then
took the liberty of marking the issue as blocker:
https://issues.apache.org/jira/browse/PARQUET-1974

6)

Unsurprisingly, other Parquet implementations are also suffering from
this.

* fastparquet (a Parquet implementation written partly in Python) had
initially used the LZ4 block format, then switched to the LZ4 frame
format to achieve compatibility with parquet-cpp (since both are used
in the Python ecosystem):
https://github.com/dask/fastparquet/issues/314

Of course, fastparquet probably won't be able to read or write files
compatible with parquet-mr...

* The Rust Parquet implementation that's part of the Rust Arrow
  implementation is also having compatibility issues due to making the
  wrong guess:
https://issues.apache.org/jira/browse/ARROW-11381

* Impala seems to have had to change its LZ4 implementation to also
  match the undocumented parquet-mr framing format:
https://issues.apache.org/jira/browse/IMPALA-8617

(I have no idea whether this solved all LZ4 problems for Impala, and
whether it broke compatibility with previous Impala versions)

Possible solutions
------------------

1) Do nothing.  The Parquet ecosystem is forever fragmented, even
though the official Parquet documentation doesn't say so.

2) parquet-cpp finally achieves compatibility through additional
reverse-engineering efforts.  While I have no authority to prevent
this, I also have zero personal motivation to achieve it anymore.  I
suspect other parquet-cpp developers are not significantly more
motivated than I am (but I may be wrong).

3) parquet-mr takes its share of the effort by striving to be
compatible with files produced by (the current iteration of)
parquet-cpp. I have no idea whether this is easily doable, and whether
there is any motivation or manpower to do it. Therefore, I will
conservatively rate this as "unlikely in the short term".

4) Admitting that LZ4 compatibility currently is not an achievable
goal, the LZ4 compression option is deprecated and removed from the
Parquet spec.  Implementations display a warning message when an
LZ4-compressed file is read.  Writing LZ4 files is disabled.

Discussion
----------

Solution #1 seems evidently undesirable to me (though it also seems
that nobody on the Java side was tremendously impatient to solve this
issue).

As I said, solution #2 is rather unlikely.

Java developers would have to evaluate how likely solution #3 is.  We
should avoid depending on promises that nobody is really willing to
keep, though.  Please only say "we're going to solve this" if you're
really willing to push it through.

In any case, even if solution #2 or #3 were to be implemented, it would
not necessarily help future implementations, as long as the format is
still unspecified.

In the end, while a bit frustrating for everyone, solution #4 looks like
the only reasonable one for the time being.

Future LZ4 format
-----------------

While we should deprecate the current LZ4 format, this doesn't preclude
adding another, this time well-specified, LZ4 format in the future.  It
would simply have to use a new `CompressionCodec` enum value in the
Thrift definition.

It should almost certainly be one of the two formats officially
defined by the LZ4 project (the block format or the frame format).
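
For illustration, a hypothetical sketch of what such an addition could
look like in parquet.thrift (values 0-6 match the existing enum; the
LZ4_RAW name and value are purely illustrative):

    enum CompressionCodec {
      UNCOMPRESSED = 0;
      SNAPPY = 1;
      GZIP = 2;
      LZO = 3;
      BROTLI = 4;
      LZ4 = 5;      // deprecated: undocumented Hadoop-style framing
      ZSTD = 6;
      LZ4_RAW = 7;  // hypothetical: a properly-specified LZ4 format
    }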


Regards

Antoine.



Re: Request deprecation / removal of LZ4 compression

Posted by Micah Kornfield <em...@gmail.com>.
>
> For LZ4 we might choose the 2nd option so the user can still use it if they
> understand the risks (e.g. using only Java-based components, so it should
> work fine). After we have the new properly specified LZ4 available in
> parquet-format and in the implementations we should completely disable (3rd
> option) the write path of the original LZ4.

+1, I think this is the most user-friendly approach.

> I agree that we need more comprehensive integration testing. (Created a
> jira about my ideas: PARQUET-1985
> <https://issues.apache.org/jira/browse/PARQUET-1985>) Meanwhile, it also
> turned out that for features like compression codecs the specification is
> the key. We might always have specific data that we did not include in the
> integration tests and that still fails at runtime.

+1, Both specification and integration tests are important.  I think they
complement each other quite well.


Re: Request deprecation / removal of LZ4 compression

Posted by Gabor Szadovszky <ga...@apache.org>.
When we are talking about deprecating an already released feature (this
time LZ4) we shall always support the read path for existing files. I
think this is a fundamental requirement for every file format. The
deprecation is more about the write path, which has different levels. E.g.
we don't suggest that users use such features, we disable the related feature
by default with an option for the user to enable it, or we disable the
feature entirely so the user is not able to use it anymore.

For LZ4 we might choose the 2nd option so the user can still use it if they
understand the risks (e.g. using only Java-based components, so it should
work fine). After we have the new properly specified LZ4 available in
parquet-format and in the implementations we should completely disable (3rd
option) the write path of the original LZ4.

I agree that we need more comprehensive integration testing. (Created a
jira about my ideas: PARQUET-1985
<https://issues.apache.org/jira/browse/PARQUET-1985>) Meanwhile, it also
turned out that for features like compression codecs the specification is
the key. We might always have specific data that we did not include in the
integration tests and that still fails at runtime.


Re: Request deprecation / removal of LZ4 compression

Posted by Micah Kornfield <em...@gmail.com>.
>
> I think it would be a mistake for someone who has written Hadoop-LZ4 for
> several years with parquet-mr to all of a sudden be no longer able to read
> their files. (I believe that parquet-mr with this pattern has been
> incorporated into various libraries for several years now--correct me if
> I'm wrong.)

I think this aligns with option #4 proposed above?

It seems that either #2 (more reverse engineering on the C++ side) or
#4 would be the way to go.  I'll admit I am also not currently motivated
to reverse engineer this.  I think on the C++ side at least we
should disable writing to avoid further fragmentation.  It also isn't
clear to me whether we have solid data to suggest the other compression
formats are not also broken.

I think this has come up before, but more comprehensive integration testing
is key here (there was another recent bug [1] where some of the pyarrow/C++
written files aren't readable by some Java libraries, so I think there are
potentially more important issues we need to deal with).

[1] https://issues.apache.org/jira/browse/ARROW-11629


Re: Request deprecation / removal of LZ4 compression

Posted by Jacques Nadeau <ja...@apache.org>.
There is some ambiguity in the discussion and proposals here around
deprecating future writing versus supporting reading of already written
data and what it means to deprecate something in the format specification.

I think it would be a mistake for someone who has written Hadoop-LZ4 for
several years with parquet-mr to all of a sudden be no longer able to read
their files. (I believe that parquet-mr with this pattern has been
incorporated into various libraries for several years now--correct me if
I'm wrong.)

On Tue, Feb 16, 2021 at 8:26 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Thank you for the detailed summary of the LZ4 situation, Antoine!
>
> The Parquet file format should be properly specified for every
> implementation. It was a mistake of the parquet-mr developers that we
> thought the Hadoop implementation of LZ4 conforms to the LZ4
> specification, and a fault of the Parquet community that we extended the
> parquet-format with the LZ4 option without checking that there were 2
> options to choose from (not talking about the 3rd, Hadoop one).
>
> I agree that option 4 is the only way we can step forward from this
> situation. We shall deprecate LZ4 in parquet-format and in the meantime we
> should agree on which officially specified LZ4 format we want to
> introduce.
>
> Of course, we may try to improve compatibility with the existing LZ4 files
> but we should not encourage our users to use this compression for now.
>
> Because we already have parquet-mr releases that use the under-specified
> Hadoop LZ4 codec, I don't feel it is that urgent to block the current
> parquet-mr release because of this. We shall update parquet-format to make
> the LZ4 situation clear, then create a release. Then, we can start working
> on deprecating/blocking the write path of the current implementation and
> implement the properly specified LZ4 support in all the implementations.
>
> Regards,
> Gabor
>
> On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Hello,
> >
> > This is a proposal to deprecate and/or remove LZ4 compression from the
> > Parquet specification.
> >
> > Abstract
> > --------
> >
> > Despite several attempts by the parquet-cpp developers, we were not
> > able to reach the point where LZ4-compressed Parquet files are
> > bidirectionally compatible between parquet-cpp and parquet-mr. Other
> > implementations are having, or have had, similar issues.  My conclusion
> > is that the Parquet spec doesn't allow independent reimplementation of
> > the LZ4 compression format required by parquet-mr. Therefore, LZ4
> > compression should be removed from the spec (possibly replaced with
> > another enum value for a properly-specified, interoperable, LZ4-backed
> > compression scheme).
> >
> > About LZ4
> > ---------
> >
> > LZ4 is an extremely fast compression format and library.  Decompression
> > speeds of 4GB/s can routinely be achieved in pure software, with
> > compression speeds around 800 MB/s. Compression ratios are close to
> > those achieved by Snappy and LZO, but at vastly higher speeds.
> >
> > Two formats are officially defined by the LZ4 project: the block format
> > and the frame format.  The frame format, as the name suggests, is
> > higher-level and contains all the information required to decompress
> > arbitrary data buffers.  The block format is to be used when such
> > information is made available separately (for example as part of
> > ancillary metadata, protocol headers, etc.).
> >
> > Core issue
> > ----------
> >
> > LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
> > seems to be the point of reference to look to when implementing the
> > Parquet format) uses a home-baked framing around LZ4 block compression
> > that's not specified anywhere. The only way to get information about it
> > is to read the Hadoop source code.
> >
> > Being unspecified, it also doesn't say if there are additional
> > constraints on the parameters (e.g. frame size).  Such constraints will
> > be implementation-defined, undocumented, and only discoverable through
> > tedious testing and iteration, with no guarantee of ever achieving
> > 100% compatibility.
> >
> > Note that LZ4 compression itself officially comes in two formats: a
> > low-level block format, a higher-level framed format.  But parquet-mr
> > uses a third, custom framing format that's not part of the LZ4 format.
> >
> > History of compatibility issues
> > -------------------------------
> >
> > 1)
> >
> > parquet-cpp originally (naively?) assumed that "LZ4 compression" in the
> > Parquet spec meant the LZ4 block format.  After all, information
> > about data size is already available in the Parquet metadata, so no
> > specific framing is theoretically required around the compressed data.
> > However, it quickly occurred that this interpretation was incompatible
> > with files produced by parquet-mr (and vice-versa: files produced by
> > parquet-cpp could not be read with parquet-mr).
> >
> > A first issue was posted to suggest switching the Parquet spec (and
> > parquet-mr) to the LZ4 framed format:
> > https://issues.apache.org/jira/browse/PARQUET-1241
> >
> > However, this didn't come to a resolution, because people didn't want
> > to break compatibility with previous versions of parquet-mr (note that
> > this concern would necessarily switch the burden of compatibility
> > breakage onto other implempentations).
> >
> > Relatedly, there is an issue open for Hadoop, which also didn't get a
> > resolution:
> > https://issues.apache.org/jira/browse/HADOOP-12990
> >
> > 2)
> >
> > To avoid making things worse, parquet-cpp developers then decided to
> > (temporarily?) disable writing LZ4 files from C++:
> > https://issues.apache.org/jira/browse/ARROW-9424
> >
> > (note that this is parquet-cpp deliberately crippling its own feature
> > set in order to workaround an issue created by an undocumented format
> > implemented in parquet-mr)
> >
> > At this point, though, the LZ4 *reader* in parquet-cpp could still not
> > read files produced by parquet-mr.  So, besides being frustrating for
> > C++ users (and users of bindings to the C++ library, e.g. Python,
> > Ruby...), this decision did also not make interoperability better in
> > the Java -> C++ direction.
> >
> > 3)
> >
> > parquet-cpp developers decided to implement a format detection so as to
> > read LZ4 files produced by parquet-mr, but also LZ4 files produced by
> > previous versions of parquet-cpp.
> > https://issues.apache.org/jira/browse/PARQUET-1878
> >
> > In addition, the write path was reenabled, this time producing files
> > that (should!) conform to the parquet-mr implementation of LZ4
> > compression.
> >
> > This seemed to work fine.
> >
> > 4)
> >
> > While PARQUET-1878 (see above) seemed to work fine in basic tests, it
> > was later reported that some files did not read properly.
> >
> > Indeed, our (parquet-cpp) compatibility code assumed that a single
> > "parquet-mr LZ4 frame" was ever written out for a single data page.
> > However, it turns out that parquet-mr can produce several such frames
> > for larger data pages (it seems to frame at around 128 kiB boundaries).
> >
> > While this seems of course reasonable, the format being undocumented
> > prevented us from foreseeing this situation.  Once the issue diagnosed,
> > we (parquet-cpp developers) pushed a fix for it:
> > https://issues.apache.org/jira/browse/ARROW-11301
> >
> > 5)
> >
> > We were happy after this later fix... only for a short time.  Now the
> > reverse problem happened: some LZ4 files produced by parquet-cpp cannot
> > be read by parquet-mr:
> >
> >
> https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168
> >
> > I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
> > the fact that parquet-mr uses an unspecified framing format with
> > unspecified constraints and limits prevents other implementations from
> > being compatible.
> >
> > So I asked the reporter to open a parquet-mr issue instead, and then
> > took the liberty of marking the issue as blocker:
> > https://issues.apache.org/jira/browse/PARQUET-1974
> >
> > 6)
> >
> > Unsurprisingly, other Parquet implementations are also suffering from
> > this.
> >
> > * fastparquet (a Parquet implementation written partly in Python) had
> > initially used the LZ4 block format, then switched to the LZ4 frame
> > format to achieve compatibility with parquet-cpp (since both are used
> > in the Python ecosystem):
> > https://github.com/dask/fastparquet/issues/314
> >
> > Of course, fastparquet probably won't be able to read or write files
> > compatible with parquet-mr...
> >
> > * The Rust Parquet implementation that's part of the Rust Arrow
> >   implementation are also having compatibility issues due to making the
> >   wrong guess:
> > https://issues.apache.org/jira/browse/ARROW-11381
> >
> > * Impala seems to have had to change their LZ4 implementation to also
> >   match the undocumented parquet-mr framing format
> > https://issues.apache.org/jira/browse/IMPALA-8617
> >
> > (I have no idea whether this solved all LZ4 problems for Impala, and
> > whether it broke compatibility with previous Impala versions)
> >
> > Possible solutions
> > ------------------
> >
> > 1) Do nothing.  The Parquet ecosystem is forever fragmented, even
> > though the official Parquet documentation doesn't say so.
> >
> > 2) parquet-cpp finally achieves compatibility through additional
> > reverse-engineering efforts.  While I have no authority to prevent
> > this, I also have zero personal motivation to achieve it anymore.  I
> > suspect other parquet-cpp developers are not significantly more
> > motivated than I am (but I may be wrong).
> >
> > 3) parquet-mr takes its share of the effort by striving to be
> > compatible with files produced by (the current iteration of)
> > parquet-cpp. I have no idea whether this is easily doable, and whether
> > there is any motivation or workforce to do it. Therefore, I will
> > conservatively rate this as "unlikely in the short term".
> >
> > 4) Admitting that LZ4 compatibility currently is not an achievable
> > goal, the LZ4 compression option is deprecated and removed from the
> > Parquet spec.  Implementations display a warning message when it is
> > being read.  Writing LZ4 files is disabled.
> >
> > Discussion
> > ----------
> >
> > Solution #1 seems evidently undesirable to me (though it also seems
> > that nobody on the Java side was tremendously impatient to solve this
> > issue).
> >
> > As I said, solution #2 is rather unlikely.
> >
> > Java developers would have to evaluate how likely solution #3 is.  We
> > should avoid depending on promises that nobody is really willing to
> > hold, though.  Please only say "we're going to solve this" if you're
> > really willing to push it through.
> >
> > In any case, even if solution #2 or #3 were to be implemented, it would
> > not necessarily help future implementations, as long as the format is
> > still unspecified.
> >
> > In the end, while a bit frustrating for everyone, solution #4 looks like
> > the only reasonable one for the time being.
> >
> > Future LZ4 format
> > -----------------
> >
> > While we should deprecate the current LZ4 format, this doesn't preclude
> > to add another, this time well-specified, LZ4 format in the future.  It
> > would simply have to use a new `CompressionCodec` enum value in the
> > Thrift definition.
> >
> > Most certainly, it should be either of the two formats officially
> > defined by the LZ4 project (the block format or the frame format).
> >
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>

Re: Request deprecation / removal of LZ4 compression

Posted by Antoine Pitrou <an...@python.org>.
Hi Gabor,

I've filed https://issues.apache.org/jira/browse/PARQUET-1996 and
submitted https://github.com/apache/parquet-format/pull/168 for the
format updates.

Hopefully this will get the ball rolling.

Best regards

Antoine.



Re: Request deprecation / removal of LZ4 compression

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Antoine,

Whatever we do in parquet-format, it would mean nothing without
releasing it and adopting it in the implementations. Also, creating a new
parquet-format release only for deprecating LZ4 makes little sense to me.
So, I would suggest adding a simple comment to the LZ4 enum type about
its deprecation, plus a reference to the newly supported LZ4 type. We
also need the new LZ4 type and a proper specification for it. It might be
a good idea to start a new doc in the parquet-format repo about the
compression codec specifications; we would then reference that doc from
the thrift file comments. After this is ready we can initiate a
parquet-format release, and once that is done we can start adopting it.
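
For illustration, the thrift change might look something like the sketch
below. The new codec's name and numeric value are placeholders until we
agree on which LZ4 format to adopt; the existing values are the ones from
parquet.thrift.

enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4;
  // Deprecated: Hadoop-framed LZ4, only specified by the Hadoop sources.
  // Kept so that existing files remain readable; see the codec spec doc.
  LZ4 = 5;
  ZSTD = 6;
  // Placeholder for the properly specified replacement
  // (LZ4 block or frame format, to be decided).
  LZ4_RAW = 7;
}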

My idea about adopting this in parquet-mr is to keep the old LZ4
available via configuration, with the default being the new one. (Of
course, we also need to implement or obtain a proper implementation of
the new LZ4 codec.)
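
A minimal sketch of what I mean, using a hypothetical property name and
assuming the new enum value lands as LZ4_RAW as in the sketch above (the
real key and the wiring into the writers would be decided in review):

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public final class Lz4CodecSelector {
  // Hypothetical flag: when true, keep writing the legacy Hadoop-framed LZ4.
  static final String WRITE_LEGACY_LZ4 = "parquet.compression.lz4.write-legacy";

  static CompressionCodecName select(Configuration conf) {
    // Default is false, i.e. new files use the properly specified codec.
    boolean legacy = conf.getBoolean(WRITE_LEGACY_LZ4, false);
    // LZ4_RAW is the placeholder name from the thrift sketch above.
    return legacy ? CompressionCodecName.LZ4 : CompressionCodecName.LZ4_RAW;
  }
}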

What do you think?

Cheers,
Gabor


Re: Request deprecation / removal of LZ4 compression

Posted by Antoine Pitrou <an...@python.org>.
Hello,

We should move this forward.
What should be the procedure for updating parquet-format?

Best regards

Antoine.




Re: Request deprecation / removal of LZ4 compression

Posted by Gabor Szadovszky <ga...@apache.org>.
Thank you for the detailed summary of the LZ4 situation, Antoine!

The Parquet file format should be properly specified for every
implementation. It was the mistake of the parquet-mr developers that we
assumed the Hadoop implementation of LZ4 follows the LZ4 specification,
and the fault of the Parquet community that we extended parquet-format
with the LZ4 option without checking that there are two official formats
to choose from (not talking about the third, Hadoop-specific one).

I agree that option 4 is the only way we can step forward from this
situation. We shall deprecate LZ4 in parquet-format, and in the meantime
we should agree on which officially specified LZ4 format we want to
introduce.
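
To make the choice concrete, below is a rough, untested sketch of the
two official formats using the lz4-java library (net.jpountz.lz4). Which
library an implementation uses does not matter; the point is that the
block format relies on an externally known uncompressed size, while the
frame format is self-describing.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FrameInputStream;
import net.jpountz.lz4.LZ4FrameOutputStream;

public final class Lz4FormatDemo {
  public static void main(String[] args) throws IOException {
    byte[] page = "example page bytes".getBytes(StandardCharsets.UTF_8);

    // Block format: bare compressed bytes, no header. The decompressor
    // must be told the uncompressed size, which Parquet page headers
    // already carry.
    LZ4Factory lz4 = LZ4Factory.fastestInstance();
    byte[] block = lz4.fastCompressor().compress(page);
    byte[] restored = lz4.fastDecompressor().decompress(block, page.length);

    // Frame format: self-describing stream (magic, flags, checksums),
    // decodable without any out-of-band information.
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (LZ4FrameOutputStream out = new LZ4FrameOutputStream(sink)) {
      out.write(page);
    }
    try (LZ4FrameInputStream in = new LZ4FrameInputStream(
             new ByteArrayInputStream(sink.toByteArray()))) {
      byte[] buf = new byte[page.length];
      int n = in.read(buf); // reads (part of) the decompressed bytes
    }
  }
}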

Of course, we may try to improve compatibility with the existing LZ4
files, but we should not encourage our users to use this compression for
now.

Because we already have parquet-mr releases that use the under-specified
Hadoop LZ4 codec, I don't feel it is that urgent to block the current
parquet-mr release over this. We shall update parquet-format to make the
LZ4 situation clear, then create a release. Then we can start working on
deprecating/blocking the write path of the current implementation and
implementing the properly specified LZ4 support in all the
implementations.

Regards,
Gabor
